Ticket #4131 (closed defect: invalid)

Opened 7 years ago

Last modified 4 years ago

[firmware] WLAN appears to die after some number of suspend/resumes

Reported by: wad Owned by: wad
Priority: blocker Milestone: 8.2.0 (was Update.2)
Component: wireless Version:
Keywords: XO-1, libertas Cc:
Action Needed: never set Verified: no
Deployments affected: Blocked By:
Blocking:

Description

When running the ping test for #1752, we see units that stop responding to pings (i.e. stop resuming), but that respond fine if the unit is woken manually (e.g. with the power button). We have managed to stop the suspend/resume cycling on some machines without crashing the kernel (a difficult feat!), and the system thinks that it has a fully operational mesh interface: it still shows up in iwconfig and ifconfig. Attempts to ping a laptop in this state, or to ping another mesh node from it, fail with 100% packet loss.

This problem has been seen on at least four different machines, typically after running between 10,000 and 20,000 cycles of suspend/resume. All machines were running build 581, with OFW q2c28.

This may be related to a similar problem reported in ticket #3738
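
For scripted runs of the ping test, the "100% packet loss" symptom described above can be detected from the ping summary line. The following is a minimal sketch, not the actual test harness used for #1752: it assumes an iputils-style summary line ("N packets transmitted, M received, ... X% packet loss"), and the function names are hypothetical.

```python
import re

# Assumption: iputils-style summary, e.g.
# "5 packets transmitted, 0 received, 100% packet loss, time 4006ms"
SUMMARY_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received"
    r".*?(\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summary(output):
    """Return (sent, received, loss_percent), or None if no summary line is found."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


def looks_dead(output):
    """True if pings were sent but none were answered."""
    summary = parse_ping_summary(output)
    return summary is not None and summary[0] > 0 and summary[1] == 0
```

A unit that answers no pings but still lists its interface in iwconfig and ifconfig would match the state described in this ticket.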

Change History

follow-up: ↓ 4   Changed 7 years ago by wad

All the machines this was seen on were modified to have C2 circuitry. If testing for this bug using a B4/C1 laptop, other problems (such as #1835) are likely to dominate, obscuring the search for this condition.

  Changed 7 years ago by rchokshi

  • cc gr-wireless-olpc@… added

Which firmware version are you using? v5.110.18.p2?

follow-up: ↓ 5   Changed 7 years ago by wad

Don't know. It has size 118288 bytes, with an MD5 sum of:

2a52fe929fd725d0577bc4f719b279f6
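
The size and MD5 fingerprint quoted above can be reproduced for any firmware image with a few lines of Python. This is only a sketch: the function name is hypothetical, and the location of the firmware file on the laptop is not stated in this ticket, so the path must be supplied by the caller.

```python
import hashlib


def firmware_fingerprint(path):
    """Return (size_in_bytes, md5_hex_digest) for the file at `path`."""
    md5 = hashlib.md5()
    size = 0
    with open(path, "rb") as image:
        # Read in chunks so the whole image never has to sit in memory.
        for chunk in iter(lambda: image.read(65536), b""):
            md5.update(chunk)
            size += len(chunk)
    return size, md5.hexdigest()
```

Comparing the returned pair against a list of known releases identifies the build even when the version string is not at hand.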

in reply to: ↑ 1   Changed 7 years ago by rsmith

Replying to wad:

iwconfig and ifconfig. Attempts to ping a laptop in this state, or to ping another mesh node from it fail w. 100% packet loss.

Try to duplicate this on a laptop with an EC connector loaded so we can see if the EC is actually issuing SCIs.

in reply to: ↑ 3   Changed 7 years ago by jcardona

Replying to wad:

Don't know. It has size 118288 bytes, with an MD5 sum of: 2a52fe929fd725d0577bc4f719b279f6

MD5 (usb8388-17p2-with-resume-fixes.bin) = 2a52fe929fd725d0577bc4f719b279f6

This was a private build based on 17.p2. Since then there have been two new releases. In particular, 17.p5 fixes a bug that may be related to this:

According to the wireless firmware release notes:

FW release W8388-5.110.17.p5
(...)
Bug Fixes
----------
(...)
(2) Fixed buffer corruption on suspend resume.

  Changed 7 years ago by kimquirk

These machines are running 5.110.17.p3.

follow-up: ↓ 8   Changed 7 years ago by jcardona

  • owner changed from rchokshi to jcardona
  • status changed from new to assigned

The following was observed on one of the devices under test that failed:

  1. dut does not wake up via wireless.
  2. dut does wake up via power button.
  3. while the dut is awake it cannot ping other nodes in infrastructure mode.
  4. sniffer reveals that ARP packets are being transmitted, but no response is received.
  5. after populating the arp table manually, ICMP echo requests start being transmitted (dut MAC: 00:17:c4:05:2d:fa):
sniff@sniff:~$ sudo tcpdump -i ath1 -s 1500 -e | grep -i SA:00:17:c4:05:2d:fa

15:35:38.328464 DA:00:13:10:7d:b5:fb (oui Unknown) SA:00:17:c4:05:2d:fa (oui Unknown) BSSID:02:2f:af:93:ef:6b (oui Unknown) LLC, dsap SNAP (0xaa), ssap SNAP (0xaa), cmd 0x03: oui Ethernet (0x000000), ethertype IPv4 (0x0800): 192.168.5.62 > 192.168.5.2: ICMP echo request, id 19724, seq 514, length 64
  6. sniffer reveals no reply is transmitted to the dut:
sniff@sniff:~$ sudo tcpdump -i ath1 -s 1500 -e | grep -i RA:00:17:c4:05:2d:fa

15:37:23.590191 RA:00:17:c4:05:2d:fa (oui Unknown) Acknowledgment
<only 802.11 acknowledgments>
  7. sniffer reveals that the AP (b5:fb), instead of sending data to the dut, is sending it to another AP (b5:fd). Lazy-WDS can cause routing loops, and this appears to be what we see here:
sniff@sniff:~$ sudo tcpdump -i ath1 -s 1500 -e | grep -i DA:00:17:c4:05:2d:fa

15:29:50.253762 RA:00:17:c4:05:2c:e3 (oui Unknown) TA:00:13:10:7d:b5:fd (oui Unknown) DA:00:17:c4:05:2d:fa (oui Unknown) SA:00:13:10:7d:b5:fb (oui Unknown) LLC, dsap Unknown (0x10), ssap Unknown (0xfa), cmd 0xaaaa: Information, send seq 85, rcv seq 85, Flags [Command, Poll], length 94

This is certainly one problem. If this is the only problem causing the symptoms described in this ticket, pinging over the mesh should work (the description of the ticket says it does not).

Can someone verify?

Suggestion: Can't we blacklist this AP? Or provide a firmware image for it (this model's firmware can be replaced by OpenWRT) with Lazy-WDS disabled? The Lazy-WDS "feature" on the WRT54G is not based on any standard, is not documented, and has well-known weaknesses.
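
The misdirected-frame symptom in the last capture above (a frame whose DA is the dut but whose RA is some other station) can be checked mechanically instead of by eye. The sketch below assumes the address layout printed by `tcpdump -e` in the captures quoted in this comment; the function names are hypothetical and it is not part of any existing tool.

```python
import re

# Address fields as printed by `tcpdump -e` in the captures above,
# e.g. "RA:00:17:c4:05:2c:e3" or "DA:00:17:c4:05:2d:fa".
ADDR_RE = re.compile(
    r"\b(RA|TA|DA|SA|BSSID):((?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})"
)


def parse_addresses(line):
    """Map each address role found in a capture line to its lowercase MAC."""
    return {role: mac.lower() for role, mac in ADDR_RE.findall(line)}


def is_misdirected(line, dut_mac):
    """True if the frame is destined (DA) for the dut but the station
    asked to receive it over the air (RA) is a different one."""
    addrs = parse_addresses(line)
    dut = dut_mac.lower()
    return addrs.get("DA") == dut and "RA" in addrs and addrs["RA"] != dut
```

Feeding each line of a saved capture through `is_misdirected` with the dut MAC would flag frames like the 15:29:50 one, while plain acknowledgments and the dut's own echo requests are ignored.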

in reply to: ↑ 7   Changed 7 years ago by jcardona

Replying to jcardona:

This is certainly one problem. If this is the only problem causing the symptoms described in this ticket, pinging over the mesh should work (the description of the ticket says it does not). Can someone verify?

Ronak did. He was able to ping another mesh point from an xo that exhibited the symptoms of this ticket.

  Changed 7 years ago by jcardona

  • owner changed from jcardona to wad
  • status changed from assigned to new

Please, wad, close this ticket if the explanations above are in line with your observations.

  Changed 7 years ago by rchokshi

Wad tried to replicate this problem again in the afternoon. This time, there was no way to break the suspend/resume cycle and get to the command prompt on the affected XO, so I couldn't test whether ping works on the mesh interface. We will have to take a look at this again tomorrow with an XO (or a couple of XOs if possible) with JTAG connected to it.

  Changed 7 years ago by kimquirk

  • keywords mesh, relnote added; mesh removed
  • status changed from new to closed
  • resolution set to wontfix

This bug needs to be documented in the release notes, covering what might happen and how to recover if your XO(s) hang.

In a private (G1G1) setting, you need to reboot your access point(s) to recover.

In a school server setting where Linksys access points are present, all the XOs could go down. You would have to reboot all the Linksys APs in the school, and/or we can recommend downloading firmware that can configure WDS on the Linksys (see http://wiki.openwrt.org/OpenWrtDocs/Configuration#head-aba1228974499bb5dcaffdb2c3d45b07bcab2013).

  Changed 6 years ago by cjb

  • status changed from closed to reopened
  • resolution deleted

We've just seen this on several of our pre-MP testbed machines, after as few as 1000 cycles. The unit doesn't wake up via wireless but does wake up via the power button; the interface is present and iwpriv eth0 fwt_time increments, but the unit cannot ping in infra mode.

Reopening. I will try to leave the machine up for as long as possible so that Javier can gain debugging information from it if he'd like to.

  Changed 6 years ago by cjb

(Using 19.p0 plus Javier's beacons-in-infra-mode changes.)

  Changed 6 years ago by cjb

Actually, it happened on every single machine (there are eight) that we started four hours ago. They're all in the same state.

  Changed 6 years ago by jcardona

The instance of this problem that I observed last week was caused by the AP not sending traffic to the xo's. If you send me the MAC addresses of the nodes that are not waking up I could check if this is again the same problem.

Alternatively you could try to wake up the nodes by sending mesh traffic (which bypasses the AP). If that wakes the node up, we would know that this is the same problem described above.

follow-up: ↓ 18   Changed 6 years ago by cjb

Hi Javier,

We're in China, so you don't have sniffer access. We think the AP was sending traffic to the XOs, because other XOs in the room were resuming, and because these XOs started to suspend/resume successfully again once we rebooted them. Also, when we woke them up by hand, we tried to ping the AP, and couldn't get any response from it despite it still being there.

Could you let me know how to send mesh traffic to the nodes, please?

I left one machine turned on and still in the bad state. Is there anything we can do on it for you?

  Changed 6 years ago by wad

Not quite true, I can sniff with my Mac. But we do need to know your spell for sending mesh traffic to a node (that has been configured manually in managed mode.)

in reply to: ↑ 16   Changed 6 years ago by jcardona

Replying to cjb:

Hi Javier, We're in China, so you don't have sniffer access.

You can capture traffic with any xo.

We think the AP was sending traffic to the XOs, because other XOs in the room were resuming, and because these XOs started to suspend/resume successfully again once we rebooted them.

What I had observed before is that the AP was in the bad state, not the xo. In particular, traffic addressed to one particular xo was sent to a different one. Other nodes were unaffected. That seems to be consistent with your observation. Are you still using the Linksys WRT54G?

Also, when we woke them up by hand, we tried to ping the AP, and couldn't get any response from it despite it still being there.

If the AP is in the bad state that I described above, some nodes should still be able to ping it.

Could you let me know how to send mesh traffic to the nodes, please?

1. Associate the source xo (Y) with the same AP that the suspended xo (X) is associated with and bring up the mesh interface. The association is just to be sure that both are on the same channel: the AP will not be used to send traffic.

  # iwconfig eth0 mode managed essid <your_ssid>
  # ifconfig msh0 192.168.5.Y

2. Manually populate the arp table on the ping source (Y) with the address of the suspended xo (X):

  # arp -s <192.168.5.X> <xo_X_mac>

3. Ping

  # ping 192.168.5.X
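
The three steps above can be collected into a small helper so they are run the same way on every source laptop. This is a sketch under stated assumptions: the function names, the `-c 3` ping count, and the dry-run default are illustrative additions, and the static ARP entry is assumed to map X's address to X's own MAC. The interface names eth0 and msh0 are the ones used in the steps.

```python
import subprocess


def mesh_ping_commands(ssid, source_ip, target_ip, target_mac,
                       wlan_if="eth0", mesh_if="msh0"):
    """Build the command sequence from steps 1-3 as argument lists."""
    return [
        # Step 1: associate (only to land on the right channel) and
        # bring up the mesh interface with the source address.
        ["iwconfig", wlan_if, "mode", "managed", "essid", ssid],
        ["ifconfig", mesh_if, source_ip],
        # Step 2: static ARP entry for the suspended xo (assumed: its own MAC).
        ["arp", "-s", target_ip, target_mac],
        # Step 3: ping; "-c 3" is an illustrative choice, not from the ticket.
        ["ping", "-c", "3", target_ip],
    ]


def run_commands(commands, dry_run=True):
    """Print each command; execute it (as root) only when dry_run is False."""
    for command in commands:
        print("#", " ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

With the default `dry_run=True` the helper only prints the commands, which makes it easy to review them before running anything as root on the source xo.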

I left one machine turned on and still in the bad state. Is there anything we can do on it for you?

Wake it up manually and try to ping other nodes over mesh. Does that work? Can you sniff from another xo and see if any traffic is transmitted?

follow-up: ↓ 20   Changed 6 years ago by dwmw2

  • cc dwmw2 added
  • summary changed from WLAN appears to die after some number of suspend/resumes to [firmware] WLAN appears to die after some number of suspend/resumes

I suspect this is now fixed?

in reply to: ↑ 19   Changed 6 years ago by mbletsas

Replying to dwmw2:

I suspect this is now fixed?

This is for John to decide.

  Changed 6 years ago by gregorio

  • keywords mesh added; mesh, relnote removed
  • next_action set to never set

Came across this one when reviewing bugs with relnote in the keyword field.

Is this fixed? If so please close it. No release note planned on it right now.

Thanks,

Greg S

  Changed 4 years ago by wad

  • cc javier@…, mbletsas, wad, gr-wireless-olpc@…, dwmw2 removed
  • keywords XO-1, libertas added; mesh removed
  • status changed from reopened to closed
  • resolution set to invalid

Javier's theory for what was going on here seems to be the best.

I still suspect that the WLAN on XO-1 appears to die after some number of suspend/resumes, but it should be opened as a new bug if seen with current software.
