Opened 7 years ago

Closed 6 years ago

#6589 closed defect (fixed)

XO stops responding to mesh path request frames

Reported by: jcardona Owned by: dwmw2
Priority: blocker Milestone: 8.2.0 (was Update.2)
Component: wireless Version:
Keywords: Cc: ashishs@…, carrano@…, rbhagwat@…, mbletsas@…, rchokshi@…, mstone, cjb, wad, yani, dsaxena
Blocked By: Blocking:
Deployments affected: Action Needed:
Verified: yes

Description

In this failure one XO stops responding to path request messages.
The capture shows xo:4f:6a trying to discover a path to xo:53:04.
Ping fails, as would be expected.

Communication is established after manually adding an entry to the forwarding table of xo:53:04 (via iwpriv msh0 fwt_add).

The attached capture shows this failure. A summary is pasted below:

    486 4.090182    QuantaMi_11:4f:6a     Broadcast             PREQ
    487 4.090678    QuantaMi_11:4f:6a     Broadcast             PREQ
   (...)
   1669 9.101435    QuantaMi_11:4f:6a     Broadcast            PREQ
   1884 10.100766   QuantaMi_11:4f:6a     Broadcast             PREQ
   1885 10.101887   QuantaMi_11:4f:6a     Broadcast             PREQ
   2828 16.286484   QuantaMi_0d:53:04     Broadcast             PREQ
   2829 16.286984   QuantaMi_0d:53:04     Broadcast             PREQ
   2830 16.287485   QuantaMi_0d:53:04     Broadcast             PREQ
   2831 16.288610   QuantaMi_0d:53:04     Broadcast             PREQ
   2861 16.304364   QuantaMi_11:4f:6a     QuantaMi_0d:53:04     PREP

The manual addition happens at around second 16.

Attachments (4)

mesh-path-discovery-failure.txt (29.9 KB) - added by jcardona 7 years ago.
traffic capture
mesh-fails-then-works.cap (3.1 KB) - added by jcardona 7 years ago.
wireshark dump
xo-itself.pcap (343.9 KB) - added by carrano 7 years ago.
capture from a failing XO
fromsniffer.pcap (805.2 KB) - added by carrano 7 years ago.
simultaneous with xo-itself.pcap, but taken from another sniffer


Change History (28)

Changed 7 years ago by jcardona

traffic capture

Changed 7 years ago by jcardona

wireshark dump

comment:1 Changed 7 years ago by jcardona

  • Component changed from distro to wireless

firmware version: 5.110.22.p1
update.1.693

comment:2 Changed 7 years ago by jcardona

Forgot to mention that resetting the device recovered mesh connectivity. In the particular instance that I observed, the problem came back after a few minutes.

comment:3 follow-up: Changed 7 years ago by ashish

Can someone check whether the XO's firmware actually receives PREQs, by enabling monitor mode?
Does this problem appear on earlier versions of the firmware as well?

comment:4 in reply to: ↑ 3 Changed 7 years ago by carrano

Replying to ashish:

Can someone check whether the XO's firmware actually receives PREQs, by enabling monitor mode?
Does this problem appear on earlier versions of the firmware as well?

It happens with 20.p49 too.

comment:5 Changed 7 years ago by mbletsas

  • Cc "Ashish Shukla" <ashishs@…> Ricardo Carrano <carrano@…> "Rajesh Bhagwat" <rbhagwat@…> "Michail Bletsas" <mbletsas@…> "Ronak Chokshi" <rchokshi@…> added
  • Milestone changed from Never Assigned to Update.1
  • Priority changed from normal to blocker
  • Verified set

comment:6 Changed 7 years ago by mbletsas

  • Cc ashishs@… carrano@… rbhagwat@… mbletsas@… rchokshi@… added; "Ashish Shukla" <ashishs@…> Ricardo Carrano <carrano@…> "Rajesh Bhagwat" <rbhagwat@…> "Michail Bletsas" <mbletsas@…> "Ronak Chokshi" <rchokshi@…> removed

comment:7 Changed 7 years ago by carrano

This seems more general than we thought:

A capture taken from a failing XO (02:2f:59) shows no PREQs addressed to it (while there are many on the air):

(wlan_mgt.fixed.mesh_action == 0x0000) and (wlan_mgt.fixed.da == 00:17:c4:02:2f:59) -> returns nothing

But the capture actually shows no frames addressed to it at all!

(wlan.da == 00:17:c4:02:2f:59) -> returns nothing

OTOH, a capture from a sniffer on another machine shows many frames addressed to the failing node (including PREQs).

Refer to the attached captures (xo-itself.pcap and fromsniffer.pcap).

Changed 7 years ago by carrano

capture from a failing XO

Changed 7 years ago by carrano

simultaneous with xo-itself.pcap, but taken from another sniffer

comment:8 Changed 7 years ago by carrano

The probability of a crash with stock 695 (fw 22p1) in my lab is around 20%.

  • Combos we know to break:

-- stock 695
-- 695 with fw 20p49
-- 695 with fw 22p6

  • Combos that I could _not_ break:

-- stock 653 (with fw 20p47) - 3 different XOs, total of 26 repetitions (the chance of at least one crash would be 99.7%)
-- 653 with fw 22p1 - 3 different XOs, total of 21 repetitions (the chance of at least one crash would be 99.1%)

It is extremely probable, then, that this is somehow related to the driver present in 695.
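
(For reference, the parenthesized percentages above are the chance of seeing at least one crash across that many repetitions, assuming independent trials at the ~20% per-boot rate reported for stock 695. A minimal C sketch of the arithmetic, compile with gcc prob.c -lm:)

    /* chance of at least one crash in n repetitions: 1 - (1 - p)^n */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
            double p = 0.20;            /* per-boot crash probability (stock 695) */
            int reps[] = { 26, 21 };    /* repetitions used in the two 653 tests */
            int i;

            for (i = 0; i < 2; i++)
                    printf("%d repetitions: %.1f%% chance of at least one crash\n",
                           reps[i], 100.0 * (1.0 - pow(1.0 - p, reps[i])));
            return 0;
    }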

comment:9 Changed 7 years ago by cjb

  • Owner changed from ashish to dwmw2

Dave, Michail says the driver is definitely part of the problem here.

comment:10 follow-up: Changed 7 years ago by carrano

In all instances of this bug I've seen, either it happens at start time or it doesn't happen at all. I could never cause it by stressing. Stressing here includes, but is not limited to:

  • Reducing route expiration to 1s and generating heavy traffic
  • Generating saturating traffic.
  • Running day long tests.

Only a reboot changes the status of an XO from non-6589 to 6589 or vice versa.

comment:11 Changed 7 years ago by carrano

A node will (eventually) recover with the following procedure:

  1. rmmod usb8xxx
  2. modprobe usb8xxx
  3. Go to the mesh view, disconnect and reconnect to mesh 1 (clicking on the circle).

A node _cannot_ be broken by the same procedure.

Tried on 3 XOs, 8 times each (the probability of at least one node failing in this experiment would be 99.6%).

(this comment is based on tests with stock build 702)

comment:12 Changed 7 years ago by mstone

  • Cc mstone cjb wad added

comment:13 in reply to: ↑ 10 ; follow-ups: Changed 7 years ago by ashish

Replying to carrano:

In all instances of this bug I've seen, either it happens at start time or it doesn't happen at all. I could never cause it by stressing. Stressing here includes, but is not limited to:

  • Reducing route expiration to 1s and generating heavy traffic
  • Generating saturating traffic.
  • Running day long tests.

Only a reboot changes the status of an XO from non-6589 to 6589 or vice versa.

As per my understanding of the current driver, the probability of this failure during reboot should be very high. The current libertas driver does not seem to wait after the firmware download, and therefore any command issued just after the firmware download will fail if the firmware is not yet initialized.
I don't see this problem with the following changes:
http://dev.laptop.org/git?p=olpc-2.6;a=blob;f=drivers/net/wireless/libertas/if_usb.c;h=78f20baa780a4b302f4f5abb1ba9d507f820d77c;hb=stable

if_usb_probe
/* Upload firmware */
224         if (if_usb_prog_firmware(cardp)) {
225                 lbs_deb_usbd(&udev->dev, "FW upload failed\n");
226                 goto err_prog_firmware;
227         }
228 
229         if (!(priv = lbs_add_card(cardp, &udev->dev)))
230                 goto err_prog_firmware;
231 
232         cardp->priv = priv;
            msleep_interruptible(200); /* delay to make sure firmware is ready */
233         cardp->priv->fw_ready = 1;
234 
235         priv->hw_host_to_card = if_usb_host_to_card;

We will add a firmware-ready event, which can be used by the driver, in a future wireless firmware release to make this more robust.

comment:14 in reply to: ↑ 13 Changed 7 years ago by carrano

As per my understanding of the current driver, the probability of this failure during reboot should be very high.

Yes, I would say (empirically) that this probability lies between 20% and 30% (very high).

comment:15 in reply to: ↑ 13 Changed 7 years ago by dilinger

Replying to ashish:

Replying to carrano:

In all instances of this bug I've seen, either it happens at start time or it doesn't happen at all. I could never cause it by stressing. Stressing here includes, but is not limited to:

  • Reducing route expiration to 1s and generating heavy traffic
  • Generating saturating traffic.
  • Running day long tests.

Only a reboot changes the status of an XO from non-6589 to 6589 or vice versa.

As per my understanding of the current driver, the probability of this failure during reboot should be very high. The current libertas driver does not seem to wait after the firmware download, and therefore any command issued just after the firmware download will fail if the firmware is not yet initialized.
I don't see this problem with the following changes:
http://dev.laptop.org/git?p=olpc-2.6;a=blob;f=drivers/net/wireless/libertas/if_usb.c;h=78f20baa780a4b302f4f5abb1ba9d507f820d77c;hb=stable

if_usb_probe
/* Upload firmware */
224         if (if_usb_prog_firmware(cardp)) {
225                 lbs_deb_usbd(&udev->dev, "FW upload failed\n");
226                 goto err_prog_firmware;
227         }
228 
229         if (!(priv = lbs_add_card(cardp, &udev->dev)))
230                 goto err_prog_firmware;
231 
232         cardp->priv = priv;
            msleep_interruptible(200); /* delay to make sure firmware is ready */

Why add this here? Is the problem that we're not waiting long enough after loading the firmware, or that we're not waiting long enough after initializing the card?

Unless there's a reason why we must wait after initializing the card, I'd rather see the msleep end up in if_usb_prog_firmware().


comment:16 in reply to: ↑ 13 ; follow-up: Changed 7 years ago by dilinger

Stupid trac. Let's try this again...

Replying to ashish:

I don't see this problem with the following changes:

if_usb_probe
/* Upload firmware */
224         if (if_usb_prog_firmware(cardp)) {
225                 lbs_deb_usbd(&udev->dev, "FW upload failed\n");
226                 goto err_prog_firmware;
227         }
228 
229         if (!(priv = lbs_add_card(cardp, &udev->dev)))
230                 goto err_prog_firmware;
231 
232         cardp->priv = priv;
            msleep_interruptible(200);

Why add this here? Is the problem that we're not waiting long enough after loading the firmware, or that we're not waiting long enough after initializing the card?

Unless there's a reason why we must wait after initializing the card, I'd rather see the msleep end up in if_usb_prog_firmware().

comment:17 in reply to: ↑ 16 Changed 7 years ago by ashish

Replying to dilinger:

Stupid trac. Let's try this again...

Replying to ashish:

I don't see this problem with the following changes:

if_usb_probe
/* Upload firmware */
224         if (if_usb_prog_firmware(cardp)) {
225                 lbs_deb_usbd(&udev->dev, "FW upload failed\n");
226                 goto err_prog_firmware;
227         }
228 
229         if (!(priv = lbs_add_card(cardp, &udev->dev)))
230                 goto err_prog_firmware;
231 
232         cardp->priv = priv;
            msleep_interruptible(200);

Why add this here? Is the problem that we're not waiting long enough after loading the firmware,

Yes.

or that we're not waiting long enough after initializing the card?

Unless there's a reason why we must wait after initializing the card, I'd rather see the msleep end up in if_usb_prog_firmware().

comment:18 Changed 7 years ago by carrano

Tests with the suggested patch used in conjunction with firmware release 22p9 (under test at the moment) showed no occurrence of this bug.

I'll keep this open until we have the fix in the kernel and the firmware approved (#6869).

But we have good reason to believe it is fixed.

Root causes were (quoting Ashish):

6589 Issue 1

Driver does not wait long enough after firmware download.

1.) A firmware-ready event (0x30) has been added to detect whether the firmware is up or not.

2.) Alternatively, the driver should wait long enough. In practice it is best to combine the two in the driver, to allow compatibility with old firmware versions that don't send such an event.
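
(For illustration, combining the two could look roughly like the sketch below; the fw_ready flag and fw_wq wait queue are assumed names, not the actual driver fields.)

    /* Wait up to 200ms for the FIRMWARE_READY (0x30) event; old firmware
     * that never sends the event simply runs out the timeout.  Returns
     * the remaining jiffies if fw_ready became true, 0 on timeout. */
    if (!wait_event_timeout(cardp->fw_wq, cardp->fw_ready,
                            msecs_to_jiffies(200)))
            printk(KERN_INFO "libertas: no FIRMWARE_READY event, "
                             "assuming firmware is up\n");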

6589 Issue 2

There was a resource (timer) leak in the mesh stop logic.

Repeated mesh start/stop eventually broke mesh routing, as all the timers used for the mesh were leaked.

As per the driver code I am currently using, it seems that when we change channel, the driver issues
MESH_CONFIG (mesh stop), a channel switch, and MESH_CONFIG (mesh start) in sequence.

Therefore, repeated channel changes (NM may be one such application) exhausted all the timers used for the mesh, and mesh routing failed.

A quick way to get into the 6589 state:
Repeat the following ~10 times and you should see 6589 with the old firmware.

iwconfig eth0 channel 1
iwconfig eth0 channel 11

Properly releasing the timers used for mesh routing during mesh stop fixes the above issue.
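
(For illustration only; the mesh firmware source is not public, so the sketch below is a hypothetical model of the leak pattern rather than the real code: if the stop path never returns the timer armed at start, every MESH_CONFIG stop/start cycle consumes one entry of a fixed pool until path requests go unanswered.)

    #define MESH_TIMER_POOL 8          /* hypothetical fixed pool size */

    static int timers_in_use;

    static int mesh_start(void)
    {
            if (timers_in_use >= MESH_TIMER_POOL)
                    return -1;         /* pool exhausted: the "6589 state" */
            timers_in_use++;           /* arm a route-expiry timer */
            return 0;
    }

    static void mesh_stop(void)
    {
            /* The bug was the absence of this release: without it, roughly
             * ten channel changes (each doing stop + start) drain the pool. */
            timers_in_use--;
    }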

comment:19 follow-up: Changed 7 years ago by dwmw2

We should already be waiting for 200ms after sending the final block of the firmware.
That wait_event_interruptible() will stay asleep until woken either by the if_usb_fw_timeo() function, or by if_usb_receive_fwload() upon receiving the FIRMWARE_READY event.

We reset the timer to 200ms immediately before setting cardp->fwdnldover after sending the final block (again, in if_usb_receive_fwload()), so execution in if_usb_prog_firmware() shouldn't resume sooner.

The only possible exception I see is if it's interrupted by a signal while it's waiting for the event. We should probably change that wait_event_interruptible() to just wait_event().

Or is it that 200ms is not enough, and we just need to be waiting for longer?
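
(A rough sketch of the suggested change; only fwdnldover, if_usb_fw_timeo() and if_usb_receive_fwload() are named above, and the fw_wq wait-queue name is an assumption, not the actual driver field.)

    /* current: a signal delivered while sleeping can wake this up before
     * if_usb_fw_timeo() or the FIRMWARE_READY handler does */
    wait_event_interruptible(cardp->fw_wq, cardp->fwdnldover);

    /* suggested: ignore signals, so we only resume once the 200ms timer
     * fires or if_usb_receive_fwload() sees FIRMWARE_READY */
    wait_event(cardp->fw_wq, cardp->fwdnldover);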

comment:20 in reply to: ↑ 19 Changed 7 years ago by ashish

Replying to dwmw2:

We should already be waiting for 200ms after sending the final block of the firmware.
That wait_event_interruptible() will stay asleep until woken either by the if_usb_fw_timeo() function, or by if_usb_receive_fwload() upon receiving the FIRMWARE_READY event.

We reset the timer to 200ms immediately before setting cardp->fwdnldover after sending the final block (again, in if_usb_receive_fwload()), so execution in if_usb_prog_firmware() shouldn't resume sooner.

The only possible exception I see is if it's interrupted by a signal while it's waiting for the event. We should probably change that wait_event_interruptible() to just wait_event().

Or is it that 200ms is not enough, and we just need to be waiting for longer?

Unfortunately, I missed this part of the code. A 200ms wait is long enough for the case where the driver does not receive the FIRMWARE_READY event. In the current implementation there is a possibility of a firmware bug, as I could see the HW_SPEC (0x3) command echoed back without any change in the command parameters. This might be because as soon as the driver gets FIRMWARE_READY it resumes card initialization and issues the HW_SPEC (0x3) command, and the firmware might have failed to send back a proper response.
Interestingly, enabling the libertas debug log during XO boot-up always gives the expected result (proper MAC address and other information). So adding a small delay (either in the form of the debug log output delay or a fixed 200ms wait; it could probably be much lower, but this was not experimented with) after firmware download somehow prevents this race.
I will update here with more information shortly.

comment:21 Changed 7 years ago by ashish

Firmware version 5.110.22.p10 fixed a race resulting in an incorrect response to the GET_HW_SPEC (0x3) command. The symptoms were FF:FF:FF:FF:FF:FF interface addresses and an all-zero firmware version reported by GET_HW_SPEC.
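
(A hypothetical sanity check, not actual driver code, showing how the bogus echoed response described above could be detected from its symptoms.)

    /* with the buggy firmware, GET_HW_SPEC could be echoed back unchanged,
     * leaving an all-0xFF MAC address and a zero firmware version */
    static int hw_spec_looks_valid(const unsigned char mac[6],
                                   unsigned int fwrelease)
    {
            int i;

            for (i = 0; i < 6; i++)
                    if (mac[i] != 0xff)
                            return fwrelease != 0;
            return 0;   /* all-0xFF MAC: treat the response as invalid */
    }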

comment:22 Changed 6 years ago by yani

  • Cc yani added

comment:23 Changed 6 years ago by dsaxena

  • Cc dsaxena added

comment:24 Changed 6 years ago by carrano

  • Resolution set to fixed
  • Status changed from new to closed

With firmware version 5.110.22.p14 released, this can be closed. It is fixed.
