Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#2621 closed defect (fixed)

wireless suspend/resume failure under high-traffic

Reported by: marcelo Owned by: rchokshi
Priority: high Milestone: Update.1
Component: wireless Version:
Keywords: power Cc: GR-Wireless-OLPC@…, javier@…, david-b@…, jg, marcelo, dilinger
Blocked By: Blocking:
Deployments affected: Action Needed:
Verified: no

Description

Wireless suspend/resume fails during high-traffic.

To reproduce the problem do (on a normal machine):

# ping -f xo-laptop

And then on the laptop:

# echo mem > /sys/power/state

Attachments (3)

BadTrace#1.usb (488.6 KB) - added by rchokshi 7 years ago.
BadTrace#1.usb
BadTrace#2.usb (655.9 KB) - added by rchokshi 7 years ago.
BadTrace#2.usb
GoodTrace#1.usb (9.0 MB) - added by rchokshi 7 years ago.
Good Trace

Change History (41)

comment:1 Changed 7 years ago by jg

  • Milestone changed from Untriaged to Trial-3

comment:2 Changed 7 years ago by rchokshi

  • Cc GR-Wireless-OLPC@… javier@… added

Clarification from Marcelo:

On suspend we:

1) stop all incoming/outcomming traffic on the network device
2) stop all pending URB's (waiting for the ones in progress to finish)

The failure on resume happens after the USB probe has been done,
whenever we try to communicate to the device on endpoint 2.

Luis confirmed that the firmware will not try to send any packets to the
host before it receives some data on endpoint 2, meaning that we don't
suffer from the problem where the firmware sends data before the USB
links are fully established (network traffic is dropped in this case).

The host will only try to send packets after resume is done. The problem
happens _before_ that:

static int if_usb_resume(struct usb_interface *intf)
{
        struct usb_card_rec *cardp = usb_get_intfdata(intf);
        wlan_private *priv = cardp->priv;

        lbs_deb_enter(LBS_DEB_USB);

        cardp->rx_urb_recall = 0;

        if_usb_submit_rx_urb(cardp->priv);

        reestablish_connection(priv);    <------- HERE
        reestablish_connection(priv);
        reestablish_connection(priv);

        netif_device_attach(cardp->eth_dev);
        netif_device_attach(priv->mesh_dev);

comment:3 follow-up: Changed 7 years ago by rchokshi

I think it would be great if you can send a bad trace and a good trace collected from the USB protocol-level analyzer (Ellisys?). This will probably go a long way in explaining the root cause of this issue.

comment:4 in reply to: ↑ 3 Changed 7 years ago by cjb

Ronak,

Replying to rchokshi:

I think it would be great if you can send a bad trace and a good trace collected from the USB protocol-level analyzer (Ellisys?). This will probably go a long way in explaining the root cause of this issue.

Don't you already have a bad trace and a good trace? The trace you received contained several successful resumes followed by the failing resume.

comment:5 Changed 7 years ago by rchokshi

Is this trace taken with a wireless-initiated resume or a host-initiated resume? The trace with the failure is inconclusive as I mentioned in Trac #1752 because there is only one incomplete IN transaction on the bus.

comment:6 Changed 7 years ago by cjb

Hi Ronak,

Is this trace taken with a wireless-initiated resume or a host-initiated resume?

This one's host-initiated; that seems to be sufficient to reproduce the bug.

The trace with the failure is inconclusive as I mentioned in Trac #1752 because there is only one incomplete IN transaction on the bus.

I understand, but I'm not sure what you're asking us for. If our traces aren't useful, doesn't that mean we need to switch to a better analyzer..?

comment:7 Changed 7 years ago by jg

  • Keywords power added

comment:8 follow-up: Changed 7 years ago by jcardona

Reproduced the problem on a B4 with JTAG. With the debugger we could see that the firmware is in a valid state waiting to receive a command on ep2 to resume USB communication. The command is never received because the host controller reports a transaction error. According to the EHCI Spec the only causes for transaction errors are CRC, PID (Packet ID) or Timeout errors.

comment:9 Changed 7 years ago by jcardona

Next step is to configure the host controller to keep retrying forever (according to marcelot, this can be done by setting CERR to zero) and then inspect the USB bus to identify the cause of these low-level errors.

comment:10 Changed 7 years ago by rchokshi

Pasting Richard Smith's email here:

Of the 2 units I modified and brought only 1 worked when I got here.  I spent a bit of time in the marvel lab reworking the splice but the wireless device still does not show up.

The 2nd unit worked fine.  And we used it on the analyzer but it did not really tell us a lot as its hard to duplicate the problem on that unit.

Marvell now has example of a working unit with the USB splice mods and I've discussed with Ronak how to make the mods ever more robust with any of the other units they modify.  They are going to try and fix the broken one I brought and rework the B4 that they previously modified.

Marvell had other B4's and the first one we tried duplicated the problem nicely.  We then had that unit loaded with a wireless JTAG connector. 
We caused the failure and Javier inspected the state of the wireless firmware.

The issues we are seeing do not appear to be the result of the firmware or the kernel doing bad things _directly_.  The firmware appears to be in the proper state waiting for a command from the host that never makes it.  This jives with what we have seen so far on the USB analyzer.
If its a software problem then its really subtle and sneaky.

The error bit on the host controller seems to point to some sort of problem on the USB bus but so far nothing in any USB trace has shown errors.  It's very confusing because the EHCI spec indicates that these are all protocol errors.  Timing, CRC, or Pid.

Today before my flight Ronak and I are going to spend a bit of time with the USB expert here at Marvell and repeat the failure on the B4 with the USB analyzer hooked up to it using Marcelo's infinite USB retry version.

Yesterday we did not get to involve the USB expert because we had no clear failure case until very late in the day.

Now that we have a much clearer idea of what the failure is and how to cause it we hope that we can get better data back from the analyzer.  If
  not then this problem must be looked at by a signal level analyzer and not just a protocol analyzer.  One that can directly tap onto the signals rather than having to go in line.

comment:11 in reply to: ↑ 8 Changed 7 years ago by rchokshi

Replying to jcardona:

Reproduced the problem on a B4 with JTAG. With the debugger we could see that the firmware is in a valid state waiting to receive a command on ep2 to resume USB communication. The command is never received because the host controller reports a transaction error. According to the EHCI Spec the only causes for transaction errors are CRC, PID (Packet ID) or Timeout errors.

This morning, we confirmed that was no error on the USB bus using the LeCroy USB analyzer as well.
So, essentially, although these errors on coming up from the host, there is nothing even close to being fishy that we see on the USB bus. This is the point where we thought that it would be a good idea to involve AMD experts.

comment:12 Changed 7 years ago by rchokshi

  • Cc david-b@… added

Questions from David on software-team@… email distribution:

By "no error" do you mean "traffic that looked fine", or "no traffic, hence no error"? If there was traffic, was it in both directions?

Ans. "traffic that looked fine". It consists of IN and NAK transactions. This means that the USB device is being polled (by IN) and responds with NAK (no data).

comment:13 Changed 7 years ago by jcardona

Worked on exercising suspend/resume on the device side with minimal changes to host controller state. Could trace all the steps that lead to wireless device "hostsleep" state and the transition of the USB stack into 'unconfigured' state.

Found problem with one of the state transitions: if wake up traffic arrives between the time the hostsleep activate command is received and the usb port is suspended that would not stop incoming traffic. This is unlikely to be the cause of #2621 (this would cause an error at a higher layer, not the low level USB error that we are dealing with), but I fixed it and retest. Problem still there.

Marcelot/CosmicPenguin suggested that we use the test mode suspend and resume. We could still see the problem. Tested twice, once without wireless traffic, once while ping-fing from another box. First time failure occurred after 30 successful iterations, the second time after just 3.

Examined the state of the USB stack on the device: USB_STATE was 'configured', no errors. endpoint3 (device to host) was being serviced, endpoint2 was not.

Examined the state of endpoint2... disabled!! This bit under hardware control, the firmware only writes to it to enable it after the endpoint had been configured. The state of the register was different that power-on reset . Endpoint3 was enabled.

Questions:

  1. What could cause this endpoint to be disabled? (ruled out firmware, asked to USB HW engineer from Marvell)
  1. What came first? endpoint disabled then transaction errors on the host or vice versa?

To answer that, I disabled the bit with the debugger and immediately saw:

# [  192.303255] devpath 1 ep2out 3strikes

Just re-enabling the bit does not recover the endpoint.

comment:14 Changed 7 years ago by jcardona

I've narrowed down the problem to where endpoint 2 is initialized after suspend. The configuration bits written to the endpoint configuration register are the same on successful and failed cycles. But on failure the register will not take the configuration and ep2 is left disabled.

comment:15 follow-up: Changed 7 years ago by rchokshi

Chris Ball asked us to try out the following:

Here is a kernel that sets the CERR retry field on the host controller to zero (infinite retries):

http://dev.laptop.org/~cjb/ronak/

The files need to be copied to:

vmlinuz -> /boot
libertas,usb8xxx.ko -> /lib/modules/2.6.22*/kernel/drivers/net/wireless/libertas

.. on top of a build542 image

And this is the result on a B4 XO:

[  705.227002] Stopping tasks ... done.
[  705.313356] olpc_do_sleep!
+r[    0.072886] olpm-pm:  SCI 0x20 received
[    0.078299] olpm-pm:  SCI 0x0 received
[    0.484246] usb usb1: root hub lost power or was reset
[    0.706011] usb_reset_device!
[    0.767093] hub_port_wait_reset: portstatus=503 portchange=10
[    0.897456] hub_port_wait_reset: portstatus=503 portchange=10
[    0.997944] old descriptor:
[    1.001164] bLength: 12
[    1.003988] bDescriptorType: 1
[    1.007604] bcdUSB: 200
[    1.010441] bDeviceClass: 0
[    1.013656] bDeviceSubClass: 0
[    1.017307] bDeviceProtocol: 0
[    1.020824] bMaxPacketSize0: 40
[    1.024533] idVendor: 1286
[    1.027650] idProduct: 2001
[    1.030865] bcdDevice: 3107
[    1.034173] iManufacturer: 1
[    1.037484] iProduct: 2
[    1.040301] iSerialNumber: 0
[    1.043725] bNumConfigurations: 1
[    1.047536] new descriptor:
[    1.050751] bLength: 12
[    1.053670] bDescriptorType: 1
[    1.057180] bcdUSB: 200
[    1.059995] bDeviceClass: 0
[    1.063313] bDeviceSubClass: 0
[    1.066832] bDeviceProtocol: 0
[    1.070342] bMaxPacketSize0: 40
[    1.074052] idVendor: 1286
[    1.090317] idProduct: 2001
[    1.106243] bcdDevice: 3105
[    1.121745] iManufacturer: 1
[    1.136938] iProduct: 2
[    1.151585] iSerialNumber: 0
[    1.166718] bNumConfigurations: 1
[    1.182251] USB_REQ_SET_CONFIGURATION!
[    1.241867] Restarting tasks ... <6>olpc-dcon: The CPU has control
[    1.304709] done.
[    2.356291] Stopping tasks ... done.
[    2.410922] olpm-pm:  SCI 0x20 received
[    2.416905] olpm-pm:  SCI 0x0 received
[    2.445144] olpc_do_sleep!
+r[    0.027585] olpm-pm:  SCI 0x20 received
[    0.032457] olpm-pm:  SCI 0x0 received
[    0.524245] usb usb1: root hub lost power or was reset
[    0.765540] usb_reset_device!
[    0.837485] hub_port_wait_reset: portstatus=503 portchange=10
[    0.976973] hub_port_wait_reset: portstatus=503 portchange=10
[    1.087713] old descriptor:
[    1.103293] bLength: 12
[    1.118433] bDescriptorType: 1
[    1.134225] bcdUSB: 200
[    1.149329] bDeviceClass: 0
[    1.164830] bDeviceSubClass: 0
[    1.180634] bDeviceProtocol: 0
[    1.196441] bMaxPacketSize0: 40
[    1.212367] idVendor: 1286
[    1.227807] idProduct: 2001
[    1.243370] bcdDevice: 3107
[    1.258898] iManufacturer: 1
[    1.274551] iProduct: 2
[    1.289715] iSerialNumber: 0
[    1.305368] bNumConfigurations: 1
[    1.321539] new descriptor:
[    1.337132] bLength: 12
[    1.352331] bDescriptorType: 1
[    1.368224] bcdUSB: 200
[    1.383449] bDeviceClass: 0
[    1.399070] bDeviceSubClass: 0
[    1.414999] bDeviceProtocol: 0
[    1.430955] bMaxPacketSize0: 40
[    1.447039] idVendor: 1286
[    1.462627] idProduct: 2001
[    1.478312] bcdDevice: 3105
[    1.493993] iManufacturer: 1
[    1.509770] iProduct: 2
[    1.525085] iSerialNumber: 0
[    1.540893] bNumConfigurations: 1
[    1.557184] USB_REQ_SET_CONFIGURATION!
[    1.619485] Restarting tasks ... done.
[    1.661182] olpm-pm:  SCI 0x1 received
[    1.678398] olpm-pm:  SCI 0x0 received
[    2.759037] Stopping tasks ... done.
[    2.826470] olpm-pm:  SCI 0x20 received
[    2.831341] olpm-pm:  SCI 0x0 received
[    2.865452] olpc_do_sleep!
+r[    0.027615] olpm-pm:  SCI 0x20 received
[    0.032484] olpm-pm:  SCI 0x0 received
[    0.444212] usb usb1: root hub lost power or was reset
[    0.685507] usb_reset_device!
[    0.756965] hub_port_wait_reset: portstatus=503 portchange=10
[    0.896952] hub_port_wait_reset: portstatus=503 portchange=10
[    1.008199] old descriptor:
[    1.023381] bLength: 12
[    1.038153] bDescriptorType: 1
[    1.053581] bcdUSB: 200
[    1.068349] bDeviceClass: 0
[    1.083575] bDeviceSubClass: 0
[    1.099105] bDeviceProtocol: 0
[    1.114545] bMaxPacketSize0: 40
[    1.130045] idVendor: 1286
[    1.145118] idProduct: 2001
[    1.160316] bcdDevice: 3107
[    1.175416] iManufacturer: 1
[    1.190611] iProduct: 2
[    1.205318] iSerialNumber: 0
[    1.220422] bNumConfigurations: 1
[    1.236013] new descriptor:
[    1.251088] bLength: 12
[    1.265764] bDescriptorType: 1
[    1.281111] bcdUSB: 200
[    1.296027] bDeviceClass: 0
[    1.311164] bDeviceSubClass: 0
[    1.326635] bDeviceProtocol: 0
[    1.342012] bMaxPacketSize0: 40
[    1.357485] idVendor: 1286
[    1.372524] idProduct: 2001
[    1.387661] bcdDevice: 3105
[    1.402762] iManufacturer: 1
[    1.417957] iProduct: 2
[    1.432663] iSerialNumber: 0
[    1.447800] bNumConfigurations: 1
[    1.463328] USB_REQ_SET_CONFIGURATION!
[    1.526758] Restarting tasks ... <6>olpc-dcon: The CPU has control
[    1.578253] olpm-pm:  SCI 0x1 received
[    1.596346] olpm-pm:  SCI 0x0 received
[    1.625658] done.
[    2.666470] Stopping tasks ... done.
[    2.719982] olpm-pm:  SCI 0x20 received
[    2.724855] olpm-pm:  SCI 0x0 received
[    2.765401] olpc_do_sleep!
+r[    0.027614] olpm-pm:  SCI 0x20 received
[    0.032487] olpm-pm:  SCI 0x0 received
[    0.324603] usb usb1: root hub lost power or was reset
[    0.565410] usb_reset_device!
[    0.637242] hub_port_wait_reset: portstatus=503 portchange=10
[    0.777356] hub_port_wait_reset: portstatus=503 portchange=10
[    0.888102] old descriptor:
[    0.903742] bLength: 12
[    0.918942] bDescriptorType: 1
[    0.934796] bcdUSB: 200
[    0.950052] bDeviceClass: 0
[    0.965766] bDeviceSubClass: 0
[    0.981693] bDeviceProtocol: 0
[    0.997530] bMaxPacketSize0: 40
[    1.013518] idVendor: 1286
[    1.029110] idProduct: 2001
[    1.044763] bcdDevice: 3107
[    1.060448] iManufacturer: 1
[    1.076190] iProduct: 2
[    1.091385] iSerialNumber: 0
[    1.107007] bNumConfigurations: 1
[    1.123209] new descriptor:
[    1.138831] bLength: 12
[    1.154058] bDescriptorType: 1
[    1.170015] bcdUSB: 200
[    1.185332] bDeviceClass: 0
[    1.201105] bDeviceSubClass: 0
[    1.217126] bDeviceProtocol: 0
[    1.232991] bMaxPacketSize0: 40
[    1.249012] idVendor: 1286
[    1.264635] idProduct: 2001
[    1.280317] bcdDevice: 3105
[    1.295937] iManufacturer: 1
[    1.311684] iProduct: 2
[    1.326876] iSerialNumber: 0
[    1.342410] bNumConfigurations: 1
[    1.358060] USB_REQ_SET_CONFIGURATION!

comment:16 in reply to: ↑ 15 ; follow-up: Changed 7 years ago by cjb

Hi Ronak,

Replying to rchokshi:

Chris Ball asked us to try out the following:

Here is a kernel that sets the CERR retry field on the host controller to zero (infinite retries):

It looks like the test didn't crash, whici probably means you weren't running a pingflood to the host. You need to start "ping -f <ip>" before suspending. The idea was to cause the crash, and then see what happens differently on the bus when cerr is zero; i.e. the command is retried infinitely.

comment:17 in reply to: ↑ 16 Changed 7 years ago by jcardona

Replying to cjb:

The idea was to cause the crash, and then see what happens differently on the bus
when cerr is zero; i.e. the command is retried infinitely.

We now know that after the failure, EP2 is disabled. The XactErr is probably the consequence of the endpoint being disabled, and not the cause (see question 2 in my previous comment). My guess is that with CERR=0, we'll just see token packets from the host and no handshake packets from the device on endpoint 2. We need to answer the following question:

Under which conditions the device disables EP2?

So far we just know that it is the USB controller in the device side that is disabling the endpoint and not the firmware. That may be a response to certain events on the bus. Captures that show the moment in which the device stops responding to token packets on endpoint 2 could give us a hint.

Changed 7 years ago by rchokshi

BadTrace#1.usb

Changed 7 years ago by rchokshi

BadTrace#2.usb

comment:18 Changed 7 years ago by rchokshi

I think the attached captures will prove to be quite useful. I just captured it this afternoon. There are two of them. There is a STALL after the SET_CONFIGURATION command at timestamp 2.0740 1155.

comment:19 Changed 7 years ago by jcardona

Notes on last captures

BadTrace1.usb

xfer 8 is the last xfer on ep2 after resume.
On the last transaction of the xfer device sends a NAK (packet 7387) then there are no more tokens from the host on that endpoint. Shouldn't the host retry?

BadTrace2.usb

xfer 7 is the (first and) last xfer on ep2 after resume
There are no handshake tokens from the device in response to packet 5933.
This is probably an instance of the failure where the device is unable to re-enable endpoint2 after resume, as described here.

BTW LeCroy reports it as an isochronous transfer ?! (it should be bulk)
But I'm guessing that the analyzer marks that transfer as isochronous because it sees no handshake packet from the device after packet 5933.

Changed 7 years ago by rchokshi

Good Trace

comment:20 Changed 7 years ago by rchokshi

Per Jordan's request, I am attaching a good trace.

comment:21 Changed 7 years ago by jcardona

Under certain error conditions the USB stack on the device side does not return a handshake packet. We have added code to better handle those error conditions. With each fix we've provided a private build of the firmware to all those working on the problem.

We have seen the frequency of the failures gradually drop, which indicates that there may not be a single root cause but several independent problems with the same observable symptoms. At the time of this writing several teams (cjb, cozybit, Marvell-Pune) are trying to reproduce this problem on the latest debug build. Only cjb has been able to reproduce the issue but not reliably.

comment:22 Changed 7 years ago by cjb

Here's a crash from one of the 1cc machines. B4, playground HEAD, usb8388-17p2-with-resume-fixes-and-usb-error-flags-in-dbg-beacons.bin.

[    2.608578] PM: Preparing system for mem sleep                               
[    2.613175] Stopping tasks ... done.                                         
[    2.627360] platform olpc-battery.0: suspend                                 
[    2.631912] ac97 0-0:AD1888: suspend                                         
[    2.635564] ov7670 1-0042: suspend                                           
[    2.639095] platform dcon: suspend                                           
[    2.642662] OLPC-DCON 0-000d: suspend                                        
[    2.677081] olpc-dcon: The DCON has control                                  
[    2.681412] psmouse serio1: suspend                                          
[    2.685008] atkbd serio0: suspend                                            
[    2.688387] i8042 i8042: suspend                                             
[    2.691827] hub 2-0:1.0: suspend                                             
[    2.695124] usb usb2: suspend, may wakeup                                    
[    2.699250] hub 2-0:1.0: hub_suspend                                         
[    2.703000] usb usb2: bus suspend                                            
[    2.706417] ohci_hcd 0000:00:0f.4: suspend root hub                          
[    2.711523] usb8xxx 1-1:1.0: suspend                                         
[    2.715139] usb 1-1: suspend                                                 
[    2.721576] hub 1-0:1.0: suspend                                             
[    2.724821] usb usb1: suspend, may wakeup                                    
[    2.728951] hub 1-0:1.0: hub_suspend                                         
[    2.732543] usb usb1: bus suspend                                            
[    2.735964] ehci_hcd 0000:00:0f.5: suspend root hub                          
[    2.741672] serial8250 serial8250: suspend                                   
[    2.745884] ehci_hcd 0000:00:0f.5: port 1 remote wakeup                      
[    2.751217] platform vesafb.0: suspend                                       
[    2.755067] platform pcspkr: suspend                                         
[    2.758658] serial8250 serial8250.0: suspend                                 
[    2.763041] ehci_hcd 0000:00:0f.5: suspend                                   
[    2.777461] ehci_hcd 0000:00:0f.5: --> PCI D3                                
[    2.781848] ohci_hcd 0000:00:0f.4: suspend                                   
[    2.793081] ohci_hcd 0000:00:0f.4: --> PCI D3                                
[    2.797459] cs5535audio 0000:00:0f.3: suspend                                
[    2.802279] pci 0000:00:0f.0: suspend                                        
[    2.805971] cafe1000-ccic 0000:00:0c.2: suspend                              
[    2.811674] sdhci 0000:00:0c.1: suspend                                      
[    2.817491] CAF�É NAND 0000:00:0c.0: suspend                                 
[    2.821829] Geode LX AES 0000:00:01.2: suspend                               
[    2.826410] lxfb 0000:00:01.1: suspend                                       
[    2.830177] pci 0000:00:01.0: suspend                                        
[    2.833978] platform lid: suspend, may wakeup                                
[    2.838349] platform powerbutton: suspend, may wakeup                        
[    2.843510] rtc_cmos rtc_cmos: suspend, may wakeup                           
[    2.848320] PM: Entering mem sleep                                           
[    2.851840] platform olpc-battery.0: LATE suspend                            
[    2.856565] platform dcon: LATE suspend                                      
[    2.860419] serial8250 serial8250: LATE suspend                              
[    2.864975] platform vesafb.0: LATE suspend                                  
[    2.869167] platform pcspkr: LATE suspend                                    
[    2.873187] serial8250 serial8250.0: LATE suspend                            
[    2.877902] cs5535audio 0000:00:0f.3: LATE suspend                           
[    2.882707] pci 0000:00:0f.0: LATE suspend                                   
[    2.886816] cafe1000-ccic 0000:00:0c.2: LATE suspend                         
[    2.891787] sdhci 0000:00:0c.1: LATE suspend                                 
[    2.896068] CAF�É NAND 0000:00:0c.0: LATE suspend                            
[    2.900781] Geode LX AES 0000:00:01.2: LATE suspend                          
[    2.905668] pci 0000:00:01.0: LATE suspend                                   
[    2.909774] platform lid: LATE suspend, may wakeup                           
[    2.914574] platform powerbutton: LATE suspend, may wakeup                   
[    2.920067] rtc_cmos rtc_cmos: LATE suspend, may wakeup                      
[    2.925602] olpc_do_sleep!                                                   
+r[    0.022691] rtc_cmos rtc_cmos: EARLY resume                                
[    0.026897] platform powerbutton: EARLY resume                               
[    0.031347] platform lid: EARLY resume                                       
[    0.035109] pci 0000:00:01.0: EARLY resume                                   
[    0.039219] lxfb 0000:00:01.1: EARLY resume                                  
[    0.043418] Geode LX AES 0000:00:01.2: EARLY resume                          
[    0.048305] CAF�É NAND 0000:00:0c.0: EARLY resume                            
[    0.053018] sdhci 0000:00:0c.1: EARLY resume                                 
[    0.057299] cafe1000-ccic 0000:00:0c.2: EARLY resume                         
[    0.062271] pci 0000:00:0f.0: EARLY resume                                   
[    0.066377] cs5535audio 0000:00:0f.3: EARLY resume                           
[    0.071179] ohci_hcd 0000:00:0f.4: EARLY resume                              
[    0.075718] ehci_hcd 0000:00:0f.5: EARLY resume                              
[    0.080260] serial8250 serial8250.0: EARLY resume                            
[    0.084979] platform pcspkr: EARLY resume                                    
[    0.088999] platform vesafb.0: EARLY resume                                  
[    0.093212] serial8250 serial8250: EARLY resume                              
[    0.097754] i8042 i8042: EARLY resume                                        
[    0.101433] platform dcon: EARLY resume                                      
[    0.105286] platform olpc-battery.0: EARLY resume                            
[    0.111226] olpm-pm:  SCI 0x20 received                                      
[    0.116087] olpm-pm:  SCI 0x0 received                                       
[    0.119856] PM: Finishing wakeup.                                            
[    0.123193] rtc_cmos rtc_cmos: resuming                                      
[    0.127044] platform powerbutton: resuming                                   
[    0.131148] platform lid: resuming                                           
[    0.134563] pci 0000:00:01.0: resuming                                       
[    0.138342] lxfb 0000:00:01.1: resuming                                      
[    0.142333] Geode LX AES 0000:00:01.2: resuming                              
[    0.146897] CAF�É NAND 0000:00:0c.0: resuming                                
[    0.151272] sdhci 0000:00:0c.1: resuming                                     
[    0.368209] PM: Writing back config space on device 0000:00:0c.1 at offset f (was 8080100, writing 808010b)                                                  
[    0.378066] PM: Writing back config space on device 0000:00:0c.1 at offset 3 (was 800000, writing 802000)                                                    
[    0.387748] PM: Writing back config space on device 0000:00:0c.1 at offset 1 (was 2b00002, writing 2b00006)                                                  
[    0.397648] cafe1000-ccic 0000:00:0c.2: resuming                             
[    0.402299] PM: Writing back config space on device 0000:00:0c.2 at offset f (was 8080100, writing 808010b)                                                  
[    0.416215] PM: Writing back config space on device 0000:00:0c.2 at offset 3 (was 800000, writing 802000)                                                    
[    0.427906] PM: Writing back config space on device 0000:00:0c.2 at offset 1 (was 2b00002, writing 2b00006)                                                  
[    0.442800] pci 0000:00:0f.0: resuming                                       
[    0.446688] cs5535audio 0000:00:0f.3: resuming                               
[    0.451167] PCI: Setting latency timer of device 0000:00:0f.3 to 64          
[    0.461889] ohci_hcd 0000:00:0f.4: resuming                                  
[    0.466200] ohci_hcd 0000:00:0f.4: PCI D0, from previous PCI D3              
[    0.473147] PCI: Setting latency timer of device 0000:00:0f.4 to 64          
[    0.479449] ohci_hcd 0000:00:0f.4: powerup ports                             
[    0.517646] ehci_hcd 0000:00:0f.5: resuming                                  
[    0.521844] ehci_hcd 0000:00:0f.5: PCI D0, from previous PCI D3              
[    0.534653] PCI: Setting latency timer of device 0000:00:0f.5 to 64          
[    0.540956] ehci_hcd 0000:00:0f.5: lost power, restarting                    
[    0.546450] usb usb1: root hub lost power or was reset                       
[    0.551614] ehci_hcd 0000:00:0f.5: reset command 080002 (park)=0 ithresh=8 period=1024 Reset HALT                                                            
[    0.560616] PCI: cache line size of 32 is not supported by device 0000:00:0f.5                                                                               
[    0.567931] ehci_hcd 0000:00:0f.5: ...powerdown ports...                     
[    0.597705] ehci_hcd 0000:00:0f.5: ...powerup ports...                       
[    0.627980] serial8250 serial8250.0: resuming                                
[    0.632370] platform pcspkr: resuming                                        
[    0.636152] platform vesafb.0: resuming                                      
[    0.640070] serial8250 serial8250: resuming                                  
[    0.644357] usb usb1: resuming                                               
[    0.647429] usb usb1: usb resume                                             
[    0.650700] ehci_hcd 0000:00:0f.5: resume root hub after power loss          
[    0.708028] hub 1-0:1.0: enabling power on all ports                         
[    0.815925] hub 1-0:1.0: resuming                                            
[    0.819260] usb 1-1: resuming                                                
[    0.822239] usb 1-1: usb reset-resume                                        
[    0.826007] usb_reset_device!                                                
[    0.886907] ehci_hcd 0000:00:0f.5: port 1 high speed                         
[    0.891900] ehci_hcd 0000:00:0f.5: GetStatus port 1 status 001005 POWER sig=se0 PE CONNECT                                                                   
[    0.900274] hub_port_wait_reset: portstatus=503 portchange=10                
[    0.956997] usb 1-1: reset high speed USB device using ehci_hcd and address 2
[    5.964936] usb 1-1: bash timed out on ep0in len=0/64   

Does this looks like our bug? Marcelo says:

16:13 <marcelot> what happened is that the device did not respond to GetConfig

I'm trying to capture a beacon log, but I don't seem to be seeing the beacons, just Assoc Requests from other laptops. Javier, could you come on IRC and give me a hand, please?

comment:24 Changed 7 years ago by jcardona

I've prepared a debug build that captures the suspend/resume execution trace and adds it to the debug beacons. Details here:

https://cozybit1.dnsalias.org/~javier/logs/suspend_resume_debug_beacon.txt

With that build cjb was able to reproduce the crash, and on the failed machine we see that the USB events were processed out of order. In particular, we see a usb port reset processed before usb resume. On suspend endpoint 0 is disabled and on reset it's re-enabled. If those two events get reversed, we would see a failure on endpoint 0 on the next transaction (which, after a reset, is Get Description).

comment:25 Changed 7 years ago by jcardona

There's only one isr in the firmware for all the USB interrupts, and it processes a port reset event before processing a suspend. If a suspend interrupt is not acknowledged before a port reset happens, the isr will process both conditions out of order, resulting in leaving ep0 disabled when it should not be.

I've corrected that and prepared another build ( https://cozybit1.dnsalias.org/~javier/logs/suspend_resume_debug_beacon.txt )

I've also added a new debug flag to detect that condition. If we ever see the flag USB_ERROR_SUSPEND_RESET_AT_ONCE we know that the condition existed. If we do not see the flag, and we see the same crash, with the same sequence (reset then suspend) it would indicate that the wireless module does receive those events in that order, and we'd have to correct that.

comment:26 Changed 7 years ago by jcardona

All the fixes implemented on the wireless firmware for this problem have been merged into release 5.110.17.p5 except for the USB_ERROR_SUSPEND_RESET_AT_ONCE described in the previous entry.

Although we cannot claim that the problem is entirely resolved, the fixes have dramatically reduced the frequency of this problem. To the point that even cjb is seeing ticket #1835 before seeing this problem.

The plan is to retest for #2621 once #1835 is resolved.

comment:27 Changed 7 years ago by jg

  • Cc jg added

comment:28 Changed 7 years ago by wad

Just noting that this is probably the same problem as #1752.

In order to reproduce this problem, start with two laptops: one under test (the test laptop) and one triggering it with a ping flood (the trigger).

On both laptops, from a root console, type:

init 3
service NetworkManager stop
iwconfig eth0 essid testing mode ad-hoc
iwconfig eth0 channel 11

On the test laptop, type:

ifconfig msh0 192.168.4.50

On the trigger laptop, type:

ifconfig msh0 192.168.4.51

Now, to start the test, on the test laptop type:

count=0
while true; do ((count=count+1)); echo $count; echo mem > /sys/power/state; done

And on the trigger laptop, type:

ping 192.168.4.50 &
ping -f 192.168.4.50

It has been suggested that also pinging the trigger laptop from the test laptop will make the problem worse.

For convenience, I have these commands in scripts which I can simply wget onto the laptops. On the test laptop:

wget http://xs-dev.laptop.org/wad/setup_fping1
wget http://xs-dev.laptop.org/wad/run_fping1
chmod a+x *fping1
./setup_fping1
./run_fping1

And on the trigger laptop:

wget http://xs-dev.laptop.org/wad/setup_fping2
chmod a+x *fping2
./setup_fping2
ping 192.168.4.50 &
ping -f 192.168.4.50

comment:29 Changed 7 years ago by jcardona

We have been trying to reproduce this problem on one of the C1's ECO'd for #1835 that Chris sent us this week. So far we have not been able to reproduce. We do see lots of USB failures (urb status -71 and -2 in the kernel logs) but the wireless stays operational and responsive after thousands of iterations.

First Overnight Test

In this test the DUT was pinging a non-existent mesh node. This should put some stress on the wireless internal buffers, as unresolved frames get queued up and discarded. The test run for 3K+ suspend iterations and failed due to a kernel panic.

INIT: PANIC: segmentation violation at 0x804a200 (code)! sleeping for 30 seconds

Full log here: here

Notes on this log:

  1. The PANIC does not show up on that log, as the serial console died after the PANIC).
  2. The frequent 'E' messages are generated by flood ping when it fails to send a frame to the device. Pinging a non-existent mesh node will do that.

Second Overnight Test

This time the DUT was pinging another mesh node, in the same way that wad describes in the previous comment. The test went on for over 10K iterations. At some point it looks like the suspend script stops. There were many USB errors reported on the kernel logs (~2K '-71' errors and ~4K '-2' errors, but after all those iterations the wireless was operational and responding to host commands. The wireless device had not been reset, nor the firmware reloaded.

Full log available here (120 MB!). Last iteration here.

comment:30 Changed 7 years ago by cjb

  • Cc marcelo dilinger added

INIT: PANIC: segmentation violation at 0x804a200 (code)! sleeping for 30 seconds

This isn't a kernel panic. Marcelo/Andres, what could cause this?

At some point it looks like the suspend script stops.

It looks like your suspend script segfaulted, which is a similar outcome to the first test. I've never seen this.

comment:31 Changed 7 years ago by wad

I'm seeing a "new" behavior here at home (a pretty low network traffic area): the machine randomly starts ignoring pings and stays asleep. Pressing a keyboard key or the touchpad wakes it up and it continues to run (going back into suspend mode). To restart the ping/resume sequence, the machine has to be awake for one ping (which gets replied to). It may have to be manually woken up several times before this happens, depending on the ping interval.

Strictly speaking, this should be posted against #1752, as I was only pinging the machine once every five seconds (and in a very quiet WiFi neighborhood). Running build 581 on both test and trigger laptops. The test laptop was plugged in and had been ECOd for bug #1835.

comment:32 Changed 7 years ago by cjb

I'm seeing a "new" behavior here at home (a pretty low network traffic area): the machine randomly starts ignoring pings and stays asleep

Sorry, Wad, this is my fault -- I should have told you that you need to add a static ARP entry for the DUT on the pinging host, to avoid the ARP cache entry expiring and leaving you no way to wake it up remotely.

It's not clear to me why the entry expires, when it's being seen once every few seconds.

The process is:

# arp -n
<IP>  <MAC>
# arp -s <IP> <MAC>

comment:33 Changed 7 years ago by kimquirk

  • Milestone changed from Untriaged to Trial-3

comment:34 Changed 7 years ago by kimquirk

  • Milestone changed from Trial-3 to First Deployment, V1.0
  • Priority changed from blocker to high

This is unclear if it is hardware or software. We are upgrading all the test machines to the latest wireless firmware: 19.p0 to see if this helps.

comment:35 Changed 7 years ago by jcardona

  • Resolution set to fixed
  • Status changed from new to closed

According from wad, this problem was fixed in EC firmware, included in OFW version q2c28 and later.

comment:37 Changed 7 years ago by sell

http://www.finishup.cn H?r f?rs?kte allts? pejlaren f? mig att "erk?nna" genom http://www.chinawordpress.net.cn hota med b?ter och r?tteg?ng. Hon f?rs?kte ?ven f? mig att erk?nna att jag har en http://www.chinaword.net.cn DVD-spelare kan emot tevesignaler. Visst finns det kanske DVD-spelare med tevemottagare men en som sitter i datorn? Lite komiskt. Videoapparater har oftast tevemottagare s? hade jag sagt ja d?r hade jag f?rmodligen varit illa ute http://www.tatto.com.cn Vid en r?tteg?ng st?r pejlarens ord mot ditt eftsom pejlaren ?r en tj?nsteman. Det vill s?ga att om pejlaren ser teven s? kan hon s?ga detta i r?tten och f?modligen vinna ocks?. Men om hon varken h?r eller ser teven s? har http://www.treadmilll.cn

Note: See TracTickets for help on using tickets.