Ticket #7458 (assigned defect)

Opened 6 years ago

Last modified 4 years ago

Intermittent lockup during WOL suspend/resume stress

Reported by: dsaxena Owned by: dsaxena
Priority: blocker Milestone: 9.1.0-cancelled
Component: not assigned Version: Development build as of this date
Keywords: blocks+:9.1.0 Cc: mstone, dilinger, cjb, jcardona, rsmith, pgf, mbletsas, dwmw2
Action Needed: diagnose Verified: no
Deployments affected: Blocked By:
Blocking:

Description

Email form cjb:

As a heads up, I've seen a hang on resume (or suspend?) with latest testing. It looks like a scheduling problem in libertas -- dmesg overflows such that I can't see what caused the hang, but that seems to be what the rest of it's trying to say.

Currently trying to reproduce and gather some more data. Javier, we might need some help from your end once we figure out what's happening.

Attachments

kernel.log Download (0.8 MB) - added by jcardona 6 years ago.
dmesg output
2.6.26-suspend-resume-test-log Download (9.6 MB) - added by dsaxena 6 years ago.
Suspend/resume test log on 2.6.26
crash.log Download (57.4 KB) - added by rsmith 6 years ago.
oops2.log Download (172.2 KB) - added by rsmith 6 years ago.

Change History

  Changed 6 years ago by dsaxena

  • keywords changed from joyride-2131:- to joyride-2131:-
  • next_action changed from never set to diagnose
  • blocking 7393 added
  • milestone changed from Never Assigned to 8.2.0 (was Update.2)

follow-up: ↓ 3   Changed 6 years ago by dsaxena

  • priority changed from normal to blocker

OK, was able to reproduce again after 290 S/R cycles. Only useful piece of data I got was as follows:

Jul  9 18:23:41 xo-0C-FF-AA kernel: [ 3753.036994] devpath 1 ep3in 3strikes
Jul  9 18:23:41 xo-0C-FF-AA kernel: [ 3753.036994] devpath 1 ep2out 3strikes
Jul  9 18:23:41 xo-0C-FF-AA kernel: [ 3753.036994] usb8xxx: URB in failure status: -71
Jul  9 18:23:41 xo-0C-FF-AA kernel: [ 3753.036994] Restarting tasks ... done.

cjb, you had mentioned 3strikes when I was in town a few weeks ago. Do you or dilinger have summary of what it means to hit this condition?

in reply to: ↑ 2 ; follow-up: ↓ 4   Changed 6 years ago by dsaxena

Replying to dsaxena:

OK, was able to reproduce again after 290 S/R cycles. Only useful piece of data I got was as follows: {{{ Jul 9 18:23:41 xo-0C-FF-AA kernel: [ 3753.036994] devpath 1 ep3in 3strikes Jul 9 18:23:41 xo-0C-FF-AA kernel: [ 3753.036994] devpath 1 ep2out 3strikes Jul 9 18:23:41 xo-0C-FF-AA kernel: [ 3753.036994] usb8xxx: URB in failure status: -71 Jul 9 18:23:41 xo-0C-FF-AA kernel: [ 3753.036994] Restarting tasks ... done. }}}

The system did not in fact lockup. It came back up out of resume, but not immediately. Back to trying to gather more data.

cjb, you had mentioned 3strikes when I was in town a few weeks ago. Do you or dilinger have summary of what it means to hit this condition?

in reply to: ↑ 3   Changed 6 years ago by cjb

Replying to dsaxena:

cjb, you had mentioned 3strikes when I was in town a few weeks ago. Do you or dilinger have summary of what it means to hit this condition?

It's raised by the host controller. From #2621:

The command is never received because the host controller reports a transaction error. According to the EHCI Spec the only causes for transaction errors are CRC, PID (Packet ID) or Timeout errors.

This isn't the scheduling bug that I saw last night, so we're in bad shape.

follow-up: ↓ 6   Changed 6 years ago by cjb

The system did not in fact lockup. It came back up out of resume, but not immediately. Back to trying to gather more data.

That's something, I guess. Was the network interface present? (When we detect this situation, I think we force a reset of the device.)

in reply to: ↑ 5 ; follow-up: ↓ 7   Changed 6 years ago by dsaxena

Replying to cjb:

The system did not in fact lockup. It came back up out of resume, but not immediately. Back to trying to gather more data.

That's something, I guess. Was the network interface present? (When we detect this situation, I think we force a reset of the device.)

Yes, the interface was up.

I've been running a suspend/resume loop test for the last 12 hours now, with about 1900 cycles so far with no lockup again, so still no real data to look at. :(

in reply to: ↑ 6   Changed 6 years ago by cjb

Replying to dsaxena:

Yes, the interface was up. I've been running a suspend/resume loop test for the last 12 hours now, with about 1900 cycles so far with no lockup again, so still no real data to look at. :(

That's interesting. Are you associated to anything while you suspend/resume? Maybe worth connecting to an AP first, and also opening some TCP connections first.

  Changed 6 years ago by andrey

Hello. We left an XO suspending/resuming all night. Build 2136, with the network interface associated. To suspend:

# while true; do echo mem > /sys/power/state; done

and meanwhile to resume, we have another PC ping the XO continuously, causing a wake on LAN. It did suspend/resume all night (the serial terminal is filled with "+r" messages) but we can't get back to the kernel, so I believe we've reproduced this. I'll try to set up the experiment to save some debug information.

  Changed 6 years ago by cjb

Great, thanks Andrey. I suggest booting with "no_console_suspend" kernel arg (/versions/boot/current/boot/olpc.fth) and running:

# echo 9 > /proc/sys/kernel/printk

before starting the suspends. Thanks!

  Changed 6 years ago by dsaxena

Replying to cjb:

Replying to dsaxena:

Yes, the interface was up. I've been running a suspend/resume loop test for the last 12 hours now, with about 1900 cycles so far with no lockup again, so still no real data to look at. :(

That's interesting. Are you associated to anything while you suspend/resume? Maybe worth connecting to an AP first, and also opening some TCP connections first.

I was running something similar to what Andrey ran with a remote machine pinging the XO for wakeup. Only difference was that i put a small small sleep between each wakeup/sleep and logged some debug info. That delay may have been enough to get around this issue since it might be timing related.

  Changed 6 years ago by dsaxena

Andrey, have you gotten any more information on this? For some reason I can't get no_console_suspend to work nicely for me over here so don't have much more data. I did run another overnight test and saw the 3strikes error again.

  Changed 6 years ago by jcardona

Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.088069] libertas host: CMD_RESP: response 0x8045, seq 142, size 8, jiffies 282254

Firmware response to the CMD_802_11_HOST_SLEEP_ACTIVATE command: USB endpoints are now disabled.

Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.088105] libertas thread: main-thread 111: intcounter=0 currenttxskb=00000000 dnld_sent=0
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.088132] libertas thread: main-thread sleeping... Conn=0 IntC=0 PS_mode=0 PS_State=0
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.088550] ehci_bus_suspend: USBCMD: 0x10009
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.088568] ehci_bus_suspend: USBSTS: 0x8
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.088585] ehci_bus_suspend: USBCMD: 0x10009
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.088601] ehci_bus_suspend: USBSTS: 0x8
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.088926] ehci_bus_suspend: USBCMD: 0x10008
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.088942] ehci_bus_suspend: USBSTS: 0x1008
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.110373] ehci_pci_suspend: USBCMD: 0x10008
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.110373] ehci_pci_suspend: USBSTS: 0x1008
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.146903] olpc-ec:  running cmd 0x1c
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.149947] olpc-ec:  received 0xff
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.149947] olpc-ec:  running cmd 0x1b
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.149947] olpc-ec:  sending cmd arg 0xbb
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.149947] olpc-ec:  running cmd 0x32
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.149947] olpc-ec:  timeout waiting for EC to read command!

The last command seems to have failed. We should see a 'olpc_do_sleep!' message and we do not.

Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.149947] olpc-ec:  running cmd 0x84
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.149947] olpc-ec:  received 0x20
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.150039] olpc-pm:  SCI 0x20 received
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.150059] olpc-ec:  running cmd 0x84
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.159947] olpc-ec:  received 0x0
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.160013] olpc-pm:  SCI 0x0 received

We never see a olpc-ec: running cmd 0x24 which wakes up the firmware via GPIO, so the USB endpoints are not restored.

Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.176786] PM: Writing back config space on device 0000:00:0c.1 at offset 1 (was 2b00002, writing 2b00006)
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.176916] PM: Writing back config space on device 0000:00:0c.2 at offset 1 (was 2b00002, writing 2b00006)
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.196629] PCI: Setting latency timer of device 0000:00:0f.3 to 64
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.208007] PCI: Setting latency timer of device 0000:00:0f.4 to 64
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.262198] PCI: Setting latency timer of device 0000:00:0f.5 to 64
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.489425] libertas host: PREP_CMD: command 0x001f
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.489448] libertas host: QUEUE_CMD: inserted command 0x001f into cmdpendingq
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.489487] libertas thread: main-thread 222 (waking up): intcounter=0 currenttxskb=00000000 dnld_sent=0
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.489516] libertas thread: main-thread 333: intcounter=0 currenttxskb=00000000 dnld_sent=0
Jul 11 18:42:53 xo-14-68-7A kernel: [ 3173.489555] devpath 1 ep3in 3strikes

... and so the command fails to be transmitted.

Changed 6 years ago by jcardona

dmesg output

  Changed 6 years ago by jcardona

So, to summarize, this looks like a problem in transitioning into power save and not wireless a wireless issue.

follow-up: ↓ 15   Changed 6 years ago by dsaxena

  • cc rsmith added

Was able to capture another log and see something very similar, though not exactly the same has Javier.

[ 1341.014520] olpc_do_sleep!
[ 1341.014520] olpc-ec:  running cmd 0x34
[ 1341.014520] olpc-ec:  running cmd 0x24
[ 1341.014520] olpc-ec:  running cmd 0x24
[ 1341.014520] olpc-ec:  timeout waiting for EC to read command!

In this case we ran 0x24 but the second time it fails. The system comes out of suspend and then goes to sleep again, but we wake up immediately:

[ 1343.091325] olpc-ec:  running cmd 0x1c
[ 1343.094392] olpc-ec:  received 0xff
[ 1343.094392] olpc-ec:  running cmd 0x1b
[ 1343.094392] olpc-ec:  sending cmd arg 0xbb
[ 1343.094392] olpc-ec:  timeout waiting for EC to quiesce!
[ 1343.094392] olpc-ec:  running cmd 0x84
[ 1343.094392] olpc-ec:  received 0x20
[ 1343.094467] olpc-pm:  SCI 0x20 received
[ 1343.094484] olpc-ec:  running cmd 0x84
[ 1343.104392] olpc-ec:  received 0x0
[ 1343.104471] olpc-pm:  SCI 0x0 received
...
[ 1344.271923] devpath 1 ep2out 3strikes
[ 1344.271923] usb8xxx: URB in failure status: -71
[ 1344.271923] devpath 1 ep3in 3strikes
[ 1344.282012] PM: Finishing wakeup.
[ 1344.285342] Restarting tasks ... done.
# Go back to sleep 
[ 1344.344508] PM: Preparing system for mem sleep
[ 1344.352450] Freezing user space processes ... (elapsed 0.00 seconds) done.
[ 1344.361923] Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
[ 1344.365272] PM: Entering mem sleep
[ 1344.368684] Suspending console(s)

At this point the system just died. Chris, we still haven't reproduced the case you saw, correct?

I'm adding Richard to this as it appears to be EC related.

in reply to: ↑ 14 ; follow-up: ↓ 17   Changed 6 years ago by rsmith

Replying to dsaxena:

Was able to capture another log and see something very similar, though not exactly the same has Javier. {{{ [ 1341.014520] olpc_do_sleep! [ 1341.014520] olpc-ec: running cmd 0x34 [ 1341.014520] olpc-ec: running cmd 0x24 [ 1341.014520] olpc-ec: running cmd 0x24 [ 1341.014520] olpc-ec: timeout waiting for EC to read command! }}} In this case we ran 0x24 but the second time it fails. The system comes out of suspend and then goes to sleep again, but we wake up immediately:

something is funky with the above trace. It shows zero time elapsing between the all 3 ec commands and the timeout which is highly unlikely.

> [ 1343.091325] olpc-ec:  running cmd 0x1c
> [ 1343.094392] olpc-ec:  received 0xff
> [ 1343.094392] olpc-ec:  running cmd 0x1b
> [ 1343.094392] olpc-ec:  sending cmd arg 0xbb
> [ 1343.094392] olpc-ec:  timeout waiting for EC to quiesce!
> [ 1343.094392] olpc-ec:  running cmd 0x84
> [ 1343.094392] olpc-ec:  received 0x20

Ditto here. Yet it claims to have received data.

I'm adding Richard to this as it appears to be EC related.

If we can reproduce this on a somewhat reliable basis then I need to dup it on a machine that has an EC serial port loaded so I can see what gong on in the EC.

Also there an adjustable knob for the EC timeout you can try increasing that and see if the timeout goes away.

  Changed 6 years ago by dsaxena

Ugh, this is getting ugly.

I ran three test cycles with EC timeout set to 40 ms and I got hangs, but with different behaviour than the ones above.

Once the system continued to respond to pings after 101 S/R cycles but keyboard, serial console, and mouse input all did not work and system did not suspend again. The kernel was obviously still running, but it looks like userland was spinning on something.

Second time, after 2361 S/R cycles, the system seems to have spontaneously rebooted while I was gone. The console was working but the screen was not on. I hit the power button and the system went into suspend and now cannot get out of suspend as OHM puts the system right back to sleep upon wakeup from any source.

Third time after 532 cycles, I saw the "+r" from OFW but nothing from kernel.

None of these cases had EC timeouts or 3strikes in the logs. As soon as I set the timeout back to 20ms, I saw this again after 13 cycles, so it appears that there are two (or more?) separate suspend/resume lockup issues we're dealing with (see #7479 for another error).

I also saw the following after the 3strikes. Chris, is this the original error you saw when you reported this?

[  336.063874] libertas: Command 1f timed out
[  336.078771] libertas: requeueing command 1f due to timeout (#1)

Next Steps:

  • For the EC issues with the default 20ms timeout, Richard needs to log data on the EC side.
  • I'll go read #2621 in detail to understand the 3strikes issue.
  • Will rerun with test suspend mode, as was done in #2621
  • I'll run testing kernel with CONFIG_PROVE_LOCKING to see if it catches a deadlock that might be showing up in the 40ms timeout case.
  • Merge KGDB or KDB into testing kernel so I can poke at the system state when a hang does occur if no deadlock is found via above.
  • Run joyride with master kernel to see if same issues show up. Not much of the XO-specific code has changed between testing and master, but there maybe some other change in the x86 codepath that impacts this so I will just do this as a quick sanity check.

Any other ideas, please throw them my way. :)

in reply to: ↑ 15   Changed 6 years ago by dsaxena

  • status changed from new to assigned

Replying to rsmith:

If we can reproduce this on a somewhat reliable basis then I need to dup it on a machine that has an EC serial port loaded so I can see what gong on in the EC.

Basic test I am running is to associate with an AP and then run the following on the XO:

echo 9 > /proc/sys/kernel/printk
count=1
while true
do
        echo mem > /sys/power/state
        echo "***** Completed $count sleep cycles ****"
        count=$((count + 1))
done

And on the host just do a "ping <xo.ip.adrr>"

You will see the lockup anywhere from immediately to several 1000 cycles later.

  Changed 6 years ago by dsaxena

Ran suspend/resume test with master kernel and default 20ms timeout. Saw the 3strikes issue and the following at the end of the dmesg log:

[ 5013.726535] devpath 1 ep2out 3strikes
[ 5013.726535] usb8xxx: URB in failure status: -71
[ 5018.726576] libertas: Command 1f timed out
[ 5018.736382] libertas: requeueing command 1f due to timeout (#9)
[ 5018.736382] devpath 1 ep2out 3strikes
[ 5018.736382] usb8xxx: URB in failure status: -71
[ 5023.736734] libertas: Command 1f timed out
[ 5023.746717] libertas: requeueing command 1f due to timeout (#10)
[ 5023.746717] devpath 1 ep2out 3strikes
[ 5023.746717] usb8xxx: URB in failure status: -71
[ 5028.746658] libertas: Command 1f timed out
[ 5028.756565] libertas: Excessive timeouts submitting command 1f
[ 5028.756565] Resetting OLPC wireless via EC...
[ 5028.761025] olpc-ec:  running cmd 0x25
[ 5028.787862] BUG: spinlock lockup on CPU#0, lbs_main/785, ce9321dc
[ 5028.787862] Pid: 785, comm: lbs_main Tainted: G        W 2.6.26-rc9 #1
[ 5028.787862]  [<c04e6657>] _raw_spin_lock+0xac/0xc8
[ 5028.787862]  [<c06042ad>] _spin_lock_irq+0x18/0x1b
[ 5028.787862]  [<d09456ab>] lbs_thread+0x44e/0x6e7 [libertas]
[ 5028.787862]  [<c0410ec0>] ? default_wake_function+0x0/0xd
[ 5028.787862]  [<d094525d>] ? lbs_thread+0x0/0x6e7 [libertas]
[ 5028.787862]  [<c0425506>] kthread+0x39/0x5f
[ 5028.787862]  [<c04254cd>] ? kthread+0x0/0x5f
[ 5028.787862]  [<c0403bf7>] kernel_thread_helper+0x7/0x10
[ 5028.787862]  =======================

Also noticed that we're running into the WARN_ON() in olpc_ec_mask_set() every time. Currently doing a run of master with 40ms timeout but will focus mainly on poking at the testing kernel.

follow-up: ↓ 20   Changed 6 years ago by cjb

Could you attach the full dmesg?

Changed 6 years ago by dsaxena

Suspend/resume test log on 2.6.26

in reply to: ↑ 19 ; follow-up: ↓ 21   Changed 6 years ago by dsaxena

Replying to cjb:

Could you attach the full dmesg?

I've attached my full console log from running my test.

in reply to: ↑ 20 ; follow-up: ↓ 22   Changed 6 years ago by jcardona

Replying to dsaxena:

Replying to cjb:

Could you attach the full dmesg?

I've attached my full console log from running my test.

That console log does not show ec commands, nor the useful 'olpc_do_sleep!'. I'd recommend that you include those in future captures or else we only see half of the story.

in reply to: ↑ 21 ; follow-up: ↓ 24   Changed 6 years ago by dsaxena

Replying to jcardona:

Replying to dsaxena:

Replying to cjb:

Could you attach the full dmesg?

I've attached my full console log from running my test.

That console log does not show ec commands, nor the useful 'olpc_do_sleep!'. I'd recommend that you include those in future captures or else we only see half of the story.

Javier, I just downloaded the same log and verified that it (and my original) do show the EC commands and the 'olpc_do_sleep!'. md5sum cea7a1029e3bf4ac19981bc9a5cf28d4 on the log file.

  Changed 6 years ago by dsaxena

Ran a test for about 12 hours with 2.6.26 and 40ms timeout. Unfortunately I managed to not save my log, but from the last few suspend/resume cycles I saw, there were no EC timeout issues and no 3strikes. Even so, after 7000+ tests, the system locked up.

in reply to: ↑ 22   Changed 6 years ago by jcardona

Replying to dsaxena:

Replying to jcardona:

Replying to dsaxena:

Replying to cjb:

Could you attach the full dmesg?

I've attached my full console log from running my test.

That console log does not show ec commands, nor the useful 'olpc_do_sleep!'. I'd recommend that you include those in future captures or else we only see half of the story.

Javier, I just downloaded the same log and verified that it (and my original) do show the EC commands and the 'olpc_do_sleep!'. md5sum cea7a1029e3bf4ac19981bc9a5cf28d4 on the log file.

You are absolutely right. Now I'm looking at the right file. Sorry for the noise.

  Changed 6 years ago by pgf

  • cc pgf added

  Changed 6 years ago by dsaxena

I did some quick test of running with the new firmware (#7528) with default 20ms EC timeout and did not see 3strikes but did see EC timeouts and lockups within < 1000 cycles. Once just lock with no messages from wifi, second time an URB -2 failure.

Changed 6 years ago by rsmith

  Changed 6 years ago by rsmith

Have not seen any EC timeouts yet but I got the above ooops

Changed 6 years ago by rsmith

follow-ups: ↓ 29 ↓ 30   Changed 6 years ago by rsmith

oops2.log is another oops. I stopped the log since the kernel appeared to be in runaway oops mode. Eventually it did recover.

I do see timeouts in the log but nothing in my EC log indicates anything is wrong. Currently I'm very suspect of the fact that the timestamp on the printk does not increase between ec commands. That's something I've not see in prior kernels.

in reply to: ↑ 28 ; follow-up: ↓ 31   Changed 6 years ago by dsaxena

Replying to rsmith:

oops2.log is another oops. I stopped the log since the kernel appeared to be in runaway oops mode. Eventually it did recover. I do see timeouts in the log but nothing in my EC log indicates anything is wrong. Currently I'm very suspect of the fact that the timestamp on the printk does not increase between ec commands. That's something I've not see in prior kernels.

Investigating the timestamp oddness and the way that printk() is getting timestamps has changed from 2.6.22 to 2.6.25. In 2.6.22 we just read the TSC and used that for the printk() time whereas in the new kernel we're doing calling a wrapper function (update_rq_clock) that does some range checking. Looking at this wrapper, I believe it may be possible to end up with duplicate timestamps when coming out of resume, but I need to add some debug prints and dig a bit more into the timer code to validate.

I doubt that this is related to EC timeouts, but will test to rule it out for sure as we've got too many variables we're dealing with in this bug and I need to narrow down what to look at.

in reply to: ↑ 28   Changed 6 years ago by dsaxena

Replying to rsmith:

oops2.log is another oops. I stopped the log since the kernel appeared to be in runaway oops mode. Eventually it did recover.

So this is a totally new oops I haven't seen before. The libertas commands are failing with a -62 (timer expired) instead of -71 or -2 that we've seen with our testing so far. This failure causes us to get a bunch of NETDEV_TIMEOUTS. When we try to sleep again, the USB suspend fails with -110 (timeout). The kernel sees that one of the tasks refused to suspend and dumps a full backtrace of every process in the system.

in reply to: ↑ 29   Changed 6 years ago by dsaxena

Replying to dsaxena:

Investigating the timestamp oddness and the way that printk() is getting timestamps has changed from 2.6.22 to 2.6.25. In 2.6.22 we just read the TSC and used that for the printk() time whereas in the new kernel we're doing calling a wrapper function update_rq_clock) that does some range checking. Looking at this wrapper, I believe it may be possible to end up with duplicate timestamps when coming out of resume, but I need to add some debug prints and dig a bit more into the timer code to validate.

Did a quick test with replacing the call to cpu_clock() (which ends up calling update_rq_clock()) with a call to sched_clock() and see the duplicate timestamps go away but still see the timeouts and lockup;

[    1.714207] olpc_do_sleep!
+r[    0.005745] olpc-ec:  running cmd 0x34
[    0.022585] olpc-ec:  running cmd 0x24
[    0.043417] olpc-ec:  running cmd 0x24
[    0.067252] olpc-ec:  timeout waiting for EC to read command!
[    0.073016] olpc-ec:  running cmd 0x1c
[    0.088828] olpc-ec:  received 0xbb
[    0.092363] olpc-ec:  running cmd 0x1b
[    0.102154] olpc-ec:  sending cmd arg 0xff
[    0.122242] olpc-ec:  running cmd 0x84
[    0.138055] olpc-ec:  received 0x20

So while the timestamps are somewhat bogus, it is expected (comment for cpu_clock()) states "high-speed (but slightly incorrect) per-cpu clock constructed from sched_clock()"), confusing, and in my opinion should be fixed upstream, it is not related to the timeout and lockup issues.

  Changed 6 years ago by dsaxena

Was running another test and stumbled upon this scenario. I hit a lockup we've seen before, where we got timeouts and at somepoint the system stops responding in the middle of going to sleep:

[    2.549039] PM: Syncing filesystems ... done.
[    2.554305] PM: Preparing system for mem sleep                                       
[    2.559303] Freezing user space processes ... (elapsed 0.00 seconds) done.
[    2.573643] Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.      
[    2.581921] PM: Entering mem sleep

I hit <ctrl>-<c> on serial console and the system came back:


[   56.856441] olpc-ec:  running cmd 0x1c                                             
[   56.873266] olpc-ec:  received 0xff
[   56.876778] olpc-ec:  running cmd 0x1b                                               
[   56.886569] olpc-ec:  sending cmd arg 0xbb
[   56.907746] olpc-ec:  running cmd 0x32                                               
[   56.921724] olpc_do_sleep!
+r[    0.005723] olpc-ec:  running cmd 0x34                                             
[    0.021556] olpc-ec:  running cmd 0x24
[    0.042385] olpc-ec:  running cmd 0x24                                               
[    0.063217] olpc-ec:  running cmd 0x1c
[    0.080031] olpc-ec:  received 0xbb                                                  
[    0.083541] olpc-ec:  running cmd 0x1b
[    0.092331] olpc-ec:  sending cmd arg 0xff

When we get to wireless resume, we see:

[    2.158817] list_add corruption. prev->next should be next (cda05884), but was cda0587c. (prev=ce3f54d0).
[    2.168417] ------------[ cut here ]------------
[    2.173053] Kernel BUG at c04e0e77 [verbose debug info unavailable]
[    2.179336] invalid opcode: 0000 [#1] PREEMPT
[    2.183848] Modules linked in: usb8xxx libertas ieee80211 mmc_block ieee80211_crypt pcspkr mousedev ehci_hcd sdhci ohci_hcd mmc_core cs5535_gpio usbcore
[    2.197851]
[    2.199366] Pid: 1809, comm: sleep_test Not tainted (2.6.25.11 #5)
[    2.205481] EIP: 0060:[<c04e0e77>] EFLAGS: 00010046 CPU: 0
[    2.210995] EIP is at __list_add+0x34/0x4a
[    2.215109] EAX: 00000070 EBX: ce3f5504 ECX: ce3c3000 EDX: 00000001
[    2.221392] ESI: ce3f5504 EDI: cda04420 EBP: ce3c3e30 ESP: ce3c3e1c
[    2.227675]  DS: 0068 ES: 007b FS: 0000 GS: 0033 SS: 0068
[    2.233090] Process sleep_test (pid: 1809, ti=ce3c3000 task=cdc6c890 task.ti=ce3c3000)
[    2.240762] Stack: c069c646 cda05884 cda0587c ce3f54d0 00000001 ce3c3e4c d094adf7 cda058bc
[    2.249349]        00000283 0000001f cda04420 cd328800 ce3c3e84 d094c951 00000000 cda04420
[    2.257935]        ce3f5504 ccfea000 d0ad7293 00000020 00000002 ccf47400 ccfd80c0 cda04420
[    2.266523] Call Trace:
[    2.269194]  [<d094adf7>] ? lbs_queue_cmd+0x115/0x1b1 [libertas]
[    2.275379]  [<d094c951>] ? lbs_prepare_and_send_command+0x14ba/0x164e [libertas]
[    2.283003]  [<d0ad7293>] ? usb_submit_urb+0x1d5/0x1f1 [usbcore]
[    2.289237]  [<d094402d>] ? lbs_resume+0x64/0xc8 [libertas]
[    2.294936]  [<d0935b6f>] ? if_usb_resume+0x5f/0xab [usb8xxx]
[    2.300777]  [<d0ad8e35>] ? usb_resume_interface+0x52/0xf4 [usbcore]
[    2.307300]  [<d0ad8f8e>] ? usb_external_resume_device+0xb7/0xeb [usbcore]
[    2.314347]  [<d0ad8ff3>] ? usb_resume+0x31/0x37 [usbcore]
[    2.319993]  [<c051e748>] ? device_resume+0x7a/0x128
[    2.325038]  [<c0431e66>] ? suspend_devices_and_enter+0x103/0x11f
[    2.331225]  [<c0431fb4>] ? enter_state+0x110/0x16c
[    2.336194]  [<c04320a3>] ? state_store+0x93/0xab
[    2.340971]  [<c0432010>] ? state_store+0x0/0xab
[    2.345663]  [<c04d99a5>] ? kobj_attr_store+0x1a/0x22
[    2.350792]  [<c048daa6>] ? sysfs_write_file+0xb0/0xdb
[    2.356022]  [<c048d9f6>] ? sysfs_write_file+0x0/0xdb
[    2.361146]  [<c0457cda>] ? vfs_write+0x8c/0x10e
[    2.365855]  [<c0458196>] ? sys_write+0x3b/0x60
[    2.370465]  [<c0403812>] ? sysenter_past_esp+0x5f/0x85
[    2.375778]  =======================
[    2.379364] Code: 39 d0 74 14 51 50 52 68 f6 c5 69 c0 e8 bd 4b f3 ff 0f 0b 83 c4 10 eb fe 8b 10 39 ca 74 14 50 52 51 68 46 c6 69 c0 e8 a3 4b f3 ff <0f> 0b 83 c4 10 eb fe 89 5a 04 89 13 89 43 04 89 18 8b 5d fc c9
[    2.399892] EIP: [<c04e0e77>] __list_add+0x34/0x4a SS:ESP 0068:ce3c3e1c
[    2.406578] ---[ end trace a7fd94e034e86e3b ]---
[    2.411217] note: sleep_test[1809] exited with preempt_count 1
[    2.416979] BUG: sleeping function called from invalid context at kernel/rwsem.c:21
[    2.424652] in_atomic():1, irqs_disabled():1
[    2.428947] Pid: 1809, comm: sleep_test Tainted: G      D  2.6.25.11 #5
[    2.435497]  [<c040f4a6>] __might_sleep+0x98/0x9f
[    2.440262]  [<c05f449e>] down_read+0x15/0x23
[    2.444670]  [<c04330fe>] acct_collect+0x36/0x14c
[    2.449422]  [<c0417943>] do_exit+0x189/0x5cc
[    2.453837]  [<c0415a2f>] ? printk+0x15/0x17
[    2.458182]  [<c0414da2>] ? print_oops_end_marker+0x1e/0x23
[    2.463834]  [<c04041e9>] die+0x113/0x11b
[    2.467895]  [<c05f566b>] do_trap+0x7d/0x85
[    2.472135]  [<c040439c>] ? do_invalid_op+0x0/0x76
[    2.477001]  [<c0404408>] do_invalid_op+0x6c/0x76
[    2.481754]  [<c04e0e77>] ? __list_add+0x34/0x4a
[    2.486448]  [<c05f5184>] ? _spin_unlock_irqrestore+0x17/0x2c
[    2.492279]  [<c0415455>] ? release_console_sem+0x193/0x19b
[    2.497933]  [<c04159fd>] ? vprintk+0x2b3/0x2d0
[    2.502539]  [<d0ad6eeb>] ? usb_hcd_submit_urb+0x83e/0x91b [usbcore]
[    2.509075]  [<c05f5392>] error_code+0x6a/0x70
[    2.513575]  [<c041007b>] ? sched_show_task+0x7f/0xa6
[    2.518706]  [<c0450000>] ? unmap_hugepage_range+0xc/0x42
[    2.524183]  [<c04e0e77>] ? __list_add+0x34/0x4a
[    2.528878]  [<d094adf7>] lbs_queue_cmd+0x115/0x1b1 [libertas]
[    2.534830]  [<d094c951>] lbs_prepare_and_send_command+0x14ba/0x164e [libertas]
[    2.542253]  [<d0ad7293>] ? usb_submit_urb+0x1d5/0x1f1 [usbcore]
[    2.548435]  [<d094402d>] lbs_resume+0x64/0xc8 [libertas]
[    2.553935]  [<d0935b6f>] if_usb_resume+0x5f/0xab [usb8xxx]
[    2.559575]  [<d0ad8e35>] usb_resume_interface+0x52/0xf4 [usbcore]
[    2.565901]  [<d0ad8f8e>] usb_external_resume_device+0xb7/0xeb [usbcore]
[    2.572747]  [<d0ad8ff3>] usb_resume+0x31/0x37 [usbcore]
[    2.578203]  [<c051e748>] device_resume+0x7a/0x128
[    2.583042]  [<c0431e66>] suspend_devices_and_enter+0x103/0x11f
[    2.589019]  [<c0431fb4>] enter_state+0x110/0x16c
[    2.593781]  [<c04320a3>] state_store+0x93/0xab
[    2.598359]  [<c0432010>] ? state_store+0x0/0xab
[    2.603049]  [<c04d99a5>] kobj_attr_store+0x1a/0x22
[    2.607980]  [<c048daa6>] sysfs_write_file+0xb0/0xdb
[    2.613010]  [<c048d9f6>] ? sysfs_write_file+0x0/0xdb
[    2.618142]  [<c0457cda>] vfs_write+0x8c/0x10e
[    2.622635]  [<c0458196>] sys_write+0x3b/0x60
[    2.627046]  [<c0403812>] sysenter_past_esp+0x5f/0x85
[    2.632160]  =======================

The first BUG in there is from the wireless driver's lbs_queue_cmd() function appending to a list that has been corrupted. The second BUG I'm still trying to trace down completely, but given that it has to do with sleeping at the wrong time, the fact that I could wake the system up from being stuck in some context, the list corruption, the message "sleep_test[1809] exited with preempt_count 1", and some other behaviour I've seen I ran a test with CONFIG_PREEMPT disabled to make 100% sure that our issues are not due to a preemption bug. I reproduced the exact same crash, so one more variable out of the way.

Will analyze the crash path and other crashes in detail and see if I can find a correlation in all of them other than the EC timeout. One note of interest is that -stable kernel is hard coded with 40ms timeout.

  Changed 6 years ago by dsaxena

  • cc mbletsas added

I've spent some time running many many test iterations to find a pattern and get a better understanding of where the issue might be. I started by adding debug code to various paths in the kernel to trace the suspend path as we go down the kernel stack the main issue I was seeing is a lockup in the Libertas driver's suspend path where the driver submits a sleep request and the the device never responds.

I wanted to separate out the EC timeouts from the wireless lockup so looked at the few changes in the EC handling code from 2.6.22 to 2.6.25, backed them out, and ran the tests again. No luck.

Next I started looking at the driver itself and trying to see if there are any major changes in the suspend/resume path. Since the code has been massively updated between stable and 2.6.25, I ended up just replacing the full 2.6.25 driver with the 2.6.22 driver. Ran tests again, no luck.

I was using 2.6.22 as a baseline b/c it is a know quantity and I assumed that it did not have the same issues since we're shipping 703 and 708 based on it. This turned out to be an incorrect assumption. When I tested with the 2.6.22 kernel, I was able to reproduce a suspend/resume lockup under stress, but somewhat different from what I saw with 2.6.25.

I also wanted to rule out the firmware as a source of the issue, so I backed out the firmware that we are including in Joyride (5.110.22.p14) and dropped in the firmware from build 703 (5.110.22p1). Still got a lock up, but again, somewhat different.

Following is a summary of the various failure cases I saw in my extensive testing. All tests were performed with ec_timeout set to 40ms, using XO firmware q2e11. Test consists of XO looping continuously and going to sleep with a host continuously pinging the XO.


2.6.25 kernel + 2.6.25 driver + old fw

  • Sometimes see long (>> 10000ms) ping delays once we start sleep loop
  • After a few sleep cycles, kernel stops at olpc_do_sleep(), and does not wake back up from WLAN ("Destination Host Unreachable" messages on host).
  • ctrl-c on console does not do anything while stuck
  • Moving mouse, power button, or other input wakes system back up but goes back to waiting state on next sleep cycle.
  • Can stop the sleep loop by hitting ctrl-c in the small window between wake and sleep in above case and system immediately starts responding to pings.

2.6.25 kernel + 2.6.25 driver + new fw

Failure Mode 1:

  • Sporadic ping response times after sleep looping beings, up to 20000ms
  • After several cycles (anywhere from 30 to several 100s), system gets stuck waiting for libertas to respond to sleep command.
  • Even though the wireless card did not respond back (or the kernel did not receive the response), we are receiving WLAN wake up events from the EC:
      olpc-ec:  running cmd 0x84
      olpc-ec:  received 0x20
      olpc-pm:  SCI 0x20 received
      olpc-ec:  running cmd 0x84
      olpc-ec:  received 0x0
      olpc-pm:  SCI 0x0 received

Javier, Richard, is it possible to receive a 0x20 SCI if the WLAN chip is not actually in suspend mode?

  • Hitting ctrl-c on serial console causes sleep cycle to complete but upon waking up we get a list corruption bug in libertas driver's lbs_queue_cmd() function. What this points to is that the driver is still processing when it should not. At this point the system is completely dead.

Failure Mode 2:

  • System hits olpc_do_sleep(), but we see an EC error when disabling EC_SCI_SRC_BATSOC && EC_SCI_SRC_ACPWR events (Not sure why we do this...Dilinger?):
        /* We don't want to wake up on superfluous events */
        olpc_ec_mask_unset(EC_SCI_SRC_BATSOC | EC_SCI_SRC_ACPWR);
[  295.122289] olpc-ec:  running cmd 0x1c
[  295.122289] olpc-ec:  timeout waiting for EC to read command!
[  295.122289] olpc-pm:  error getting SCI mask: -5
  • Kernel tells EC to sleep:
    [  295.122289] olpc-ec:  running cmd 0x32
    [  295.122289] olpc_do_sleep!
    
  • We wake up from wlan event but kernel hangs on trying to wake up the wireless device:
+r[  295.122289] olpc-ec:  running cmd 0x34
[  295.122289] olpc-ec:  running cmd 0x24
[  295.122289] olpc-ec:  running cmd 0x24

2.6.25 kernel + 2.6.22 driver + new fw

  • Sleeps once, wakes up, does not sleep again. Driver gets stuck waiting for libertas to respond to sleep command
  • Hitting ctrl-c on console causes sleep cycle to complete. Touchpad/keybd/power button does not.
  • Waking system back up casues list corruption bug in lbs_queue_cmd(), but is not fatal. System still running but network interface is dead at this point.

2.6.25 kernel + 2.6.22 driver + old fw

  • Sleeps once, wakes up, does not sleep again. Gets stuck waiting for libertas to respond to sleep command
  • Hitting ctrl-c on console causes sleep cycle to complete.
  • Touchpad/keybd/power button does not.
  • System does not have network connectivity at this point.
  • Hitting ctrl-c on console at this point does nothing
  • Touchpad/keybd/power causes system to wakeup and crash with a spinlock error in the khub kernel driver.

2.6.22 kernel + old fw

  • Hangs at olpc_do_sleep() after a number of cycles.
  • Keybd/mouse/pwr wakes system up
  • Can wake system up and hit ctrl-c to stop sleep loop and system immediately starts responding to pings.

2.6.22 kernel + new fw

  • Hangs at suspend of libertas driver
  • ctrl-c completes suspend, stops at olpc_do_sleep()
  • Waking up with touchpad/keyboard leads to list corruption in lbs_queue_cmd()
  • System completely hung at this point.

Note that even 2.6.22 kernel + old FW, which is what we are shipping in 703, does not always successfuly wake-on-lan, it just happens to not be fatal.

I'm fairly certain the root cause of this is a race somewhere in the kernel/EC/Libertas suspend stack. The kernel is sending a command to the device and it may or not be succeeding but if it is, we're not getting notified in time. Due to the race, we're getting out of sync, either driver <-> libertas, or driver <-> EC. How this ends up impacting the system can drastically change based on timing of the system. For example, if I enable full debug output from the libertas driver ({{{echo 0xffffffff > /sys/module/libertas/libertas_debug) I get a very different set of errors from the driver compared to having it just set to 0.

I'm not sure what role the EC timeouts play in this whole mess. With the 40ms timeout, I'm barely seeing them but still running into the same crashes so they might be orthogonal to the issues we're seeing here; however, they still need to be looked into.

To really root cause this I think we're going to have to look at debug output of all three components of the system (kernel, EC, wireless) to see what they are sending and how each component is reacting. Our goal is to ship latest kernel + firmware, so we need to focus on that specific combination.

Next steps:

  • Javier, can you do some debug on your end to see what's happening in the firmware when we hit the condition where the kernel does not get a response to sleep? I can post my debug bits for kernel side.
  • I will try to track down the lbs_queue_cmd() list corruption in the kernel and look for general racyness in the generic olpc and driver suspend/resume path.
  • Richard/Dilinger/me should look into the EC timeouts.

  Changed 6 years ago by dwmw2

  • cc dwmw2 added

  Changed 6 years ago by dsaxena

  • summary changed from Intermitent suspend/resume lockup to Intermittent suspend/resume lockup during WOL stress testing

follow-up: ↓ 37   Changed 6 years ago by dsaxena

  • summary changed from Intermittent suspend/resume lockup during WOL stress testing to Intermittent lockup during WOL suspend/resume stress

in reply to: ↑ 36 ; follow-up: ↓ 38   Changed 6 years ago by rsmith

Replying to dsaxena:

Looking at my EC logs the EC is processing the commands. Perhaps there is still some sort of race in the way we deal with the IBF and OBF flags. I need to study the way the kernel looks and the flags and then think about how the EC clears them.

Please try my latest firmware:

 http://dev.laptop.org/~rsmith/q208n.rom

This has my fastpath EC command code in it and may give us some different results. If its some sort of set/clear race before the kernel every gets change to see the flag then this code may make it worse since in the simple case this code will process the command in a single pass of the EC processing loop.

in reply to: ↑ 37   Changed 6 years ago by dsaxena

Replying to rsmith:

Please try my latest firmware:  http://dev.laptop.org/~rsmith/q208n.rom This has my fastpath EC command code in it and may give us some different results. If its some sort of set/clear race before the kernel every gets change to see the flag then this code may make it worse since in the simple case this code will process the command in a single pass of the EC processing loop.

This definitely made things worse in terms of triggering the lockups with the ping test. I can suspend/resume just fine with wakeup-by-touchpad but if I am pinging the system when I suspend, I hit the lockup within one cycle with the default 20ms timeout. With 40ms timeout things work, but still hitting the lockup.

follow-up: ↓ 40   Changed 6 years ago by dilinger

Regarding the EC timeouts (it would be nice to have that in a separate bug, but for now):

We see two different timeouts; one is before the command starts (timeout-waiting-to-quiesce), and one after sending the command (timeout-waiting-for-command-read). In both cases, we're waiting for the IBF flag to be cleared. There was a theory about the LPC bus floating during suspend, which could cause IBF to be in an unknown state immediately after resume.. however, if we pass the IBF-quiesce check, then we can be assured that IBF is clear before sending the EC our command byte. So, instead:

15:06 =dilinger>         outb(cmd, 0x6c);
15:06 =dilinger>         if (wait_on_ibf(0x6c, 0)) {
15:06 =dilinger>                 printk(KERN_ERR "olpc-ec:  timeout waiting for EC to read "
15:06 =dilinger>                                 "command!\n");
...
15:08 =dilinger> we stuffed a byte into 0x6c, which presumably sets IBF
15:09 =dilinger> and then time out waiting for IBF to be clear
15:09 =smithbone> yeah.. writes to 6c set ibf.
15:09 =dilinger> we have a mdelay(1) in our IBF poll loop
15:09 * smithbone wonders if I when I clear the flag if it really gets clear.
15:10 =smithbone> I can clear then check. 
15:10 =dilinger> is it possible for the EC to clear IBF and then set it again 
                 during that 1ms delay?
15:10 =dilinger> *nod*
15:10 =smithbone> Oh.. yeah...
15:10 =smithbone> If I stick the data back in the register.
15:10 =smithbone> That could totally happen.
15:11 =smithbone> And fastpath would make it much worse.
15:11 =dilinger> so, um...
15:11 =smithbone> Hmmm.... 
15:11 =smithbone> heh..
15:11 =dilinger> is there a way we can prevent that? :)
15:12 =dilinger> either by the kernel telling the EC that it's waiting for IBF, 
                 and the EC not clearing IBF until the kernel has said that 
                 it's no longer waiting..
15:12 =smithbone> Dunno.. Let me think.. ok time for my con call.. Excellent 
                  brainstorming.. I think we may be on to it.
15:12 =dilinger> or by changing the kernel code to not delay 1ms (which still 
                 seems racy, and also is going to hurt performance)
15:13 =smithbone> yeah.. let me ponder a bit.
15:13 =dilinger> or by having the EC detect that the kernel is busy polling 
                 IBF, and not touch it until it's been at least 2ms since the 
                 kernel read?

in reply to: ↑ 39   Changed 6 years ago by rsmith

Replying to dilinger:

Regarding the EC timeouts (it would be nice to have that in a separate bug, but for now):

Lots of stuff changed since this last post. We need to test again for timeouts with the latest EC code.

  Changed 6 years ago by rsmith

  • keywords joyride-2131:- removed
  • milestone changed from 8.2.0 (was Update.2) to 9.1.0

  Changed 6 years ago by dsaxena

  • keywords blocks+:9.1.0 added
  • blocking 7393 removed

I am going through all kernel bugs marked as 9.1 or future release and updating their status, next action, etc in preparation of 9.1 bug scrubbing and future release planning.

This bug, along with #7594 was the main reason to not enable aggressive suspend resume in 8.2 from what I recall. We need to start running these tests again and if we still hit this, come up with a plan to attack at all levels (kernel, ec, WLAN firmware, etc)....

  Changed 4 years ago by pgf

the list_add corruption issues are probably the result of #10177.

the majority of crash.log and oops2.log are due to #10176.

neither of these has yet been fixed, but at least they're better understood.

Note: See TracTickets for help on using tickets.