Opened 11 years ago

Closed 11 years ago

Last modified 6 years ago

#9707 closed defect (fixed)

occasional oops in sdio_remove_func() in S/R

Reported by: John Watlington Owned by: Deepak Saxena
Priority: blocker Milestone:
Component: kernel Version: 1.5-B2
Keywords: Cc:
Blocked By: Blocking:
Deployments affected: Action Needed: no action
Verified: no

Description

On an XO-1.5 B2 prototype, running q3a16 and os42, I set up the laptop as indicated at: http://wiki.laptop.org/go/XO1.5_SD_suspend#Automating_resumes

I set up autowack from OFW, enabled the EC as a wakeup source using ACPI, then started a loop running, suspending the machine after a second of uptime.

The WLAN modules were removed from /lib/modules, and manually removed after boot as well (they are also provided by the initrd ?). NetworkManager was stopped.

After 1441 cycles without a crash, the laptop crashed on resume the 1442nd time with a "unable to handle kernel paging request" message.

This was the first time I had managed to get this script to run more than 140 - 150 times. Every time the crash symptoms appeared the same...

The serial port output is attached.

Attachments (2)

tmp (34.0 KB) - added by John Watlington 11 years ago.
Serial port output from the laptop which crashed
p1_crash.log (17.2 KB) - added by John Watlington 11 years ago.
serial log from last couple of cycles before a crash on B2 prebuild #1

Download all attachments as: .zip

Change History (31)

Changed 11 years ago by John Watlington

Attachment: tmp added

Serial port output from the laptop which crashed

comment:1 Changed 11 years ago by dsd

Summary: Crash during resume on XO-1.5occasional oops in sdio_remove_func() in S/R

Do we know if this oops happened during suspend or resume?

comment:2 Changed 11 years ago by dsd

mmc_attach_sdio() seems to be silently encountering an error and then crashing somewhere in the error handling path. The first question is what is failing and why. I just committed a patch to add some printks which will give more of a clue where the error is. I'll update this ticket later when there is a build available.

comment:3 Changed 11 years ago by dsd

Chris points out that it might not be silently failing, because the console log level is so low we wouldn't even be seeing error messages over serial.

John, please run "dmesg -n 7" before your test. (there will be a lot of info spewed to serial on every suspend/resume though, might slow it down considerably?)

And the kernel referenced above is ready at http://xs-dev.laptop.org/~dsd/repos/f11-xo1.5/kernel-2.6.31_xo1.5-20091119.1110.1.olpc.646f97d.i586.rpm

Changed 11 years ago by John Watlington

Attachment: p1_crash.log added

serial log from last couple of cycles before a crash on B2 prebuild #1

comment:4 Changed 11 years ago by James Cameron

Keywords: q3a16 os42 b2 added

triage.

comment:5 Changed 11 years ago by Deepak Saxena

Tried to reproduce this on a B2 (not fully ECO'd into a B3) and went over 5K S/R cycles w/o any issues before I accidentally pulled the plug.

comment:6 Changed 11 years ago by John Watlington

That may be significant. I have been testing this on B2's without any ECOs in that department (one of the test machines has 32b DDR2).

Try it again and try not to pull the plug... I have run to 2300 and 2500 cycles, but haven't seen a 5K, and many sooner.

comment:7 Changed 11 years ago by James Cameron

Have two B2 units now running this test, 300 cycles so far, will leave them to run overnight.

comment:8 Changed 11 years ago by dsd

Summary: occasional oops in sdio_remove_func() in S/Roccasional oops in sdio_remove_func() in suspend

So the crash happens during suspend.

Some interesting messages from before the crash.

During resume:

[ 2109.784459] mmc2: error -110 during resume (card was removed?)
[ 2109.790912] mmc2: card 0001 removed
[ 2109.794800] sdhci_set_ios: power off for mmc2 from b0669941
[ 2109.800622] sdhci_set_power: new power value = 0
[...]
[ 2109.915527] mmc2: new SDIO card at address 0001
[ 2109.920402] mmc mmc2:0001: parent mmc2 should not be sleeping

Do those messages always appear during resume on a B3?

The messages are different when the crash happens. The final 2 lines do not appear during the resume that happens immediately beforehand, as if the card was never redetected. It would be interesting to see logs from another crash to confirm that these conditions are identical.

mmc_rescan is a workqueue-based task. Maybe the difference in the crash case is that it wasn't scheduled until suspend started to happen. (actually thats probably not possible because processes get frozen)

Or to attack the problem from a different angle, here's one plausible way we could have reached mmc_rescan in the suspend path:

sdhci_pci_suspend() attempts to suspend the host, but if it fails then it attempts to resume it: sdhci_resume_host() --> mmc_resume_host() --> mmc_detect_change() --> schedule_delayed_work(&host->detect) --> mmc_rescan()

comment:9 Changed 11 years ago by dsd

asked on the linux-mmc list for input: http://marc.info/?l=linux-mmc&m=125870709414481&w=2

comment:10 Changed 11 years ago by James Cameron

The two B2s here failed to proceed beyond 2004 cycles and 2216 cycles respectively. They stopped within four minutes of each other. They had been powered up at the same time.

comment:11 Changed 11 years ago by dsd

oh, wait, all these tests are being done on a B2? So, the thing its choking on is the wifi card (mmc2).

comment:12 Changed 11 years ago by John Watlington

dsd,

Yes, the thing it is chocking on is the WiFi card. I removed the Wifi cards from two B2s, and they haven't crashed yet (5400 and 2300 cycles). I also performed the WLAN PWR ECO on two laptops (which keeps the WLAN powered in suspend) and they haven't crashed either (although they haven't run enough to be conclusive.)

BTW, I am simultaneously running my SD wear testing at the same time on all these machines. So far, I haven't seen any errors generated in the suspend/resume testing.

comment:13 Changed 11 years ago by John Watlington

I'm suspecting that we might be occasionally powering the WLAN card back up too soon, causing the occasional failure to communicate after power-on.

I'm running three experiments:

  • two machines without WLAN card, one running for 7600 cycles without a crash
  • two machines with the WLAN power ECO, which maintains power to the card.
  • one machine running with a 4 sec. delay before restart (instead of 80 ms).

So far, none have crashed.

comment:14 Changed 11 years ago by James Cameron

I'm running a test with one B2 with d# 4000 autowack-delay.

comment:15 Changed 11 years ago by John Watlington

A laptop without a WLAN card made it to 9100 cycles before I moved it to testing a delay of d# 1000 autowack-delay.

comment:16 in reply to:  14 Changed 11 years ago by James Cameron

Replying to Quozl:

I'm running a test with one B2 with d# 4000 autowack-delay.

This test terminated at around the 1000 cycle mark, at kernel uptime 4381 seconds, with a kernel panic at sdio_remove_func+0x9/0x27 and process kmmcd. Code: 04 85 d2 75 14 83 c2 0c f7 02 ff 00 ff ff ...

comment:17 Changed 11 years ago by John Watlington

Yep. While they seemed to run longer (two runs to around 5K), increasing the autowack_delay doesn't prevent the inevitable crash.

comment:18 Changed 11 years ago by cjb

Wad, seen any failures on either ECO'd B2s or B3s yet? Or are we still talking about a failure that only affects the un-ECO'd B2s?

comment:19 Changed 11 years ago by dsd

I'd say the problem we should really be looking into is this one:

[ 2109.784459] mmc2: error -110 during resume (card was removed?)

This happens during resume a long time before the crash (you can see in p1_crash.log that it occurs during resume, then system suspends, then it resumes with the same error, then it suspends, and at this point the crash occurs). I presume this message occurs on every resume after a certain point. It certainly doesn't occur during a normal suspend/resume (tested twice on my own B2).

In terms of the crash itself, I'm still not entirely sure why we are hitting mmc_rescan() in the suspend path. However, Matt Fleming on the linux-mmc list pointed out a bug in the error path which we are hitting which would explain the crash.

The other question is then why mmc_rescan() is failing, specifically when initializing a func on the wifi card.

I added some debug messages to all of the above code paths as well as a crash-fix based on Matt's observation, commit 9e0570c3844428128791a2363787af4022cc37db. Please test with new kernel and post new logs. Note that it might not crash any more, the point of interest is simply when the "mmc2: error -110 during resume (card was removed?)" messages start appearing.

comment:20 Changed 11 years ago by dsd

Cc: dsd added

comment:22 Changed 11 years ago by John Watlington

cjb, at this point, it seems safe to say that this error only occurs on laptops which are power cycling the WLAN card on every suspend/resume. I have two B2 laptops with the WLAN power ECO which ran for 48K S/R cycles over the weekend, and another without a WLAN card that also ran 48K cycles. All also completed 6500 cycles (120 GB of writes) of SD wear levelling testing at the same time, with no errors.

comment:23 Changed 11 years ago by John Watlington

Cc: dsd removed
Keywords: q3a16 os42 b2 removed
Milestone: 1.5-hardware-C1.5-hardware-B
Summary: occasional oops in sdio_remove_func() in suspendoccasional oops in sdio_remove_func() in S/R

Somewhat anticipating a development like this, we did run a WLAN reset signal to the southbridge. In XO-1, we never got the WLAN module to reliably reset itself on power-up. On XO-1.5, a similar reset signal (WLAN_RESET#, active low) goes to pin GPIO0 (powered in suspend, guaranteed to power up high or deasserted for an active low signal).

I suggest that we fix the bug in the mmc_rescan error handling code, and add a quirk to the libertas driver that asserts a side-channel reset signal (in our case through a GPIO) whenever the device is powered up.

comment:24 Changed 11 years ago by Paul Fox

on a B3, with up-to-date kernel (i.e., i have daniel's patches), with an associated wlan, my very first suspend/resume resulted in:

[  414.481447] mmc1:0001: error -123 reading SDIO_CCCR_INTx
[  414.481490] mmc_sdio_resume: init card failed error -123

my wireless is now inactive, and there's no eth0.

comment:25 in reply to:  21 Changed 11 years ago by John Watlington

Replying to dsd:

new kernel available for testing: http://xs-dev.laptop.org/~dsd/repos/f11-xo1.5/kernel-2.6.31_xo1.5-20091123.0440.1.olpc.9e0570c.i586.rpm

dsd, I don't know what you did in this kernel, but I have two unmodified B2s (which are cycling power to the SD card every S/R cycle) running it for over 35K cycles.

comment:26 Changed 11 years ago by James Cameron

Milestone: 1.5-hardware-B1.5-software-beta

comment:27 Changed 11 years ago by dsd

Resolution: fixed
Status: newclosed

OK. The fix has been sent upstream and included in olpc-2.6 for the next build.

The question still remains, why did the card start failing to resume?, but these are outdated systems and we have plenty of other tickets about the wifi-resume problems. So I think we're done here.

comment:28 Changed 11 years ago by James Cameron

Action Needed: diagnoseno action

Was verified fixed in os64.

comment:29 Changed 6 years ago by James Cameron

Milestone: 10.1.0-beta

Milestone 10.1.0-beta deleted

Note: See TracTickets for help on using tickets.