Opened 4 years ago

Closed 4 years ago

#12688 closed defect (fixed)

8787 wifi multicast list management corrupts memory

Reported by: Quozl Owned by: dsd
Priority: blocker Milestone: 13.2.0
Component: kernel Version: Development build as of this date
Keywords: Cc:
Blocked By: Blocking:
Deployments affected: Action Needed: add to build
Verified: no

Description

An XO-4 B1 SKU292 was loaded with 13.2.0-7, followed by a yum update kernel and a reboot. /runin/soiled and /runin/force were created. Runin did hang with:

  • static display, without the text below Remaining : 23:19:22,
  • camera indicator on,
  • power indicator off (it will eventually power off).

Attachments (9)

runin_console.log (160.2 KB) - added by rsmith 4 years ago.
mywifiex_modules_removed_and_no-wireless.log (41.3 KB) - added by rsmith 4 years ago.
os8.log (89.0 KB) - added by wad 4 years ago.
os8_SHC247006E7_20130531_033609.gz (530.8 KB) - added by shep 4 years ago.
page fault in kworker after 12 hours of aggressive runin, os8 on XO-4
relevant_excerpt_from_os8_SHC247006E7_20130531_033609 (3.7 KB) - added by shep 4 years ago.
so the crash I got reminds me of
os8_SHC247006E7_20130531_161252.gz (373.7 KB) - added by shep 4 years ago.
another kworker page fault crash shortly after bluetooth and mwifiex activity on resume
relevant_excerpt_from_os8_SHC247006E7_20130531_161252 (3.4 KB) - added by shep 4 years ago.
this one reminds me very much of #12197
xo-4-b1-sku292-build-8-no-btmrvl-graphics-hang.jpg (1.2 MB) - added by Quozl 4 years ago.
XO-4 B1 SKU292 display, close-up, using USB microscope at six inch focus
os9_4.log (149.9 KB) - added by wad 4 years ago.
log from hang running os9

Change History (52)

comment:1 Changed 4 years ago by Quozl

A runin hang reproduced on XO-4 C2 SKU306 at Remaining : 22:55:16, with battery indicator on, and text below Remaining visible.

kern.log in logdir/ hasn't shown anything useful.

comment:2 Changed 4 years ago by Quozl

  • Action Needed changed from never set to reproduce
  • Milestone changed from Not Triaged to 13.2.0
  • Version changed from not specified to Development build as of this date

A runin hang reproduced on XO-4 B1 SKU292 at Remaining : 20:08:31, same pattern as above.

comment:3 Changed 4 years ago by Quozl

  • Action Needed changed from reproduce to diagnose

http://dev.laptop.org/~quozl/z/1UgujW.txt has an exciting kernel crash of the XO-4 B1 SKU292 reproducing this problem for the third time, at Remaining : 22:42:26.

comment:4 follow-up: Changed 4 years ago by dsd

  • Cc dsd added

What about the other 2 times, did you have serial connected?

I suspect that the crash shown in the above log is random, possibly caused by memory corruption or something, it would be interesting to know if it can be reproduced again.

I wonder what the significance of the power indicator being off is.

Changed 4 years ago by rsmith

comment:5 in reply to: ↑ 4 Changed 4 years ago by Quozl

Replying to dsd:

What about the other 2 times, did you have serial connected?

No, and there was nothing in the filesystem logs. (I wish we had something like the drm_kms_helper that my desktop system has, which switches to console graphics mode when a panic occurs. Not sure if it would help, but it would mean we could get details from systems that didn't have a serial logger connected.)

I suspect that the crash shown in the above log is random, possibly caused by memory corruption or something, it would be interesting to know if it can be reproduced again.

Agreed. Multiple hits will tell us more about it. It reproduces within a few hours, so it should not take too long to get data.

I wonder what the significance of the power indicator being off is.

Probably none. runin-battery directs the EC to disconnect the external DC several times. The fact that the power indicator has been found to be on as well as off suggests that it has nothing to do with the problem.

comment:6 follow-up: Changed 4 years ago by Quozl

An XO-4 B1 SKU292 hung differently:

  • no kernel output,
  • static display, Remaining : 19:35:23,
  • camera, power, battery, and storage indicators on,
  • storage indicator flashing off at a rate dependent on ambient light.

Last serial console output was: http://dev.laptop.org/~quozl/z/1UhBVc.txt

comment:7 in reply to: ↑ 6 Changed 4 years ago by Quozl

Replying to Quozl:

Last serial console output was: http://dev.laptop.org/~quozl/z/1UhBVc.txt

Curiously, /runin/logdir/kern.log had more text after the last serial console output: http://dev.laptop.org/~quozl/z/1UhBbi.txt

comment:8 Changed 4 years ago by Quozl

An XO-4 B1 SKU292 hung:

Reviewing my earlier reports here, I think it is unlikely that I saw a power indicator off, and I probably conflated it with battery indicator off. Sorry about that.

comment:9 Changed 4 years ago by dsd

That last kernel output and the earlier crash both happen from the ifconfig process. It's far from certain, but that might suggest something mwifiex related.

Has this been seen on any systems with 8686 instead of 8787?

comment:10 Changed 4 years ago by wad

  • Priority changed from normal to blocker

Duplicated with another log at #12691

comment:11 Changed 4 years ago by rsmith

Attaching a log of an 8787 machine but with mwifiex modules deleted and the no-wireless option enabled. Still crashed.

comment:12 Changed 4 years ago by dsd

Richard also reports that it seems to be a regression introduced between 13.2.0 build 2 and build 3.

I'm having trouble reproducing this. Clean flash of build 7 on two XO-4's, updated the kernel to fix the bluetooth crash, modified fscheck to not report failure. touch /runin/force && reboot - both still running after an hour.

comment:13 Changed 4 years ago by Quozl

Overnight tests here:

  • XO-4 B1 SKU293 with 8686, did not reproduce over 18 hours,
  • XO-4 B1 SKU292 with 8787, reproduced after 6 hours, http://dev.laptop.org/~quozl/z/1UhRm5.txt
  • XO-4 C2 SKU306 with 8787, reproduced after 10 hours, no serial cable mounted.

Confirming the reproducer:

fs-update u:\32007o4.zd
boot
rpm -i kernel-3.5.7_xo4-20130523.1701.olpc.36da52f.armv7hl.rpm
touch /runin/{soiled,force}
reboot

Environment is two APs beaconing an 802.11g ESSID "qz", with one of them beaconing an 802.11n ESSID "n" on 5GHz only. Regular traffic with the APs from other systems.

ifconfig is used frequently in the runin-wlan test.

comment:14 Changed 4 years ago by Quozl

comment:15 Changed 4 years ago by Quozl

I have reconfigured my test bed to do very quick runin, 60 or 90 seconds, using the BT tag, in an attempt to also catch #12692. Results so far:

I observe the kernel timestamps after panic are not monotonic.
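A check like this can be automated when reviewing captured logs. The following is a minimal sketch, assuming each kernel line carries a `[seconds.microseconds]` timestamp as in the logs quoted in this ticket. The helper name `check_monotonic` is hypothetical and is not part of runin.

```shell
#!/bin/sh
# Sketch only: flag kernel log lines whose "[seconds.micros]" timestamp
# goes backwards relative to the previous timestamped line.
# check_monotonic is a hypothetical helper name, not part of runin.
check_monotonic() {
    awk '
        match($0, /\[ *[0-9]+\.[0-9]+\]/) {
            # Strip the brackets; adding 0 forces a numeric value.
            ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
            if (seen && ts < prev)
                printf "line %d: %.6f follows %.6f\n", NR, ts, prev
            prev = ts
            seen = 1
        }
    ' "$@"
}
```

It reads files given as arguments, or standard input when none are given, for example `check_monotonic console.log` where `console.log` is a saved serial capture.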

comment:16 Changed 4 years ago by Quozl

comment:17 Changed 4 years ago by Quozl

comment:18 Changed 4 years ago by dsd

James, could you attempt to verify Richard's finding that build 2 is stable and build 3 is where the problem first appears?

comment:19 Changed 4 years ago by Quozl

comment:20 Changed 4 years ago by Quozl

Interim results of testing build 3 alone, with 60 second runin, over about 12 hours:

  • XO-4 B1 SKU292, 450 cycles pass,
  • XO-4 C2 SKU306, 461 cycles pass.

So I don't agree that build 3 is where this problem first appears.

(I didn't end up mixing kernels and builds, as the build 3 kernel would not shut down properly on the build 7 user space.)

comment:21 Changed 4 years ago by dsd

wad's testing also suggests that build 3 is not the first bad build, 3 and 5 seem stable for him as well. He's continuing the bisection.

Quozl/Richard, could you start testing the latest build with all runin scripts disabled, then enable them one by one until it goes unstable, or something like that?

comment:22 Changed 4 years ago by wad

First, an embarrassing clarification that earlier in this ticket I meant camera LED when I said mic LED.

Testing with os3 and os5 for this problem was negative. Testing os6 resulted in immediate hangs on the first suspend on all laptops. This could be fixed by blacklisting btmrvl, after which all units seemed to run fine.

Testing os8 resulted in all test units (two SKU306 and one SKU301) hanging after around 300 to 550 suspend/resume cycles (power LED on, camera LED on/off, mic LED off), which is more than normal. Perhaps this isn't the same problem... A console log from one of the units is attached.

Changed 4 years ago by wad

Changed 4 years ago by shep

page fault in kworker after 12 hours of aggressive runin, os8 on XO-4

Changed 4 years ago by shep

so the crash I got reminds me of

comment:23 Changed 4 years ago by shep

The crash I got reminds me of the stack corruption bug of #12197. The other logs attached by rsmith and wad do not, although suffering from #12197 did include lots of random, inexplicable failures.

Here is my crash, together with the line that preceded it:

[27206.848127] Bluetooth: vendor=0x2df, device=0x911a, class=255, fn=2
[27207.704365] Unable to handle kernel paging request at virtual address 03033f04

And from an earlier resume, so we can see what the next line would normally be:

[27190.797931] Bluetooth: vendor=0x2df, device=0x911a, class=255, fn=2
[27191.873034] mwifiex_sdio mmc0:0001:1: WLAN FW already running! Skip FW dnld

So the prime suspects for my crash are these modules:

btmrvl_sdio btmrvl mwifiex_sdio mwifiex 
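The comparison above (what follows the Bluetooth probe line in a crashing resume versus a normal one) can be repeated across a whole log. The following is a minimal sketch, assuming the probe message always contains `Bluetooth: vendor=` as in the excerpts. The helper name `after_bt_probe` is hypothetical.

```shell
#!/bin/sh
# Sketch only: print the log line that follows each Bluetooth probe message,
# so a crashing resume can be compared with a normal one.
# after_bt_probe is a hypothetical helper name.
after_bt_probe() {
    awk '
        /Bluetooth: vendor=/ { want = 1; next }  # remember the probe line
        want { print; want = 0 }                 # emit the line right after it
    ' "$@"
}
```

Run against a saved console log, most output lines should be the usual mwifiex_sdio message. Anything else marks a resume worth a closer look.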

I have restarted runin on this machine, we'll see what happens.

When I fixed #12197 I did look around a little bit within mwifiex for any similar misuse of the stack and didn't find any others. (But I could have missed one.) I never did look at the btmrvl and btmrvl_sdio for such similar bugs.

comment:24 follow-up: Changed 4 years ago by dsd

Thanks Tim. wad's results today (build 7 is the first bad one?) also point towards bluetooth being a likely culprit.

I did manage to reproduce a hang with runin on build 8 on my C1 today. No serial, but I assume it's the same thing.

I will run it over the weekend with build 8 with bluetooth modules deleted from the disk. If anyone else has machines free, it would be worth repeating this test.

Changed 4 years ago by shep

another kworker page fault crash shortly after bluetooth and mwifiex activity on resume

Changed 4 years ago by shep

this one reminds me very much of #12197

comment:25 Changed 4 years ago by Quozl

Build 8, with only the btmrvl_sdio module deleted from filesystem, and with 60 second runin (tags TS rnin, BT 60):

comment:26 Changed 4 years ago by Quozl

Build 8, with both btmrvl and btmrvl_sdio modules deleted from filesystem, and with 60 second runin (tags TS rnin, BT 60) hung on XO-4 B1 SKU292 with no panic message, and a corrupted graphics display. Console output: http://dev.laptop.org/~quozl/z/1Uie0o.txt Photograph to be attached.

Changed 4 years ago by Quozl

XO-4 B1 SKU292 display, close-up, using USB microscope at six inch focus

comment:27 Changed 4 years ago by Quozl

Build 8, with both btmrvl and btmrvl_sdio modules deleted from filesystem, and with 60 second runin (tags TS rnin, BT 60):

I think this excludes btmrvl and btmrvl_sdio from consideration as direct cause.

comment:28 Changed 4 years ago by Quozl

Build 8, with both btmrvl and btmrvl_sdio modules deleted from filesystem, with touch /runin/no-wireless, and with 60 second runin (tags TS rnin, BT 60): the two systems have passed 1621 cycles.

I think this suggests ifconfig, iw, mwifiex and scan results as causes.

comment:29 Changed 4 years ago by Quozl

Regarding the above results, when BT is 60 and normal suspend timing is used, there is never enough time (20 minutes) for a suspend cycle to be started. So these crashes might be considered unrelated to suspend and resume.

comment:30 in reply to: ↑ 24 Changed 4 years ago by dsd

Replying to dsd:

Thanks Tim. wad's results today (build 7 is the first bad one?) also point towards bluetooth being a likely culprit.

I did manage to reproduce a hang with runin on build 8 on my C1 today. No serial, but I assume it's the same thing.

I will run it over the weekend with build 8 with bluetooth modules deleted from the disk.

The weekend test crashed as well in the same way, which suggests that it is not bluetooth-related.

comment:31 Changed 4 years ago by Quozl

Build 8, with all modules present, with touch /runin/no-wireless, and with 60 second runin (tags TS rnin, BT 60): has run without problem for 26 hours on XO-4 B1 SKU292.

This suggests the actions in /runin/runin-wlan are helpful in reproducing the problem. These actions are in a loop run constantly, with heavy scheduling pressure:

  • ifconfig eth0 down
  • sleep 2
  • ifconfig eth0 up
  • sleep 2
  • iw eth0 scan freq 2437
  • sleep 10

These actions themselves don't reproduce the problem when used on an idle system.
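For reference, the cycle listed above can be written out as a small script. This is a sketch of the listed steps only, not the actual /runin/runin-wlan script. The `RUN`, `IFACE` and `CYCLES` variables and the function names are assumptions added for illustration. Setting `RUN=echo` prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the runin-wlan cycle described above; not the real script.
# RUN is a hypothetical hook: RUN=echo prints commands instead of running them.
RUN=${RUN:-}
IFACE=${IFACE:-eth0}

# One down/up/scan cycle, with the sleeps listed in the comment above.
wlan_cycle() {
    $RUN ifconfig "$IFACE" down
    $RUN sleep 2
    $RUN ifconfig "$IFACE" up
    $RUN sleep 2
    $RUN iw "$IFACE" scan freq 2437
    $RUN sleep 10
}

# Repeat the cycle CYCLES times (hypothetical knob; runin loops constantly).
wlan_loop() {
    n=0
    while [ "$n" -lt "${CYCLES:-1}" ]; do
        wlan_cycle
        n=$((n + 1))
    done
}
```

As noted, running this alone on an idle system did not reproduce the crash. The other runin tests supply the scheduling pressure.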

comment:32 Changed 4 years ago by dsd

I confirm that wireless is the suspect item, based on a similar test: I re-ran the same test that I ran at the weekend (XO-4 C2, build 8, bluetooth modules deleted, regular runin - this crashed) with the additional change of removing the mwifiex modules from disk, and it has survived 23 hours.

I also ran an overnight test on my XO-4 B1 (8686), runin unmodified, and it survived.

I will now test builds 6 and 7 (with bluetooth disabled but 8787 wifi enabled) to confirm the earlier suspicion that build 6 works OK even with wireless, and build 7 is the first crashy build.

comment:33 Changed 4 years ago by dsd

Same test procedure. Build 7 crashed, build 6 as well. Build 5 has been running 22 hours without crashing.

comment:34 Changed 4 years ago by dsd

I think the first bad commit is 84c2bb (this one crashed on both my test XOs), and the previous commit 5057bba is good (pending an overnight test, running now)

comment:35 Changed 4 years ago by Quozl

Build 8, with all modules present, with 60 second runin (tags TS rnin, BT 60), and with no-sdwrite and no-fscheck flag files, hung on XO-4 B1 SKU292 with no panic message, and a corrupted graphics display, the same as seen earlier in Comment 26. Console log: http://dev.laptop.org/~quozl/z/1UkQjm.txt

The test was restarted. Some hours later, a panic: http://dev.laptop.org/~quozl/z/1UkQlL.txt

comment:36 Changed 4 years ago by dsd

Overnight test passed, 84c2bb7effacced073495e823270e6d6323aa3e0 is the first bad commit that introduces the instability that I can reproduce here.

comment:37 Changed 4 years ago by dsd

I found a bug in the codepath triggered by that commit - some memory was being used uninitialized, causing a huge memcpy() to corrupt memory. Fixed in arm-3.5 e7f43ffecaf34d212cd3b536ca7e564a77c84f50

comment:38 Changed 4 years ago by dsd

This new kernel on build 8 otherwise unmodified survived the overnight test on 2 XOs.

comment:39 Changed 4 years ago by Quozl

Same here, total test time 54 hours on two XO-4.

comment:40 Changed 4 years ago by dsd

  • Action Needed changed from diagnose to add to build
  • Cc dsd removed
  • Component changed from not assigned to kernel
  • Owner set to dsd
  • Summary changed from XO-4 13.2.0 build 7 runin hang to 8787 wifi multicast list management corrupts memory

Thanks, ready for the next build then.

comment:41 Changed 4 years ago by wad

Sorry to mention this, but I captured the attached log today from a C2 laptop running 13.2.0 os9.

It appears to be the same problem.

Changed 4 years ago by wad

log from hang running os9

comment:42 Changed 4 years ago by Quozl

Yes, looks like what I've opened #12701 for.

comment:43 Changed 4 years ago by dsd

  • Resolution set to fixed
  • Status changed from new to closed

This issue (corruption due to multicast list stuff) has not been seen again after a lot of testing. The patch is now upstream too.
