Opened 6 years ago

Last modified 6 years ago

#8615 new defect

nand test crashes 760

Reported by: pgf Owned by: dwmw2
Priority: blocker Milestone: 9.1.0-cancelled
Component: kernel Version: not specified
Keywords: cjbfor9.1.0 Cc: wad, pgf, dilinger, dwmw2, mbletsas
Blocked By: Blocking:
Deployments affected: Action Needed: diagnose
Verified: no

Description (last modified by gregorio)

wad has been running NAND tests on a set of laptops. After 12 hours, 5 of his laptops have crashed with JFFS2-related backtraces.

Screen photos attached.

Attachments (9)

dscn1176.jpg (924.1 KB) - added by pgf 6 years ago.
screenshot 1
dscn1177.jpg (813.5 KB) - added by pgf 6 years ago.
another screenshot
dscn1178.jpg (821.8 KB) - added by pgf 6 years ago.
jffs2.log (908.1 KB) - added by wad 6 years ago.
This is the serial console log from a laptop that recreated this bug
crash-log-92308.txt (258.0 KB) - added by wad 6 years ago.
Another console log from a machine showing this bug
jffs1-92908.log (65.8 KB) - added by wad 6 years ago.
Logs from a crash running dsaxena's debug kernel
jffs1-100108.log (26.7 KB) - added by wad 6 years ago.
Another kernel crash using dsaxena's debug kernel
x54-92908.log (76.3 KB) - added by wad 6 years ago.
Crash log from laptop running build 656 stock kernel
fix-gc-race.patch (975 bytes) - added by dwmw2 6 years ago.
Potential fix.

Change History (34)

Changed 6 years ago by pgf

screenshot 1

Changed 6 years ago by pgf

another screenshot

Changed 6 years ago by pgf

comment:1 Changed 6 years ago by pgf

comment:2 Changed 6 years ago by pgf

  • Summary changed from nand teste crashes 760 to nand test crashes 760

comment:3 Changed 6 years ago by dilinger

I notice that there are prior oopses. It's pretty important to have the first oops backtrace. Can you log this via serial console?

comment:5 Changed 6 years ago by pgf

  • Cc wad pgf added

yes, we'll set up the test with serial connected.

comment:6 Changed 6 years ago by dsaxena

  • Cc dilinger added

Changed 6 years ago by wad

This is the serial console log from a laptop that recreated this bug

comment:7 Changed 6 years ago by wad

The crash starts with:
BUG: unable to handle kernel paging request at 00100100

After running the same test overnight on three laptops, only one recreated the problem. There was something special about starting from a fresh image install that encouraged more laptops to show this bug sooner. I will reinstall a fresh image and try to recreate again.
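A note on the faulting address reported above: 00100100 is the value that 32-bit kernels of this era define as LIST_POISON1 in include/linux/poison.h, and list_del() writes it into the next pointer of an entry it has just unlinked. A fault at that address is therefore consistent with code following a list entry that was removed underneath it, though the logs alone do not prove that. The sketch below is a user-space imitation, not kernel code; struct list_head is re-declared locally, and stale_walker_next() is a hypothetical helper made up for illustration.

```c
#include <stdint.h>

/* Poison values as defined in include/linux/poison.h for kernels of
 * this era. Everything else in this file is an illustrative stand-in. */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlink entry and poison its pointers, as the kernel's list_del() does. */
static void list_del(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
}

/* Hypothetical helper: a walker is parked on an entry, the entry is
 * deleted (in the kernel, by another thread), and we return the address
 * the walker would follow next. The pointer is only inspected here,
 * never dereferenced. */
uintptr_t stale_walker_next(void)
{
    struct list_head head, block;
    struct list_head *cursor;

    list_init(&head);
    list_add(&block, &head);

    cursor = head.next;   /* walker is now standing on block */
    list_del(&block);     /* block is removed underneath it */

    return (uintptr_t)cursor->next;
}
```

Calling stale_walker_next() returns 0x00100100, the same address as in the oops. In the kernel the walker would go on to dereference that pointer, which is what produces the paging-request failure.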

comment:8 Changed 6 years ago by dsaxena

  • Cc dwmw added

Changed 6 years ago by wad

Another console log from a machine showing this bug

comment:9 Changed 6 years ago by wad

I just wasn't patient enough. Another of the machines running the test overnight with a serial console crashed. The console log is attached.

comment:10 Changed 6 years ago by dsaxena

Please install http://dev.laptop.org/~dsaxena/debuginfo-8467/kernel-2.6.25-20080925.1.olpc.f10b654367d7065.i586.rpm and rerun the test. We have debuginfo packages for this kernel, meaning that I can look at the crash dump and an objdump from the vmlinux file to start digging at this.

comment:11 Changed 6 years ago by gregorio

  • Action Needed changed from never set to diagnose
  • Description modified (diff)
  • Milestone changed from Not Triaged to 8.2.1

comment:12 Changed 6 years ago by wad

An update on the status of testing. I installed the new kernel RPM on three machines running with serial console logging, and have started them running. Since that first night when five machines crashed, there hasn't been a time when all five crashed overnight. Reinstalling the OS from scratch on two of these didn't trigger the crash again, so I don't know what triggered that first night.

I also had a machine w. build 656 running the same test, and it never crashed with the kernel oops, even after writing/erasing 100 GB to the disk. It did have other issues (JFFS2 mysteriously lost space, and the test failed for that reason).

comment:13 Changed 6 years ago by mbletsas

  • Cc mbletsas added

Changed 6 years ago by wad

Logs from a crash running dsaxena's debug kernel

comment:14 Changed 6 years ago by wad

Finally got a crash with the debug kernel.

And I have seen the same crash on the stock kernel in build 656, so this problem is definitely already out in the field.

Changed 6 years ago by wad

Another kernel crash using dsaxena's debug kernel

comment:15 Changed 6 years ago by wad

Both the laptop running 656 (w. stock kernel) and the one running 760 (w. dsaxena's debug kernel) crashed again overnight. Their JFFS2 filesystems seem to have gotten into a state which triggers this problem.

comment:16 follow-up: Changed 6 years ago by wad

[emailed from dsaxena]

What would be good is if you could boot from an SD card and do a dump of the flash device, since you think the FS is in a bad state.

Do you mean a save-nand of the built-in Flash? (That doesn't require me to build a bootable SD card.) Or did you have something else in mind? (dd | tar doesn't seem right for JFFS2 filesystems...)

Changed 6 years ago by wad

Crash log from laptop running build 656 stock kernel

comment:17 in reply to: ↑ 16 Changed 6 years ago by dsaxena

Replying to wad:

[emailed from dsaxena]

What would be good is if you could boot from an SD card and do a dump of the flash device, since you think the FS is in a bad state.

Do you mean a save-nand of the built-in Flash? (That doesn't require me to build a bootable SD card.) Or did you have something else in mind? (dd | tar doesn't seem right for JFFS2 filesystems...)

Yes, it looks like save-nand will give me what I want (a JFFS2 image).

comment:18 Changed 6 years ago by wad

You can find a save-nand image from laptop JFFS1 (the one that has crashed repeatedly) at:
http://dev.laptop.org/~wad/nand/crash/

The write throughput of a laptop running JFFS2 degrades greatly over time. The laptop gets to a state where data which could previously be written can no longer be written. Even after freeing up another 32 MB (only 20 MB was being written) so the test can proceed, writes remain very slow. For timings showing this happening to two different laptops, see:
http://dev.laptop.org/~wad/nand/alt/JFFS2/

comment:19 Changed 6 years ago by dwmw2

  • Cc dwmw2 added; dwmw removed

comment:20 Changed 6 years ago by dwmw2

  • Owner changed from dsaxena to dwmw2

Hm, we are calling the thread_should_wake() function from the GC thread without locking. And it goes wandering through the block lists.... no wonder it crashes :)
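To illustrate the kind of race described above: a predicate that walks a shared list is only safe if it runs under the same lock that list modifications take. The following is a user-space sketch with pthreads, not the attached patch and not JFFS2 code. The kernel would use whichever lock protects its block lists; struct fs_state, struct block, the threshold field and all function names here are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for an erase block on a dirty list. */
struct block {
    struct block *next;
    unsigned dirty_size;
};

/* Hypothetical stand-in for the filesystem state shared with the GC thread. */
struct fs_state {
    pthread_mutex_t lock;      /* guards dirty_list */
    struct block *dirty_list;
    unsigned threshold;        /* wake GC when total dirty space exceeds this */
};

void fs_init(struct fs_state *fs, unsigned threshold)
{
    pthread_mutex_init(&fs->lock, NULL);
    fs->dirty_list = NULL;
    fs->threshold = threshold;
}

/* Walks the list, so the caller must already hold fs->lock. */
static int should_wake_locked(const struct fs_state *fs)
{
    unsigned dirty = 0;

    for (const struct block *b = fs->dirty_list; b != NULL; b = b->next)
        dirty += b->dirty_size;
    return dirty > fs->threshold;
}

/* What the GC thread calls: takes the lock around the list walk. */
int gc_should_wake(struct fs_state *fs)
{
    int wake;

    pthread_mutex_lock(&fs->lock);
    wake = should_wake_locked(fs);
    pthread_mutex_unlock(&fs->lock);
    return wake;
}

/* List modifications take the same lock. */
void add_dirty_block(struct fs_state *fs, struct block *b)
{
    pthread_mutex_lock(&fs->lock);
    b->next = fs->dirty_list;
    fs->dirty_list = b;
    pthread_mutex_unlock(&fs->lock);
}

struct block *pop_dirty_block(struct fs_state *fs)
{
    struct block *b;

    pthread_mutex_lock(&fs->lock);
    b = fs->dirty_list;
    if (b != NULL) {
        fs->dirty_list = b->next;
        b->next = NULL;
    }
    pthread_mutex_unlock(&fs->lock);
    return b;
}
```

If gc_should_wake() called should_wake_locked() without taking the lock, a concurrent pop_dirty_block() could unlink the block the walker is standing on, which is the failure mode this comment points at.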

Changed 6 years ago by dwmw2

Potential fix.

comment:21 Changed 6 years ago by wad

The description of the tests required to trigger this has been moved to a publicly accessible page:

http://wiki.laptop.org/go/NAND_Testing

comment:22 Changed 6 years ago by dsaxena

  • spec_reviewed set to 0
  • spec_stage set to unknown

comment:23 Changed 6 years ago by wad

The preliminary report is that the patch seems to fix the problem. I've had five laptops with full Flash running the new kernel for over 48 hours now, and no kernel panics. I'll close out the ticket if another 48 hours pass without a problem.

comment:24 Changed 6 years ago by mstone-xmlrpc

  • Keywords cjbfor9.1.0 added
  • Milestone changed from 8.2.1 to 9.1.0

Pushing out to 9.1.0, per edmcnierney's request.

comment:25 Changed 6 years ago by dsaxena

I am going through all kernel bugs marked as 9.1 or future release and updating their status, next action, etc. in preparation for 9.1 bug scrubbing and future release planning.

The attached patch has not been committed upstream from what I can tell, but I'll go ahead and merge it into our 2.6.27 kernel. David, can you merge this upstream?
