Ticket #1905 (closed defect: fixed)

Opened 7 years ago

Last modified 4 years ago

Field Return: flash corruption - OpenFirmware complaining of 'unknown node type 2006'.

Reported by: dwmw2 Owned by: wad
Priority: high Milestone: 8.2.0 (was Update.2)
Component: hardware Version:
Keywords: Cc: wmb@…, dwmw2, jg, gary, Luna.Huang@…, Elvis.Wu@…
Action Needed: Verified: no
Deployments affected: Blocked By:
Blocking:

Description (last modified by kimquirk) (diff)

SHF70500388

A B2 machine was handed to me which failed to boot from NAND, with OpenFirmware complaining of 'unknown node type 2006'.

This is a somewhat bogus diagnostic message from OpenFirmware. It _does_ understand the node type 0x2006, which is a summary node. It's just that these ones have bad CRCs.

There seems to have been corruption on the write path, between CPU, RAM and CAFÉ. An example...

01fdb310  30 17 eb 15 21 00 00 3b  85 19 01 e0 36 00 00 00  |0...!..;....6...|
01fdb320  a4 e1 55 df 60 05 00 80  3d 06 00 00 3f 06 00 00  |..U.`...=...?...|
01fdb330  a1 a5 0a 46 0e 08 00 00  06 7e be ae 18 18 99 b3  |...F.....~......|
01fdb340  70 69 6e 6b 5f 72 6f 75  6e 64 2e 67 69 66 ff ff  |pink_round.gif..|

This provokes the following report from the kernel:

JFFS2 notice: (2554) read_direntry: header CRC failed on dirent node at 0x1fdb318: read 0xaebe7e06, calculated 0x1432b5ee

The parent inode value of 0x8000560 looks very suspicious. Flipping the msb of the byte at 01fdb327 back to a more reasonable 0x00 makes the crc32 match what's on the flash.

There are no ECC errors reported -- what's on the flash seems to be what reached the CAFÉ in the DMA transfer when this block was being written. So this doesn't seem to be an error between CAFÉ and NAND. And the crc32 seems sane too, so it doesn't seem likely that it's memory corruption or program error. I suspect hardware.

I'll look at other nodes (there are many broken ones) and see if there's a pattern to the corruption.

Attachments

trac-1905-messages.txt (320.6 kB) - added by dwmw2 7 years ago.
kernel output
nand bbt date.txt (401 bytes) - added by Luna 7 years ago.
problem1-1.JPG (271.4 kB) - added by Luna 7 years ago.
problem1-2.JPG (0.8 MB) - added by Luna 7 years ago.
problem2chip1-1.JPG (281.5 kB) - added by Luna 7 years ago.
problem2chip2-1.JPG (279.8 kB) - added by Luna 7 years ago.
problem2chip1-2.JPG (284.4 kB) - added by Luna 7 years ago.
problem2chip2-2.JPG (275.1 kB) - added by Luna 7 years ago.
problem2chip2-3.JPG (258.2 kB) - added by Luna 7 years ago.
wbuf-verify-panic.patch (4.4 kB) - added by dwmw2 7 years ago.
Patch (against older kernel) to make JFFS2 verify (most) NAND writes and panic on compare failure

Change History

  Changed 7 years ago by dwmw2

02159b20  85 19 02 e0 44 10 00 80  6d 58 d1 84 21 07 00 00  |....D...mX..!...|

JFFS2 notice: (2554) jffs2_get_inode_nodes: Node header CRC failed at 0x2159b20. {1985,e002,80001044,84d1586d}

Change the length from 0x80001044 to 0x1044 and the CRC works again...

  Changed 7 years ago by dwmw2

0298a2e0  6f 26 dc 27 16 ed 02 12  16 79 02 9a 85 19 02 e0  |o&.'.....y......|
0298a2f0  44 10 00 00 6d 58 d1 84  8a 09 00 00 16 00 00 00  |D...mX..........|
0298a300  fd 81 00 00 f4 01 f4 01  8e 56 02 00 a0 a5 0a 46  |.........V.....F|
0298a310  a0 a5 0a 46 a0 a5 0a 46  00 40 01 00 00 10 00 00  |...F...F.@......|
0298a320  00 10 00 00 00 00 00 80  0f d6 39 fe 3c 4e 01 39  |..........9.<N.9|

JFFS2 notice: (2554) read_dnode: node CRC failed on dnode at 0x298a2ec: read 0x39014e3c, calculated 0xd4b9cd1c

Flip byte at 0x0298a327 to zero to make it work. Again it's the most significant bit of a byte at 0xXXXXXXX7.

  Changed 7 years ago by dwmw2

Reading the image from the flash twice results in identical images -- if this is corruption happening on _read_, on DMA from the CAFÉ to the host memory, then it's 100% repeatable. I suspect it happened on write.

  Changed 7 years ago by dwmw2

Offending image file uploading to http://dev.laptop.org/~dwmw2/trac-1905-image.bin md5sum dffb1eee5b45d48b895e8b1b9efe06ed

Changed 7 years ago by dwmw2

kernel output

  Changed 7 years ago by dwmw2

  Changed 7 years ago by dwmw2

  • cc wmb@… added
  • owner changed from mlj to wad

  Changed 7 years ago by wad

  • status changed from new to assigned

Damn. So the working assumption is that we have occasional bit rot on PCI writes ?

Can we have more identifying information about the affected laptop (serial num., model B2-?, etc.)

As the PCI interface was redone for B3, in particular the bus master for writes to the CaFE is different silicon and the layout also changed, I suggest ensuring that we do enough testing to verify that this is not a problem on B3 and B4 machines, and listing this as a possible weak spot on B2s.

  Changed 7 years ago by wmb@…

Actually, it's not necessarily PCI writes. The corruption could just as easily occur in memory. The JFFS2 code writes the information into a memory-based write buffer before it is DMA-ed out to the NAND FLASH.

The failures that have been analyzed so far all show the same bit position within an 8-byte chunk. The memory array is 64 bits wide. If additional analysis continues to show that same bit error position, the memory hypothesis is stronger than the PCI hypothesis.

  Changed 7 years ago by dwmw2

Picking the ones that are dead easy to find the error by hand... these all happen to the msb of a byte at 0xXXX27, getting flipped from 0 to 1:

 JFFS2 notice: (2554) jffs2_get_inode_nodes: Node header CRC failed at 0x530b20. {1985,e002,8000022b,f836c826}
 JFFS2 notice: (2554) jffs2_get_inode_nodes: Node header CRC failed at 0x2159b20. {1985,e002,80001044,84d1586d}
 JFFS2 notice: (2554) jffs2_get_inode_nodes: Node header CRC failed at 0x267eb20. {1985,e002,8000019a,5fbe5f80}
 JFFS2 notice: (2554) jffs2_get_inode_nodes: Node header CRC failed at 0x4982320. {1985,e002,80000d2c,6ebdb7a2}
 JFFS2 notice: (2554) jffs2_get_inode_nodes: Node header CRC failed at 0x53dc320. {1985,e002,800007a1,7c195c92}
 JFFS2 notice: (2554) jffs2_get_inode_nodes: Node header CRC failed at 0x9677b20. {1985,e002,800005fd,fe4638e6}
 JFFS2 notice: (2554) jffs2_get_inode_nodes: Node header CRC failed at 0xa6c0b20. {1985,e001,80000038,3f8a9697}
 JFFS2 notice: (2554) jffs2_get_inode_nodes: Node header CRC failed at 0xb7c8b1c. {1985,e002,00000649,ee105f72}
 JFFS2 notice: (2554) jffs2_get_inode_nodes: Node header CRC failed at 0xe63ab20. {1985,e002,80000628,ed8acf14}
 JFFS2 notice: (2554) jffs2_get_inode_nodes: Node header CRC failed at 0xef3cb20. {1985,e002,80000893,2d672b85}

The one at 0xb7c8b1c should have a CRC of 0x6e105f72. I haven't checked the rest of the CRCs but the bogus length field (0x80000xxx) is obvious.

  Changed 7 years ago by dwmw2

B2-7 SHF70500388

  Changed 7 years ago by dwmw2

Writing a small tool to check the summary nodes.... the one with 'Summary node is corrupt' got the top bit of the byte at 0x1027eb27 set. For the ones with 'Summary node data is corrupt', ...

Node at 0337ea0c: must flip top bit of byte at 0337eb27
Node at 0583e888: must flip top bit of byte at 0583eb27
Node at 05f1f174: must flip top bit of byte at 05f1f327
Node at 0751f03c: must flip top bit of byte at 0751f327
Node at 0783eec8: must flip top bit of byte at 0783f327
Node at 07f1efbc: must flip top bit of byte at 07f1f327
Node at 0835f618: must flip top bit of byte at 0835fb27
Node at 0971ed08: must flip top bit of byte at 0971f327
Node at 09c1f010: must flip top bit of byte at 09c1f327
Node at 0a09e9f0: must flip top bit of byte at 0a09eb27
Node at 0a61f0f8: must flip top bit of byte at 0a61f327
Node at 0b6bedbc: must flip top bit of byte at 0b6bf327
Node at 0c09f078: must flip top bit of byte at 0c09f327
Node at 0cd1f244: must flip top bit of byte at 0cd1f327
Node at 0d49ef38: must flip top bit of byte at 0d49f327
Node at 0e97ea8c: must flip top bit of byte at 0e97fb27
Node at 0ee5f12c: must flip top bit of byte at 0ee5f327
Node at 0f23df48: must flip top bit of byte at 0f23eb27
Node at 0fa1dd78: must flip top bit of byte at 0fa1e327
Node at 1003eaf0: must flip top bit of byte at 1003eb27
Node at 100be97c: must flip top bit of byte at 100beb27

  Changed 7 years ago by dwmw2

So far, they're all in the same byte (0x327) of the NAND page. Which means it's always the same byte in the pre-allocated DMA buffer which we're using in the CAFÉ driver.

  Changed 7 years ago by dwmw2

I wrote a tool to check data nodes too (after fixing the bug which prints the warning messages with the wrong offsets). All 1059 data nodes I checked have the same bitflip at offset 0x327 within the page.

  Changed 7 years ago by wmb@…

Perhaps run memtest on that machine?

  Changed 7 years ago by jg

  • priority changed from normal to blocker
  • milestone changed from Untriaged to Trial-2

memtest seems like a really good idea....

  Changed 7 years ago by jg

  • summary changed from flash corruption to flash corruption - OpenFirmware complaining of 'unknown node type 2006'.

  Changed 7 years ago by kimquirk

This is a duplicate of #1843 (same B2 serial number).

Leaving this one open; closing the other one.

  Changed 7 years ago by kimquirk

  • description modified (diff)
  • summary changed from flash corruption - OpenFirmware complaining of 'unknown node type 2006'. to Field Return: flash corruption - OpenFirmware complaining of 'unknown node type 2006'.

adding 'field return' to name

  Changed 7 years ago by dwmw2

cjb has a B4 which shows similar symptoms. In this case I think it was at byte 0x72f within the page, and the least significant bit got flipped from 1 to 0.

-000720 b2 fc a6 05 f0 fb 30 c2 72 63 4b cd be 95 24 6e
+000720 b2 fc a6 05 f0 fb 30 c2 72 63 4b cd be 95 24 6f

  Changed 7 years ago by dwmw2

Might be useful to turn on CONFIG_MTD_NAND_VERIFY_WRITE in the kernel. That should catch it if it's DMA errors, but not if it's happening in RAM after the CRC is generated by JFFS2.

  Changed 7 years ago by cjb

The two machines involved are now in jg's office.

  Changed 7 years ago by wmb@…

We need to find a synthetic test case that will recreate the problem

The problem is not in JFFS2 per se, but rather seems to occur in the memory buffer that is used to write blocks of data into NAND FLASH. Therefore, it should be possible to construct a test case that does not involve JFFS2 itself, thus opening up the list of people who could work on it.

Suggestions for making a test case:

1) Boot from USB and unmount the NAND FLASH so that JFFS2 is not accessing NAND.

2) Using raw access to mtd0, locate an erase block that does not currently contain any JFFS2 data (i.e. an erased block containing all 0xff's).

3) Fill a 2K (the NAND FLASH page size) memory buffer with pseudo-random data using a simple random number generator that can be reseeded to recreate the last sequence. For example,

   /* Linear congruence PRNG */
   seed = (seed * 1103515245 + 12345) & 0x7fffffff;

or

   /* 32-bit maximal period Galois Linear Feedback Shift Register */
   usigned int seed;
   seed = (seed >> 1) ^ (-(signed int)(seed & 1) & 0xd0000001u); // taps 32 31 29 1

4) Write the buffer to a page within the selected NAND FLASH erase block via the mtd0 driver.

5) Read back the data and check it against the pseudo-random sequence (regenerate the sequence from the same starting seed value in case corruption occurs in the top-level memory buffer). Do whatever is necessary to make sure that the data goes all the way out to NAND FLASH and is read back from there, lest internal caching affect the test results.

6) Repeat with different pages in the same 128K erase block until the erase block is filled up. Use different seed values for each page test. When the erase block is filled up, erase it and start over.

7) Do this under different system loads, different environmental conditions, etc, in hopes of provoking a failure.

  Changed 7 years ago by dwmw2

I've added a CONFIG_JFFS2_FS_WBUF_VERIFY option to the kernel, which will read back the page after each time we write the page-buffer (pages which are directly written completely because a single node is large enough to completely fill the page won't get the same treatment). It will attempt to recover by writing the affected nodes out again elsewhere.

This will catch errors introduced in DMA transfers, but not if the error is just corruption in the RAM itself. But if we were seeing RAM corruption I'd have expected it to have manifested itself in other ways too -- including the occasional node with a CRC32 matching the corrupted version of the contents instead of the original.

  Changed 7 years ago by dwmw2

I can put in code to re-check the CRC32 in the affected nodes, too, if necessary.

Running integck from the mtd-utils git tree should probably show the problem fairly quickly if it's happening.

  Changed 7 years ago by dwmw2

  Changed 7 years ago by jg

  • cc arnold.kao@…, matt.huang@… added

OK, I've added Arnold and Matt to the cc list since this is smelling like it might be a subtle hardware design issue; if so, with memory or flash, since it crosses B2's and B4's.

We've sequestered the two units for further investigation: we'll identify what parts are in the two units and see if the same RAM parts are in use, and get that data in here today.

  Changed 7 years ago by jg

  • cc dwmw2 added

  Changed 7 years ago by Bryan.Ma

  • cc jg, gary, luna added; arnold.kao@…, matt.huang@… removed

We also have a unit with the same symptom,B4 machine,Q2C18,513 build. After reflashing NAND,it looks healthy; with the number of reboot incresing and stuffing it with camera video, the complaints boomed from 5 times to numerous pages as well, but you still could boot. At last through several rebootings, the complaints flooded and OFW freezed while lefting the booting prompt sentence alone.

But the complaints stopped showing up after we change build to 529. One-day long test didn't make the message appear anymore.

Anything( jffs2 modification? ) differentiate between builds cause this ?

  Changed 7 years ago by kimquirk

Bryan, Can you put the serial number of the laptop you have that showed this problem.

  Changed 7 years ago by Bryan.Ma

The B4 unit(in fact,only the MainBoard) was shipped from Shanghai factory,
so I only have the serial number on MB: "QTFICA72300115".

Another observation we found:
1)Switch NAND flash from problem unit(unitA) to another healthy unit(UnitB), the symptom still show on unitB.
2)Reflash NAND with *the same* build 513, the symptom disappear at first,then gradually increasing until unitB fail to boot under OFW.
The methods we use to reflash are:

-select /nandflash
-scrub
-copy-nand nand513.img

If the cause for the complaints come from DMA transferring(or RAM corruption?),

we're confused why they're still there after swiching NAND to another well unit ?

  Changed 7 years ago by jg

  • milestone changed from Trial-2 to Trial-3

What's the next step on chasing this, please?

  Changed 7 years ago by gnu

Another useful test would be to have the kernel check for the bad 0x80 bit in the length field upon read-back. If it exists, then go back and see if the original memory buffer contains the bad bit. If it does not, you probably have a transient hardware problem (e.g. a RAM bit reads back wrongly sometimes, or a bus corruption, or a signal that's moving during the setup/hold time).

Has anyone considered the possibility of a bit-spreader bug in the kernel (i.e. software somewhere that writes to this byte when it shouldn't, through a bad pointer)? If there is a way to run this system under a logic analyzer or hardware monitor, trigger on accesses to this memory buffer location with 0x80 bit set. The first time it triggers, you can catch either a read (i.e. transient RAM or bus failure) or a write (bit spreader) that way. Or, change the code very slightly to use a different part of the RAM as the copy-buffer, and see if the problem persists.

Did I see in Bryan Ma's comment that moving the flash chips from the failing unit to a good unit caused the good unit to start failing? If he meant they desoldered the chips and physically moved them, that looks like a strong piece of evidence that it isn't a RAM or bus or bit-spreader problem; perhaps an incompatibility between CAFE and the NAND chips on timing or something.

Changed 7 years ago by Luna

follow-ups: ↓ 35 ↓ 36   Changed 7 years ago by Luna

  • cc Luna.Huang@…, Elvis.Wu@… added; luna removed

Dear David,

After talking over with Elvis, I would like to tell you the results as follows. Problem 1 is known by you and Problem 2 is the test I did.

According to the test results, I assume there may be some problem with nand flash erasing when the action is executed under OFW.

Please feel free to tell me if there is any problem.

Thank you. Luna ============================================================

Problem 1: ( Tested on the same CL1 )

If I execute command " nandwrite -p /dev/mtd" in OS, the CL1 will hang up in the process of nand write. Moreover, if I execute "copy-nand" under OFW, the nandwrite action keeps going even there is an error message (Refer to the attached:problem1-1, problem1-2.) and the machine can go into sugar successfully. But it will show a message: "Unsupported node type 2006" after reboot several times.

I think the bad blocks are not marked well so it gets chance to write into the blocks which have problem

Problem 2: ( Use two new Nand flash chips on the same machine to test)

I used two methods for the first chip:

(1) Put the chip into the ST nand flash test machine which comes from ST Cor. to get the initial bad block table which has 16 initial bad blocks.

(2) Solder the chip on the board first. Boot OLPC to OFW and execute command "select /nand " "scrub" "test /nandflash" in order.

It comes out 5 bad blocks only to skip and the BBTables is set on the latest two blocks (0x1ffe 0x1fff)

(3) Then I move the chip into the ST nand flash test machine and scanned it again.

There are 7 bad blocks in the chip. You can see the attachment (problem2chip1-1,problem2chip1-2) to get the bad address.

The second chip: (attachment: problem2chip2-1 ~ problem2chip2-3)

(1) Put another new chip into the ST nand flash test machine to get the initial bad block table which has one bad block (0x069e).

(2) Solder the chip on the board first. Boot OLPC to OFW and execute command "test /nandflash". It comes out 1 bad block (0x069e) and BBTable is also set on the latest two blocks

(3) Then I boot from USB. Execute command "flash_eraseall -j /dev/mtd0" to get the result with four more bad blocks (0x3ff80000 0x3ffa0000 0x3ffc000 0x3ffe000).

(4) Reboot OLPC to OFW to execute command "test /nandflash". Get one bad block (0x069e).

(5) Execute command "select /nand " "scrub" "test /nandflash" in order. It comes out no bad block and BBTable is set on the latest two blocks.

According to the test results, I assume there may be some problem with nand flash erasing when the action is executed under OFW.

Please feel free to tell me if there is any problem.

Thank you. Luna

Changed 7 years ago by Luna

Changed 7 years ago by Luna

Changed 7 years ago by Luna

Changed 7 years ago by Luna

Changed 7 years ago by Luna

Changed 7 years ago by Luna

Changed 7 years ago by Luna

  Changed 7 years ago by Luna

  • owner changed from wad to dwmw2
  • status changed from assigned to new
  • component changed from hardware to kernel

in reply to: ↑ 33   Changed 7 years ago by dwmw2

Replying to Luna:

Problem 1: ( Tested on the same CL1 ) If I execute command " nandwrite -p /dev/mtd" in OS, the CL1 will hang up in the process of nand write.

When we say 'hang up', we normally mean that the computer stops responding. Your problem1-1.JPG doesn't show that -- it just shows that the nandwrite program encountered an error and stopped. The kernel returned an error when asked to write a block of data to the flash. If you type 'dmesg' at this point, you might see warning messages from the kernel, describing the failure. Do you have the CONFIG_MTD_NAND_VERIFY_WRITE configuration option enabled?

If this persists, we can make the kernel a little more verbose about such errors. Perhaps the nandwrite program ought to be marking the offending block as 'bad' when the write fails.

Moreover, if I execute "copy-nand" under OFW, the nandwrite action keeps going even there is an error message (Refer to the attached:problem1-1, problem1-2.) and the machine can go into sugar successfully. But it will show a message: "Unsupported node type 2006" after reboot several times.

That is a misleading error message. It _probably_ means you have similar corruption problems, but it's not clear. Can you verify whether the ECC data are correct for the affected blocks?

I think the bad blocks are not marked well so it gets chance to write into the blocks which have problem

Maybe. But if the ECC data on the flash match the data, that means that the corruption did not happen on the flash -- it happened in or before the CAFÉ chip. And if not, then you're describing a different problem which should be in a different trac report.

Please could you check the status of the ECC on the affected blocks, and the nature of the corruption (you could just let me have a copy of the image and the kernel's output when mounting it and I'll do the latter).

in reply to: ↑ 33   Changed 7 years ago by dwmw2

  • component changed from kernel to hardware

Replying to Luna:

Problem 2: ( Use two new Nand flash chips on the same machine to test) I used two methods for the first chip: (1) Put the chip into the ST nand flash test machine which comes from ST Cor. to get the initial bad block table which has 16 initial bad blocks. (2) Solder the chip on the board first. Boot OLPC to OFW and execute command "select /nand " "scrub" "test /nandflash" in order. It comes out 5 bad blocks only to skip and the BBTables is set on the latest two blocks (0x1ffe 0x1fff)

You should _never_ use 'scrub'. That destroys the factory-marked bad block information.

(3) Then I move the chip into the ST nand flash test machine and scanned it again. There are 7 bad blocks in the chip. You can see the attachment (problem2chip1-1,problem2chip1-2) to get the bad address.

Your second image shows only 5 bad blocks (and two blocks set aside for BBT). So only 5 of the original 16 bad blocks are still correctly marked bad.

The second chip: (attachment: problem2chip2-1 ~ problem2chip2-3) (1) Put another new chip into the ST nand flash test machine to get the initial bad block table which has one bad block (0x069e). (2) Solder the chip on the board first. Boot OLPC to OFW and execute command "test /nandflash". It comes out 1 bad block (0x069e) and BBTable is also set on the latest two blocks (3) Then I boot from USB. Execute command "flash_eraseall -j /dev/mtd0" to get the result with four more bad blocks (0x3ff80000 0x3ffa0000 0x3ffc000 0x3ffe000).

Two of those are the bad block table. I'm not entirely sure why four blocks end up being used, but it'll be related to the bad block table. I'll try to investigate. It isn't related to this trac issue.

(4) Reboot OLPC to OFW to execute command "test /nandflash". Get one bad block (0x069e). (5) Execute command "select /nand " "scrub" "test /nandflash" in order. It comes out no bad block and BBTable is set on the latest two blocks.

Again, you've lost information by using 'scrub'. Never do that.

  Changed 7 years ago by dwmw2

  • owner changed from dwmw2 to wad

  Changed 7 years ago by dwmw2

Can we read back from the CAFÉ chip's internal buffer? We could check that the contents match the intended write, after DMA.

Can we also try booting with 'cafe_nand.usedma=0' on a machine where this can be reproduced at will, and see if the problem persists?

Did we ever list the RAM chips used on the affected machines?

in reply to: ↑ description   Changed 7 years ago by shenki

NAND manufacturers:

B4 SHF72400015: ST

B2 SHF70500388: Hynix

  Changed 7 years ago by dwmw2

I've just been handed a machine in Quanta which seems to have a different problem. In this case, the kernel is reporting that it has corrected 1 symbol error in various eraseblocks. No JFFS2 errors are reported by the kernel -- all errors are being fixed up by the ECC engine, as $DEITY intended.

OpenFirmware (Q2C18) reports its strange 'Unsupported nodetype 2006' twice during the boot process. /openprom/built-on reports 0x013240d6, which I think means 2007-06-14, which would predate the commit which enabled error correction ( http://git.infradead.org/?p=openfirmware.git;a=commitdiff;h=8a6574ac38a80fbc70e206dc688d2f7e9ac71d8b )

If I update to the current OpenFirmware with all the tinfoil-hat stuff, is that going to break things when I'm booting from a USB disk (which needs extra delays added to make-root-hub-node) and testing a NAND flash which I keep erasing and running tests on?

  Changed 7 years ago by dwmw2

I made the driver a little more verbose when correcting errors. It now prints the address of the offending page, as well as the symbol/pattern used for correction. You can see two bitflips detected during the mount, which will be the two summary nodes that OpenFirmware complains about. Then a bunch of bitflips later. It's interesting that they all happen at exactly the same place in each page. It's possible that some of the affected eraseblocks were factory-marked bad blocks but that information was lost when the chip was scrubbed.

[  878.068408] PCI: Enabling device 0000:00:0c.0 (0000 -> 0002)
[  878.102660] NAND device: Manufacturer ID: 0x20, Chip ID: 0xd3 (ST Micro NAND 1GiB 3,3V 8-bit)
[  878.133624] Bad block table found at page 524224, version 0xFF
[  878.161872] Bad block table found at page 524160, version 0xFF
[  878.190770] nand_read_bbt: Bad block at 0x09840000
[  878.218219] nand_read_bbt: Bad block at 0x12080000
[  878.245393] nand_read_bbt: Bad block at 0x120a0000
[  878.271355] nand_read_bbt: Bad block at 0x120c0000
[  878.296731] nand_read_bbt: Bad block at 0x120e0000
[  878.321377] nand_read_bbt: Bad block at 0x16080000
[  878.345921] nand_read_bbt: Bad block at 0x22200000
[  878.370044] nand_read_bbt: Bad block at 0x39a00000
[  878.393511] nand_read_bbt: Bad block at 0x39a20000
[  899.025433] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102ff000:
[  899.054637] 0: flip 0x800 at symbol 325
[  899.082899] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 1035f000:
[  899.114419] 0: flip 0x800 at symbol 325
[  900.368069] Empty flash at 0x28322918 ends at 0x28323000
[  901.664253] JFFS2 notice: (1859) jffs2_build_xattr_subsystem: complete building xattr subsystem, 0 of xdatum (0 unchecked, 0 orphan) and 0 of xref (0 dead, 0 orphan) found.
[  902.588138] JFFS2 notice: (1860) check_node_data: wrong data CRC in data node at 0x283227a0: read 0x431af696, calculated 0x469d551b.
[  934.087726] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102dd000:
[  934.124532] 0: flip 0x800 at symbol 325
[  934.158535] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102dc000:
[  934.195107] 0: flip 0x800 at symbol 325
[  934.231519] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102d8000:
[  934.269202] 0: flip 0x800 at symbol 325
[  934.304006] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102d7000:
[  934.341661] 0: flip 0x800 at symbol 325
[  934.379178] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102d0000:
[  934.417368] 0: flip 0x800 at symbol 325
[  934.454991] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102cb000:
[  934.493625] 0: flip 0x800 at symbol 325
[  934.530579] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102c7000:
[  934.569010] 0: flip 0x800 at symbol 325
[  934.605295] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102c6000:
[  934.643827] 0: flip 0x800 at symbol 325
[  934.813066] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 100f4000:
[  934.851932] 0: flip 0x800 at symbol 325
[  934.894284] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 100e3000:
[  934.933133] 0: flip 0x800 at symbol 325
[  935.019049] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102de000:
[  935.058469] 0: flip 0x800 at symbol 325
[  935.095627] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102dd000:
[  935.135517] 0: flip 0x800 at symbol 325
[  935.172680] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102dc000:
[  935.213182] 0: flip 0x800 at symbol 325
[  935.252509] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102d7000:
[  935.293619] 0: flip 0x800 at symbol 325
[  935.333918] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102d0000:
[  935.375333] 0: flip 0x800 at symbol 325
[  935.415714] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102cb000:
[  935.456869] 0: flip 0x800 at symbol 325
[  935.496675] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 102c7000:
[  935.537875] 0: flip 0x800 at symbol 325
[  935.698860] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 100f4000:
[  935.740099] 0: flip 0x800 at symbol 325
[  935.784229] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 100e3000:
[  935.824951] 0: flip 0x800 at symbol 325
[  935.890081] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 10344000:
[  935.930296] 0: flip 0x800 at symbol 325
[  935.967753] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 10342000:
[  936.007351] 0: flip 0x800 at symbol 325
[  936.046013] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 10345000:
[  936.085402] 0: flip 0x800 at symbol 325
[  936.122182] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 10344000:
[  936.161786] 0: flip 0x800 at symbol 325
[  936.198831] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 10342000:
[  936.238541] 0: flip 0x800 at symbol 325
[  936.279001] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 10357000:
[  936.319079] 0: flip 0x800 at symbol 325
[  936.358746] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 1034e000:
[  936.398024] 0: flip 0x800 at symbol 325
[  936.435429] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 10357000:
[  936.474807] 0: flip 0x800 at symbol 325
[  936.514465] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 1034e000:
[  936.553789] 0: flip 0x800 at symbol 325
[  936.591552] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 1035b000:
[  936.630882] 0: flip 0x800 at symbol 325
[  936.669772] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 1035b000:
[  936.709144] 0: flip 0x800 at symbol 325
[  936.749096] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 1035f000:
[  936.787910] 0: flip 0x800 at symbol 325
[  936.955483] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors at 10433000:
[  936.994929] 0: flip 0x800 at symbol 325

I don't think this is the same as the original problem, although it's interesting that the pattern of bitflips is very similar. The ECC does catch these errors though, unlike the original report.

  Changed 7 years ago by dwmw2

I wrote a NAND test program which just erases each block, writes a random pattern to it and then reads it back. It fails quite quickly with DMA enabled, with the precisely the same failure mode. With CAFÉ DMA disabled, it doesn't seem to trigger (although I'll run the test for longer before declaring that it doesn't fail at all).

[root@localhost ~]# modprobe cafe_nand
[root@localhost ~]# ./nandtest /dev/mtd0
00f00000: checking...
compare failed. seed 2083805533
Byte 0x5327 is 92 should be 12
[root@localhost ~]# ./nandtest /dev/mtd0
027a0000: checking...
Bad block at 027c0000
07240000: checking...
compare failed. seed 753698683
Byte 0x5b27 is d1 should be 51
[root@localhost ~]# ./nandtest /dev/mtd0
027a0000: checking...
Bad block at 027c0000
02f00000: checking...
compare failed. seed 594221211
Byte 0x1db27 is e6 should be 66
[root@localhost ~]# rmmod cafe_nand
[root@localhost ~]# modprobe cafe_nand usedma=0
[root@localhost ~]# ./nandtest /dev/mtd0
027a0000: checking...
Bad block at 027c0000
0b000000: checking...
Bad block at 0b020000
0b100000: checking...
Bad block at 0b120000
0b200000: checking...
Bad block at 0b220000
0b300000: checking...
Bad block at 0b320000
0b400000: checking...
Bad block at 0b420000
0b500000: checking...
Bad block at 0b520000
0b600000: checking...
Bad block at 0b620000
0b700000: checking...
Bad block at 0b720000
0ee60000: checking...
Bad block at 0ee80000
121e0000: checking...
Bad block at 12200000
16440000: checking...
Bad block at 16460000
18360000: checking...
Bad block at 18380000
19c80000: writing... 

  Changed 7 years ago by dwmw2

The nandtest utilitity is now committed to the mtd-utils git tree. The problem definitely doesn't seem to show up if we refrain from using DMA for the NAND transfers. I haven't been able to reproduce it on CJB's B4 yet but will leave the test running overnight.

  Changed 7 years ago by dwmw2

On CJB's B4:

06c00000: checking...                                                           
compare failed. seed 860401109                                                  
Byte 0xbf2f is f7 should be f6      

  Changed 7 years ago by dwmw2

And again, this time with the kernel hacked to calculate a CRC of the dma buffer _after_ asking the CAFÉ to write.

compare failed. seed 378602249                                                  
Byte 0x13f2f is c9 should be c8 (w 904ec6c5 r 7645b119)                         
[root@olpc2 ~]# dmesg | egrep 904ec6c5\|7645b119                                
[12563.891411] page written, crc 904ec6c5                                       

  Changed 7 years ago by dwmw2

We always perform the DMA transfer to write to the flash chip immediately after memcpy() into the buffer. We suspected that there might be interaction between the CPU's dcache and the DMA. But adding a cache flush ('wbinvd') to the driver immediately before the DMA is triggered does not solve the problem.

  Changed 7 years ago by wmb@…

Uh, maybe the ECC isn't actually turned on in the read direction?

  Changed 7 years ago by dwmw2

The ECC has been well-tested, so I find that unlikely. Remember also that Quanta's machine was suffering only in OpenFirmware, where error correction wasn't enabled -- but Linux could read the device just fine (and reported ECC corrections while doing so). At the time I thought that was a different problem -- but if it's a problem with the internal SRAM of the CAFÉ, their machine could actually be exhibiting a different symptom of the _same_ problem.

I'll give a full dump including OOB area of an offending page though, and we can verify that ECC is working correctly...

The nandtest program just failed with this output:

215a0000: checking...
compare failed. seed 982916440
Byte 0xb72f is f7 should be f6

This means that the _intended_ data for the offending page are obtained by:

    srand(982916440);
    for (i=0; i<0xb000; i++) rand();
    for (i=0; i<0x800; i++) buf[i] = rand();
 /* ASSERT(crc32(-1, buf, 0x800) == 0x10358381); */

The contents of the ECC area are:

# nanddump -l 2048 -s $((0x215ab000)) -p /dev/mtd0 2>/dev/null  | grep OOB | head -1
  OOB Data: 64 dd 1d e2 2e b2 b4 05 2f 59 a6 e3 3a 3f ff ff

This matches what we get when we write the corrupted contents to flash. When the intended version is written to flash correctly, we get the following ECC data:

  OOB Data: d3 f1 4e 90 54 1f 60 e0 66 ec 93 0e eb 1f ff ff

follow-up: ↓ 50   Changed 7 years ago by dwmw2

At http://david.woodhou.se/olpc-nandtest.tar.gz there is a /boot directory suitable for putting onto a USB key. Any machine booted from this key will run the nandtest program for ten complete passes. It would be good to test a bunch of B4 machines to see how common this problem is.

in reply to: ↑ 49 ; follow-up: ↓ 57   Changed 7 years ago by RafaelOrtiz

Replying to dwmw2: Im testing http://david.woodhou.se/olpc-nandtest.tar.gz in a bunch of laptops, preliminary results are in here http://wiki.laptop.org/go/User:RafaelOrtiz/NandTest

no fatal errors found yet..

  Changed 7 years ago by dwmw2

The CAFÉ chip from the original offending B2 machine has been transplanted into a B4 where it continues to exhibit the same symptoms:

01020000: checking...
compare failed. seed 1377023367
Byte 0x1bb27 is 8c should be 0c

follow-ups: ↓ 53 ↓ 54   Changed 7 years ago by dwmw2

It may be possible to develop a non-destructive test, if we can write to and read from internal buffer of the NAND flash device rather than actually writing to the flash itself. That would allow units to be tested for this problem without using up flash erase cycles.

in reply to: ↑ 52   Changed 7 years ago by RafaelOrtiz

Replying to dwmw2:

That would be nice, it would make the testing easier and better.

in reply to: ↑ 52   Changed 7 years ago by wmb@…

Replying to dwmw2:

It may be possible to develop a non-destructive test, if we can write to and read from internal buffer of the NAND flash device rather than actually writing to the flash itself. That would allow units to be tested for this problem without using up flash erase cycles.

I tried for most of the afternoon today to write the internal buffer and read it back, with little success. The "random data output" command (05/e0) seems to work reasonably well, delivering data from the NAND chip's internal buffer into system memory.

I didn't have much luck with the "random data input" command (85/-). I tried to use it to put data into the buffer, then to use the "random data output" command to read it back. I didn't manage to get back anything approximating what I tried to write.

I tried using the 85 command after the 10/30 (read into memory) and 10/35 (read for copy-back, only into the internal buffer) sequences - still no luck. In the latter case, the 85 command changed the data that the subsequent 05/e0 sequence returned, but that data did not match what I thought I sent.

follow-up: ↓ 56   Changed 7 years ago by jg

From Gary Chiang:

The both failed part have same data code

88ALP01-TFJ1

A6609871.4

0651 AOP ES

TW

in reply to: ↑ 55   Changed 7 years ago by jg

Has a system gone back to Marvell for root cause analysis on an offending CAFE chip?

Replying to jg:

From Gary Chiang: The both failed part have same data code 88ALP01-TFJ1 A6609871.4 0651 AOP ES TW

in reply to: ↑ 50   Changed 7 years ago by RafaelOrtiz

Replying to RafaelOrtiz:

Replying to dwmw2: Im testing http://david.woodhou.se/olpc-nandtest.tar.gz in a bunch of laptops, preliminary results are in here http://wiki.laptop.org/go/User:RafaelOrtiz/NandTest no fatal errors found yet..

From my testing in numerous B4s and B2s this problem seams to be not common.

  Changed 7 years ago by dwmw2

  • milestone changed from Trial-3 to Production Start

(At least) one of the C1 machines also has this problem.

Changed 7 years ago by dwmw2

Patch (against older kernel) to make JFFS2 verify (most) NAND writes and panic on compare failure

  Changed 7 years ago by dwmw2

Quanta's run-in tests already write large amounts of data to the filesystem. I've attached a patch which can be added to the kernel they use for these tests. It reads back from the NAND after each write from JFFS2's internal write-buffer, and will panic the machine with a suitable error message if corruption is found.

  Changed 7 years ago by wad

  • status changed from new to assigned

Three chips showing the problem were sent back to Marvell. There are unconfirmed rumors that Marvell will be changing their ATE testing to detect the problem in chip manufacturing, saving us the effort of reworking the MB. We are seeking confirmation of this.

  Changed 7 years ago by wad

  • priority changed from blocker to high

Dropped priority of this problem, as we have a culprit and manufacturing tests in place to catch units exhibiting this problem.

  Changed 5 years ago by wmb@…

  • status changed from assigned to closed
  • resolution set to fixed

Last state change and comment by sunrise1 was spam. Re-closing

Note: See TracTickets for help on using tickets.