Opened 3 years ago

Closed 3 years ago

#11220 closed task (fixed)

Probe ext4 corruption

Reported by: martin.langhoff Owned by: martin.langhoff
Priority: normal Milestone: 1.75-software
Component: distro Version: not specified
Keywords: Cc: jnettlet, dsd
Blocked By: Blocking:
Deployments affected: Action Needed: never set
Verified: no

Description

We've seen Jon Nettleton hit some disk corruption repeatedly on his development/test machine. Diagnosis of the problem led us to #11210, but it is unclear whether that is the cause.

This is one of our current risks -- it is a concern because we need to know soon whether this is a hardware issue (related to the eMMC parts) or not. We have hit corruption issues with ext4 in the past (#9513) and fell back to ext3. It is not clear, however, that ext4 is suspect: the whole 11.2.0 dev cycle was done under ext4 and no disk corruption incidents were reported AFAIK.

Jon seemed to hit it while:

  • developing, compiling,
  • using an ext SD card in the slot,
  • running a patched kernel and xorg
  • presumably crashing a lot

We need to consider action around this:

  • try to force the error -- I'll set up a test rig for this, applying unclean shutdowns on a couple of SKU198 units
  • keep our eyes open for disk corruption reports, especially in builds including the new gfx code
  • be prepared to switch back to ext3

Attachments (3)

fstorture.sh (1.2 KB) - added by martin.langhoff 3 years ago.
Run with 'install' then reboot
dmesg-ext2error-vs-asix2.log (44.3 KB) - added by martin.langhoff 3 years ago.
fstorture.2.sh (2.9 KB) - added by martin.langhoff 3 years ago.
Current fstorture.sh


Change History (9)

Changed 3 years ago by martin.langhoff

Run with 'install' then reboot

comment:1 Changed 3 years ago by martin.langhoff

Run

bash -x /media/myusb/fstorture.sh install

It will enable sysrq, copy itself to /usr/local/bin, add a call to itself to rc.local.
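
For readers without the attachment at hand, here is a minimal sketch of what those three install steps could look like. It is not the attached fstorture.sh: the function name, the optional root prefix, the rc.local location (/etc/rc.d/rc.local) and the "run" argument are all assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical sketch of an 'install' step; NOT the attached fstorture.sh.
# The function name, the root prefix, the rc.local path and the 'run'
# argument are assumptions made for illustration.

fstorture_install() {
    local self="$1"        # path to the script being installed
    local root="${2:-}"    # optional prefix; empty means the real filesystem
    local dest="$root/usr/local/bin/fstorture.sh"
    local rclocal="$root/etc/rc.d/rc.local"
    local sysrq="$root/proc/sys/kernel/sysrq"

    # Enable all sysrq functions so that 'b' (immediate reboot) is available.
    echo 1 > "$sysrq" || return 1

    # Copy the script into place and make it executable.
    mkdir -p "$(dirname "$dest")" "$(dirname "$rclocal")" || return 1
    cp "$self" "$dest" || return 1
    chmod +x "$dest" || return 1

    # Add the call to rc.local only once, so repeated installs do not stack.
    touch "$rclocal" || return 1
    if ! grep -qF "$dest" "$rclocal"; then
        echo "$dest run &" >> "$rclocal"
    fi
}
```

Passing a scratch directory as the second argument lets the steps be tried without touching the real system. Note that a value written to /proc/sys/kernel/sysrq does not survive a reboot, so a script along these lines would have to enable sysrq again on each boot.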

comment:2 Changed 3 years ago by martin.langhoff

fstorture.sh ran over the weekend on 3 SKU 198 units, with galcore kernel + xorg (EXA accel, no Xv) - completing around 5K cycles on each computer; no corruption spotted.
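
As a rough illustration of what one such cycle involves, the sketch below bumps a cycle counter, generates some write load, and then forces an unclean reboot through sysrq b. It is an assumption-laden outline rather than the attached script: the function name, the file names, the write sizes and the FSTORTURE_REBOOT guard are all hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch of one torture cycle; NOT the attached fstorture.sh.
# The function name, file names, sizes and FSTORTURE_REBOOT are assumptions.

fstorture_cycle() {
    local workdir="$1"               # directory on the filesystem under test
    local counter="$workdir/cycles"
    local n=0

    # Bump the cycle counter and flush it, so the count survives the reboot.
    if [ -f "$counter" ]; then
        n=$(cat "$counter")
    fi
    n=$((n + 1))
    echo "$n" > "$counter"
    sync

    # Generate write load and deliberately leave it unsynced.
    dd if=/dev/urandom of="$workdir/junk.$((n % 4))" bs=64k count=16 2>/dev/null

    # Only reboot when explicitly asked to, so the sketch is safe to try.
    if [ "${FSTORTURE_REBOOT:-0}" != "1" ]; then
        echo "cycle $n: would reboot now"
        return 0
    fi

    # sysrq 'b' reboots immediately, without syncing or unmounting.
    echo b > /proc/sysrq-trigger
}
```

With FSTORTURE_REBOOT unset, the function only reports what it would do, which makes it possible to check the counter and write logic before letting it loop from rc.local.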

Right now I am upgrading to os4, have updated the fstorture script to use the EC poweroff (instead of sysrq b), and am installing an SD card with ext4. Jon suspects an SD card with ext4 may be part of the recipe.

comment:3 Changed 3 years ago by martin.langhoff

After a few days of running with an SD card, with changes to the script so that it writes to the SD card and performs an EC poweroff, no corruption has been observed on 3 units (only one with an SD card).

Jon mentioned he is using a USB-Ethernet dongle. I have added 2 asix USB-Ethernet dongles, put a cable across them, and added a pingflood to the fstorture test.

With the asix devices connected, and while updating the script (and testing nc and ping between interfaces) we got a few interesting messages: ext2 lookup errors (over /bootpart possibly) and X/Galcore problems. See dmesg attached...
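
Messages like these are easy to miss on an unattended test rig, so a check after each boot can help. The sketch below counts kernel log lines that look like ext2/ext3/ext4 error reports. The function name and the match pattern are assumptions; the exact wording of these kernel messages varies between versions, so the pattern may need adjusting against a real log such as the one attached.

```shell
#!/bin/bash
# Hypothetical helper; the function name and the pattern are assumptions.
# Kernel message formats differ between versions, so check the pattern
# against a real log before relying on it.

count_fs_errors() {
    # Reads kernel log text from the file given as $1, or from dmesg if
    # no file is given. Prints the number of matching lines.
    if [ -n "${1:-}" ]; then
        grep -ciE 'EXT[234]-fs.*error' "$1"
    else
        dmesg | grep -ciE 'EXT[234]-fs.*error'
    fi
}
```

For example, `count_fs_errors dmesg-ext2error-vs-asix2.log` would print the number of matching lines in a saved log. Since grep exits non-zero when nothing matches, the exit status can also be used to decide whether to stop the test loop.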

Changed 3 years ago by martin.langhoff

comment:4 Changed 3 years ago by martin.langhoff

The 2 asix USB ethernet are on the 2 ports on the right. Not sure how our port numbering goes...

comment:5 Changed 3 years ago by jnettlet

Now that we have produced something, we should probably try to reproduce it with the Cache Coalesce fixes that we found. These tests have now made it 48 hours without erroring out the memtest. We are waiting for approval from Marvell, so the fixes are currently applied to a private git branch. You can use this repo: ssh://dev.laptop.org/home/dilinger/private_git/olpc-kernel (there is an arm-3.0-wbcache branch).

Changed 3 years ago by martin.langhoff

Current fstorture.sh

comment:6 Changed 3 years ago by martin.langhoff

  • Resolution set to fixed
  • Status changed from new to closed

We haven't seen any of this for a long long time.
