Opened 7 years ago

Closed 6 years ago

Last modified 13 months ago

#6532 closed defect (fixed)

SD Card Corruption

Reported by: haralds Owned by: dsaxena
Priority: blocker Milestone: 8.2.0 (was Update.2)
Component: kernel Version:
Keywords: release? 8.1.2:? blocks:8.1.2 Cc: PierreOssman, jg, dwmw2, mtd, kronenpj, mstone, cjb, pgf, dilinger, Rmyers, aab@…, gregorio, kimquirk
Blocked By: Blocking:
Deployments affected: Action Needed: communicate
Verified: no

Description

It looks like some of the versions of current update.1 candidates and joyrides have serious problems with flash drives and SD cards. This especially affects ext3 use.

I first blamed cards, but even the good ones get clobbered.

Some additional info:

  • As somebody else remarked, some of this appears to depend to some degree on the response time of the card or USB flash stick/adapter. Some cards I have do not work at all, but are fine in cameras and on OS X/WIN32.
  • Problems range from just corrupting the boot or super block to a whole range of blocks not being written to flash, resulting in an unrecoverable disk.
  • This appears to be more of an issue with Joyride and update.1 candidate builds than the shipping version (656).

I seemed to have less trouble, when running from the SD card directly. Booting and running from NAND seemed to create the issue with external cards.

I also noticed that power management did not seem to work/was disabled on 651/656 (this is documented) and when running from the SD card on other builds. It was enabled in a number of experimental builds running from NAND. Could it be the power management is interfering with the SD/USB flash writes?

Attachments (1)

0001-MMC-Add-400ms-timeout-to-SD-resume-path.patch (1003 bytes) - added by dsaxena 6 years ago.
Temporary workaround patch


Change History (76)

comment:1 Changed 7 years ago by rsmith

  • Cc PierreOssman added
  • Component changed from linuxbios to kernel
  • Owner changed from rsmith to dilinger

comment:2 Changed 7 years ago by Rmyers

Aha! I was noticing my partition table disappearing on an SD card formatted ext2. I haven't been able to characterize it. I thought I was doing something wrong.

I'm currently running candidate 691. I recall this was happening in 656 also.

If it's any use - the card is a SanDisk 2G (part# SDSDB-2048-A11).

comment:3 Changed 7 years ago by haralds

  • Priority changed from normal to blocker

This is repeatable on my system. You run the shipping version 656 - and there is no corruption on the SD card in the SD slot independent of the system running on it (I have tried 693, 1700, and other builds.)

I have also tried joyrides on the NAND, and the problem persists on anything higher than 656.

I update to 693, and the SD card super block is clobbered. If I immediately shut down, I can fdisk the card on another system, and the filesystem is still there.

comment:4 Changed 7 years ago by frankprindle

I don't see how you can possibly let Update.1 out the door with this bug!!!

Corrupting SD cards' partition tables is terribly user-unfriendly. This did not happen in Ship2.2 build 656 (but then again, it didn't suspend on lid closure either).

I just tried running Update.1 candidate build 703, unwittingly put my 4GB SD card with a much extended/enhanced build 656 on it in the slot, watched it mount in /media, then initiated a suspend/resume cycle by closing the lid. After that, it was still mounted and seemed superficially ok, but later when I attempted to reboot from it, it was obvious that the partition table on the card had been wiped out.

After spending about a half hour with fdisk (not only was the table wiped out, but the logical geometry had been altered, so I had to restore that too), it was back to being bootable and usable again. Hardly the kind of thing you should expect folks in far-off countries to do though, just because they closed the lid with an SD card inserted.

FYI, this particular SD card had 2 partitions, the first a 3.5 GB ext3, the second a 0.5 GB swap.

comment:5 follow-up: Changed 7 years ago by mstone

  • Keywords release? added

Frank: Thanks for bringing this bug to my attention.

dilinger: comments, please?

comment:6 Changed 7 years ago by cjb

  • Cc jg dwmw2 added
  • Milestone changed from Never Assigned to Update1.1

We need to do something about this.

comment:7 Changed 7 years ago by cjb

  • Blocking 6893 added

comment:8 Changed 7 years ago by mtd

  • Cc mtd added

comment:9 Changed 7 years ago by kronenpj

  • Cc kronenpj added

comment:10 follow-up: Changed 7 years ago by cscott

As with #4013, it seems like the reporters are swapping to SD, and then suspending. Don't Do That.

comment:11 in reply to: ↑ 10 Changed 7 years ago by haralds

Replying to cscott:

As with #4013, it seems like the reporters are swapping to SD, and then suspending. Don't Do That.

Nope. As far as I know, the SD swap partition is not mounted, when booting from NAND.

comment:12 Changed 7 years ago by mstone

  • Cc mstone added

comment:13 in reply to: ↑ 5 Changed 7 years ago by dilinger

Replying to mstone:

Frank: Thanks for bringing this bug to my attention.

dilinger: comments, please?

Have not looked into it yet..

comment:14 Changed 7 years ago by dsaxena

  • Owner changed from dilinger to dsaxena
  • Status changed from new to assigned

Taking this one. I just got my XO, need to get a stack of SD cards and just get familiar with the sugar environment, etc so it will be at least a few days before I can reproduce and have more data to look at.

comment:15 Changed 7 years ago by Lerc

I just got bitten by this. I use a 4 GB SD card formatted as VFAT, with a 2 GB ext3 drive image stored on the VFAT. Here's more or less what happened.

I have a script to mount the ext3 image, and when I used the XO after a suspend I noticed the ext3 wasn't there, so I ran the script again. It reported that it could not find the ext3 image file.

Looking in /media/ I found (going from fairly recent memory, but it still may not be exact)

disk disk-1 disk-2 disk-3

disk and disk-2 appeared to be simple empty directories.
disk-3 had the drive image inside and appeared to be the real thing.
disk-1 appeared to be a mount, but ls showed a couple of gibberish files.

Things were generally not looking good, so I rebooted. On reboot /media was empty.

Attempting to manually mount the SD reported a bad superblock.
If I were to take a stab, I'd say whatever happened to make disk-1 is the corruption problem.
Does the datastore try to set up as soon as it sees something in /media? A bad mapping and a trigger-happy writer would be a recipe for corruption.

comment:16 Changed 7 years ago by mstone

In update.1- and joyride-series builds, we've got a udev rule that dumps events into the HAL socket. In the case of storage devices like your SD card, HAL fires off some DeviceAdded D-Bus events which are received by the Journal's volumesmanager.py logic. The Journal then automounts the storage.

comment:17 Changed 7 years ago by dsaxena

I've reproduced this locally with a 1G SD card. It is intermittent and once managed to OOPS the kernel. Will get a kernel stack trace and dig in more once I get local kernels booting.

comment:18 Changed 7 years ago by dsaxena

I've logged that the following is happening in cases where this failure occurs:

SD card is inserted and appears as /dev/mmcblk0

In my case I have 1 partition (/dev/mmcblk0p1) that gets mounted

Upon suspend/resume, the device gets rediscovered as /dev/mmcblk1. (In cases where the failure does not occur, the device is still at /dev/mmcblk0).

At this point, the partition table is still OK as /dev/mmcblk1p1 gets mounted, but running fdisk on /dev/mmcblk1 shows that it has been destroyed, so the corruption is happening sometime after or during the mount process.
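The fdisk check described above can be approximated by looking at sector 0 directly. What follows is a minimal sketch, assuming a classic MBR layout (boot signature bytes 0x55 0xAA at offset 510, four 16-byte entries starting at offset 446); the helper names are hypothetical and not part of any OLPC tool. A missing signature is only a hint that sector 0 was overwritten, not proof of how it happened.

```python
import struct

SECTOR_SIZE = 512
MBR_TABLE_OFFSET = 446   # first of four 16-byte partition entries
MBR_SIGNATURE = b"\x55\xaa"


def mbr_partitions(sector0):
    """Return [(type, start_lba, num_sectors), ...] for non-empty MBR
    entries, or None if the boot signature is missing."""
    if len(sector0) < SECTOR_SIZE or sector0[510:512] != MBR_SIGNATURE:
        return None
    entries = []
    for i in range(4):
        offset = MBR_TABLE_OFFSET + 16 * i
        ptype = sector0[offset + 4]  # partition type byte
        # LBA start and sector count are little-endian 32-bit values
        start_lba, num_sectors = struct.unpack_from("<II", sector0, offset + 8)
        if ptype != 0:
            entries.append((ptype, start_lba, num_sectors))
    return entries


def read_sector0(device_path):
    """Read the first sector of a block device (usually needs root)."""
    with open(device_path, "rb") as dev:
        return dev.read(SECTOR_SIZE)
```

For example, `mbr_partitions(read_sector0("/dev/mmcblk1"))` returning `None` or an empty list on a card that previously had one partition would match the symptom reported here.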

comment:19 Changed 7 years ago by Lerc

There's definitely something odd going on with the resume mount process

I've reproduced the multi-mount problem. I currently have in /media

drwx------ 2 root root 40 2008-06-05 07:23 disk
drwxrwxrwx 3 olpc root 32768 1970-01-01 00:00 disk-1
drwxrwxrwx 3 olpc root 32768 1970-01-01 00:00 disk-2
-rw-r--r-- 1 root root 323 2008-06-06 06:30 .hal-mtab
-rw------- 1 root root 0 2008-06-05 07:23 .hal-mtab-lock

relevant bits of /etc/mtab

/dev/mmcblk1p1 /media/disk-1 vfat rw,nosuid,nodev,noatime,uhelper=hal,uid=500,umask=0,utf8,iocharset=utf8 0 0
/dev/mmcblk2p1 /media/disk-2 vfat rw,nosuid,nodev,noatime,uhelper=hal,uid=500,umask=0,utf8,iocharset=utf8 0 0

in /media/.hal-mtab

/dev/mmcblk0p1 500 0 vfat nosuid,nodev,uhelper=hal,uid=500,umask=0,noatime,utf8,iocharset=utf8 /media/disk
/dev/mmcblk2p1 500 0 vfat nosuid,nodev,uhelper=hal,uid=500,umask=0,noatime,utf8,iocharset=utf8 /media/disk-2

Looking into /media/disk-1 reveals garbage. I don't know where it gets its garbage from, since at this point there isn't a /dev/mmcblk1p1.

generally things are all messed up.
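The mismatch above (a mount table entry for a device node that no longer exists) can be spotted mechanically. This is a minimal sketch, assuming the usual whitespace-separated mtab format with the device in the first field and the mount point in the second; the function name and the injectable `exists` parameter are hypothetical, added only so the logic can be exercised without real hardware.

```python
import os


def stale_mmc_mounts(mtab_text, exists=os.path.exists):
    """Return (device, mount_point) pairs for mounted /dev/mmcblk*
    entries whose device node is gone."""
    stale = []
    for line in mtab_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        device, mount_point = fields[0], fields[1]
        if device.startswith("/dev/mmcblk") and not exists(device):
            stale.append((device, mount_point))
    return stale
```

On a live system this could be called as `stale_mmc_mounts(open("/etc/mtab").read())`; with the mtab shown above and no /dev/mmcblk1p1 node present, it would report the /media/disk-1 entry.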

comment:20 Changed 7 years ago by dsaxena

This is not an OLPC-specific issue. I just discovered that my laptop (Lenovo X61s)
has an SD slot and I have reproduced the suspend/resume corruption running the stock
Ubuntu Hardy kernel ( 2.6.24-18-generic) on this machine. I will boot into kernel.org
latest when I have a chance and see if the problem persists.

comment:21 Changed 7 years ago by cjb

Wow, incredible. Maybe worth dropping Pierre an e-mail in case he's familiar with the problem.

comment:22 follow-ups: Changed 7 years ago by PierreOssman

This is by design and has been discussed to death in several places. The SD controller doesn't have any functionality to tell if you left the card in the slot during the suspend, so it is assumed to have been removed (failing to do so would give silent data corruption if you've put your card into some other device and back again as the filesystems lack suspend handling). You can make the kernel assume that the card didn't leave the slot by enabling "unsafe resume" in the kernel config.

comment:23 in reply to: ↑ 22 Changed 7 years ago by dsaxena

Replying to PierreOssman:

This is by design and has been discussed to death in several places. The SD controller doesn't have any functionality to tell if you left the card in the slot during the suspend, so it is assumed to have been removed (failing to do so would give silent data corruption if you've put your card into some other device and back again as the filesystems lack suspend handling). You can make the kernel assume that the card didn't leave the slot by enabling "unsafe resume" in the kernel config.

Pierre, by "by design", are you referring to the rediscovery of the card as a different block device (as per the fix in #4013) or do you mean the data corruption? I don't think the latter is really acceptable as something for shipping kernels on OLPC or in upstream.

comment:24 in reply to: ↑ 22 ; follow-up: Changed 7 years ago by dsaxena

Replying to PierreOssman:

You can make the kernel assume that the card didn't leave the slot by enabling "unsafe resume" in the kernel config.

CONFIG_UNSAFE_RESUME is disabled on the Ubuntu 2.6.24 kernel and enabled on the OLPC 2.6.22 kernel, so it does not seem related to the corruption issue.

comment:25 in reply to: ↑ 24 ; follow-up: Changed 7 years ago by PierreOssman

Replying to dsaxena:

Pierre, by "by design", are you referring to the rediscovery of the card as a different block device (as per the fix in #4013) or do you mean the data corruption? I don't think the latter is really acceptable as something for shipping kernels on OLPC or in upstream.

The different block device name is related, but also different. It is caused by something keeping the original device open, which prevents the newly discovered device from claiming the same name. Instead of disappearing completely, it reappears under another name.

The only safe way of avoiding data corruption is to make sure that you do not suspend with dirty filesystems. Either umount them before suspend, or fix up the filesystems to do this automatically on suspend (i.e. improve the kernel).

Replying to dsaxena:

CONFIG_UNSAFE_RESUME is disabled on the Ubuntu 2.6.24 kernel and enabled on the OLPC 2.6.22 kernel, so it does not seem related to the corruption issue.

The OLPC has a hardware bug that makes the system behave as if MMC_UNSAFE_RESUME is always disabled.

comment:26 in reply to: ↑ 25 ; follow-up: Changed 7 years ago by dsaxena

  • Cc cjb added

Replying to PierreOssman:

The only safe way of avoiding data corruption is to make sure that you do not suspend with dirty filesystems. Either umount them before suspend, or fix up the filesystems to do this automatically on suspend (i.e. improve the kernel).

Chris, how hard would it be as a short-term measure to have our userspace either unmount the SD card or force a sync to the SD card before suspend?

Replying to dsaxena:

CONFIG_UNSAFE_RESUME is disabled on the Ubuntu 2.6.24 kernel and enabled on the OLPC 2.6.22 kernel, so it does not seem related to the corruption issue.

The OLPC has a hardware bug that makes the system behave as if MMC_UNSAFE_RESUME is always disabled.

Do you/we have any details/documentation on this?

Thanks for your input Pierre.

comment:27 follow-up: Changed 7 years ago by cjb

Hi,

Chris, how hard would it be as a short-term measure to have our userspace either unmount the SD card or force a sync to the SD card before suspend?

Suspend is mediated by OHM, so anything doable with root-running userspace code is possible.

force a sync to the SD card before suspend?

Would there be anything more specific than sync(2) for doing so? As a pessimal case, calling sync(2) iff there's an SD card inserted before suspending doesn't sound too bad.

I'm still pretty confused about this, though. Why is sync(2) required to avoid *partition table corruption*, and why doesn't this happen on USB?
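The "call sync(2) only if an SD card is present" idea could look roughly like the following. This is a minimal sketch, not OHM code: it treats "present" as "has a mounted /dev/mmcblk* partition" (an assumption, since a card can be inserted without being mounted), and the function names and the injectable `sync` parameter are hypothetical. As noted elsewhere in this ticket, a sync lowers the risk of lost writes but does not mark most filesystems clean.

```python
import os


def sd_card_mounted(mounts_text):
    """True if any /dev/mmcblk* device appears in a /proc/mounts-style listing."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("/dev/mmcblk"):
            return True
    return False


def sync_before_suspend(mounts_text, sync=os.sync):
    """Flush dirty buffers if an SD partition is mounted.

    Returns True if a sync was issued, False otherwise.
    """
    if sd_card_mounted(mounts_text):
        sync()  # os.sync() wraps sync(2) on Unix (Python 3.3+)
        return True
    return False
```

A pre-suspend hook could call it as `sync_before_suspend(open("/proc/mounts").read())`.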

comment:28 Changed 7 years ago by mstone

Must we also pay attention to whether we're booted off of the peripheral storage device in question?

comment:29 Changed 7 years ago by haralds

I have not had the problem, when booting from SD Card.
But, I am a sample size of one ;-)

comment:30 in reply to: ↑ 26 ; follow-up: Changed 7 years ago by PierreOssman

Replying to dsaxena:

Chris, how hard would it be as a short-term measure to have our userspace either unmount the SD card or force a sync to the SD card before suspend?

Just remember that many filesystems mark the filesystem as dirty until it is unmounted. You'll probably remove the greatest risk for corruption using a sync though. Also, FAT will be fine using just sync. You should also note ticket #4013.

The OLPC has a hardware bug that makes the system behave as if MMC_UNSAFE_RESUME is always disabled.

Do you/we have any details/documentation on this?

Ticket #1339.

Replying to cjb:

I'm still pretty confused about this, though. Why is sync(2) required to avoid *partition table corruption*, and why doesn't this happen on USB?

The partition table problem is most likely something else. The only so far known problem is lost writes, which should not cause any partition table issues. My first guess would be something power related.

Replying to mstone:

Must we also pay attention to whether we're booted off of the peripheral storage device in question?

I have some faint recollection of suspend being actively crippled when booting off SD because of this problem.

comment:31 in reply to: ↑ 27 Changed 7 years ago by dsaxena

Replying to cjb:

I'm still pretty confused about this, though. Why is sync(2) required to avoid *partition table corruption*, and why doesn't this happen on USB?

Good point. I don't know enough about SD at the moment to make any educated comments so
I need to dig more before I can answer that. I also need to see that I can reproduce this regularly on !OLPC, which so far is a no.

comment:32 Changed 7 years ago by cjb

mstone:

Must we also pay attention to whether we're booted off of the peripheral storage device in question?

Yes; we outright die if we try to resume when / is on SD, so we need to inhibit that.

ossman:

Also, FAT will be fine using just sync

Thanks! That's a much better option. (I wonder whether we already do it.)

The partition table problem is most likely something else. The only so far known problem is lost writes, which should not cause any partition table issues. My first guess would be something power related.

Okay; we need to concentrate on this symptom.

comment:33 Changed 7 years ago by pgf

  • Cc pgf added

comment:34 Changed 7 years ago by dsaxena

  • Cc dilinger added

I've spent some time digging deep into the bowels of the VFS and block layer and
gathering some debug output and have an explanation for the partition table corruption:

Upon coming out of resume, the SD code, with CONFIG_MMC_UNSAFE_RESUME enabled, checks
to see if there is a card plugged into the system and whether that card is the same
as the one that was plugged into the system at suspend time. This is accomplished by
reading the card ID of the device and for some reason, very possibly #1339, we fail
this detection. In this case, the kernel removes the old device from the system and in
this execution path, the partition information for this device is zeroed.

Even though the device is removed, the device is still mounted and upon unmount,
ext2 syncs the superblock, even if the file system is sync'd beforehand. The superblock
is block 0 of the partition and the block layer adds to this the partition start
offset before submitting the write to the lower layers. As the partition information
has already been zeroed out, we end up writing to block 0 of the disk itself, overwriting
the partition table and the geometry information. I've verified this by both gathering
debug output and 'dd' + 'hexdump' of corrupted and uncorrupted media.
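The offset arithmetic behind this can be shown with a toy model. This is only an illustrative sketch of the remapping described above, not kernel code: the `Partition` class, the `remap` function, and the start sector of 63 (a common value for tables written by older fdisk versions) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    start_sector: int  # offset of the partition from the start of the disk
    nr_sectors: int


def remap(partition, sector):
    """Translate a partition-relative sector into an absolute disk sector,
    the way the block layer adds the partition start offset."""
    return partition.start_sector + sector


p1 = Partition(start_sector=63, nr_sectors=2_000_000)

# Normal case: the write to partition block 0 lands inside the partition.
healthy_target = remap(p1, 0)

# Card "removal" on resume zeroes the partition information...
p1.start_sector = 0
p1.nr_sectors = 0

# ...so the same write at unmount now lands on sector 0 of the whole disk,
# which is where the partition table lives.
clobbered_target = remap(p1, 0)
```

In this model `healthy_target` is 63 and `clobbered_target` is 0, which mirrors the observed overwrite of the partition table.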

Some interesting points:

  1. We are able to delete a block device even though it is still mounted.
  2. Even though the device has been deleted, the write submitted to it does not fail.

Note that this is still not 100% reproducible and in certain cases the superblock
write during unmount does fail with block I/O errors, meaning that the queue is properly deleted. As per dilinger's comments on IRC, the VFS has lots of refcounts and there is a timing issue/race condition that we're hitting. As per #1339, we may be able to add an OLPC-specific hack to wait 500ms or so upon resume to get around this. I will try this but I don't think this is acceptable given our suspend/resume requirements.

Something I don't quite understand at the moment is how/when our userland env (journal
specifically I think?) unmounts the device, as I've been testing via command-line suspend,
mount, and unmount while running in console mode.

Next steps:

  1. Get an understanding of what is happening with our userland and brainstorm with cjb about the possibility of simply unmounting the SD device upon suspend. There are issues around this as we may have files open and that will keep us from suspending.
  2. Test adding a timeout to the resume path to see if it solves our problem to validate that it is indeed something related to our HW.
  3. Dig into the unmount/write to non-existing bdev some more and discuss this upstream if needed.

(Adding dilinger to cc:)

comment:35 Changed 7 years ago by dsaxena

I have validated that adding a 500ms timeout in the resume path causes this issue to go away.

comment:36 follow-ups: Changed 7 years ago by cjb

That's a good worst-case, then. Does something like 100ms work, too?

comment:37 in reply to: ↑ 36 Changed 7 years ago by dsaxena

Replying to cjb:

That's a good worst-case, then. Does something like 100ms work, too?

Next thing to try. I will decrease the timeout until it starts crashing. Related to
this, Mitch sent me the OFW SD code so I can see how it is interacting with the HW.

comment:38 follow-up: Changed 7 years ago by Rmyers

  • Cc Rmyers added

Great to see progress on this. A couple of questions: Is adding the timeout considered to be a final fix, or is an underlying cause going to be addressed? Is adding the timeout something that can be applied as a patch to test?

comment:39 in reply to: ↑ 38 Changed 7 years ago by dsaxena

Replying to Rmyers:

Great to see progress on this. A couple of questions: Is adding the timeout considered to be a final fix, or is an underlying cause going to be addressed? Is adding the timeout something that can be applied as a patch to test?

The final solution may end up being a timeout of a different value but I am digging into the
code a bit more to see if there is another option. Even with a timeout, the current patch (follows) is a hack and the timeout needs to be handled via a quirk of some sort in the code path.

diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c
index 918477c..fc01f97 100644
--- a/drivers/mmc/core/sd.c
+++ b/drivers/mmc/core/sd.c
@@ -527,6 +527,8 @@ static void mmc_sd_resume(struct mmc_host *host)

        mmc_claim_host(host);

+       msleep(500);
+
        err = mmc_sd_init_card(host, host->ocr, host->card);
        if (err != MMC_ERR_NONE) {
                mmc_remove_card(host->card);

comment:40 in reply to: ↑ 36 Changed 7 years ago by dsaxena

Replying to cjb:

That's a good worst-case, then. Does something like 100ms work, too?

So trying 100, 200, 300, and 400ms, only 400ms guarantees that this works perfectly. Note that 400ms is what Pierre suggested in #1339.

comment:41 Changed 7 years ago by dsaxena

As #1339 states, we really need to know _why_ we are not detecting the card, so for now I'm going to push the 400ms mdelay into our kernel to allow usage of SD cards w/o corruption and continue down the path of understanding what is happening at the lowest level of the driver.

Changed 6 years ago by dsaxena

Temporary workaround patch

comment:42 in reply to: ↑ 30 ; follow-up: Changed 6 years ago by ggoebel

Nice to see the workaround and progress on root causes!

Replying to PierreOssman:

Replying to mstone:

Must we also pay attention to whether we're booted off of the peripheral storage
device in question?

I have some faint recollection of suspend being actively crippled when booting
off SD because of this problem.

If the root cause is identified and a solution is possible, it'd be nice to make sure this case of booting off SD and/or mounting on / gets un-crippled.

comment:43 in reply to: ↑ 42 Changed 6 years ago by frankprindle

Replying to ggoebel:

Replying to PierreOssman:

Replying to mstone:

Must we also pay attention to whether we're booted off of the peripheral storage
device in question?

I have some faint recollection of suspend being actively crippled when booting
off SD because of this problem.

If the root cause is identified and a solution is possible, it'd be nice to make sure this case of booting off SD and/or mounting on / gets un-crippled.

I'd say it is very important, not just nice. The corruption issue (this ticket) is obviously foremost, but it's important that ticket 4013 also be resolved, since having everything in the root fs come up "not found" after a suspend/resume cycle is quite user unfriendly; using an SD swap partition should also work flawlessly, (what more reasonable place is there for a swap partition?) Note that in the absence of suspend (i.e. build 656), all of the above worked just fine.

That being said, congratulations on the significant progress on tracking down the cause(s) of the SD/suspend issues over the past couple weeks.

comment:44 Changed 6 years ago by Andrew Burgess

  • Cc aab@… added

comment:45 follow-ups: Changed 6 years ago by dsaxena

  • Action Needed set to never set
  • Resolution set to fixed
  • Status changed from assigned to closed

I am going ahead and marking this as fixed since the timeout patch has been committed to all OLPC kernel trees. A proper, upstream acceptable quirk to handle this is still needed, but that is not a blocker for using SD cards on the XO.

comment:46 Changed 6 years ago by dsaxena

  • Milestone changed from 8.1.1 (was Update1.1) to 8.2.0 (was Update.2)

comment:47 in reply to: ↑ 45 ; follow-up: Changed 6 years ago by cjb

Hi,

Replying to dsaxena:

I am going ahead and marking this as fixed since the timeout patch has been committed to all OLPC kernel trees. A proper, upstream acceptable quirk to handle this is still needed, but that is not a blocker for using SD cards on the XO.

Does the delay happen regardless of whether an SD card is plugged in at resume time? If so, that's not an acceptable punishment for people who aren't using SD cards and have power management turned on.

comment:48 in reply to: ↑ 47 ; follow-up: Changed 6 years ago by dsaxena

Replying to cjb:

Hi,

Replying to dsaxena:

I am going ahead and marking this as fixed since the timeout patch has been committed to all OLPC kernel trees. A proper, upstream acceptable quirk to handle this is still needed, but that is not a blocker for using SD cards on the XO.

Does the delay happen regardless of whether an SD card is plugged in at resume time? If so, that's not an acceptable punishment for people who aren't using SD cards and have power management turned on.

Nope. That code path is only called when there is actually a device plugged into the slot.

comment:49 in reply to: ↑ 48 Changed 6 years ago by cjb

Replying to dsaxena:

Nope. That code path is only called when there is actually a device plugged into the slot.

Great. Objection withdrawn. :)

comment:50 in reply to: ↑ 45 ; follow-up: Changed 6 years ago by frankprindle

Replying to dsaxena:

I am going ahead and marking this as fixed since the timeout patch has been committed to all OLPC kernel trees. A proper, upstream acceptable quirk to handle this is still needed, but that is not a blocker for using SD cards on the XO.

So by closing this ticket and #4013 you are asserting that:

a) SD cards in use at suspend time will not have their partition table/logical geometry corrupted when the XO resumes.

b) SD card partitions mounted at suspend time will still be mounted and usable when the XO resumes (regardless of partition type).

c) No data written to an SD card filesystem will be lost through the suspend/resume cycle.

d) The XO will suspend and resume properly when it was booted from (and thus its root filesystem resides on) the SD card.

e) The XO will suspend and resume properly when there is an active swap partition on the SD card.

If all of the above have been tested and found to be working, I'd say you have inserted one mighty wonderful line-of-code! Just don't jump to conclusions without testing all the permutations. Great team detective work, by the way!!!

comment:51 Changed 6 years ago by Andrew Burgess

I was going to wait and test and just open another ticket but since you made a list:

f) The partition table won't get corrupted when the XO shuts down with a swap partition on the SD card.

I mentioned this on one of the other tickets that was closed. I have to remember to swapoff before shutdown now, which is not a huge deal.

Being able to suspend/resume with an active swap partition will be very nice and I appreciate that being fixed. I realize that the SD card working at all is of zero interest to 99.9% of the XO users (the 'real' users, the kids) so I consider myself fortunate.

comment:52 in reply to: ↑ 50 Changed 6 years ago by dsaxena

Replying to frankprindle:

Replying to dsaxena:

I am going ahead and marking this as fixed since the timeout patch has been committed to all OLPC kernel trees. A proper, upstream acceptable quirk to handle this is still needed, but that is not a blocker for using SD cards on the XO.

So by closing this ticket and #4013 you are asserting that:

a) SD cards in use at suspend time will not have their partition table/logical geometry corrupted when the XO resumes.

Yes

b) SD card partitions mounted at suspend time will still be mounted and usable when the XO resumes (regardless of partition type).

Yes

c) No data written to an SD card filesystem will be lost through the suspend/resume cycle.

Yes

d) The XO will suspend and resume properly when it was booted from (and thus its root filesystem resides on) the SD card.

e) The XO will suspend and resume properly when there is an active swap partition on the SD card.

Yes, though these are both via "echo mem > /sys/power/state". We won't know for sure until we enable suspend/resume at the OHM level when an SD card is plugged in (#6893).
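A manual test of points (a) through (e) along these lines could be scripted. The following is a rough sketch under several assumptions: the helper names are hypothetical, the suspend is triggered by writing "mem" to /sys/power/state as in the comment above, and the "before" and "after" readers are passed in separately because the card may reappear under a different device name after resume. A read right after resume may also be served from the page cache, so re-checking the card in another machine is the more reliable test.

```python
SECTOR_SIZE = 512


def suspend_to_ram(state_path="/sys/power/state"):
    """Equivalent of `echo mem > /sys/power/state` (needs root).
    The write returns once the machine has resumed."""
    with open(state_path, "w") as state:
        state.write("mem")


def read_sector0(device_path):
    """Read the first sector of a block device."""
    with open(device_path, "rb") as dev:
        return dev.read(SECTOR_SIZE)


def sector0_survives(read_before, suspend, read_after):
    """Compare sector 0 before and after one suspend/resume cycle.

    All three arguments are zero-argument callables so the check can be
    run against real hardware or against stand-ins.
    """
    before = read_before()
    suspend()
    after = read_after()
    return before == after
```

On an XO this might be invoked as `sector0_survives(lambda: read_sector0("/dev/mmcblk0"), suspend_to_ram, lambda: read_sector0("/dev/mmcblk0"))`, with the device path adjusted if the card is rediscovered under another name.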

comment:53 Changed 6 years ago by dsaxena

  • Resolution fixed deleted
  • Status changed from closed to reopened

Re-opening the bug as we don't have a process to separately track upstream submission.

comment:54 follow-ups: Changed 6 years ago by mikus

This seems as good a place as any to post the following:

Was running 708 on my G1G1, with '/etc/ohm/inhibit-idle-suspend' present. Further, the XO was running a background computation, which used 100% of all available CPU cycles. I was EXTREMELY surprised to see the "power" light blinking -- the XO had suspended.

At the time, I had two removable USB storage devices plugged in, in addition to my "permanent" SD card. One of those removable storage devices was a hard disk, which provided disk storage for the computation and also provided a swap partition for the XO to use. In addition, at the time the suspend occurred, I had an 'rsync' operation going on between nand and the other removable storage device - a USB flash drive. When I 'resumed' (by touching the "power" button), the 'rsync' went into an error loop and the computation task closed because it could not checkpoint, while the primary text console (alt-ctl-F1) showed scads of errors regarding the system swap device.


My reason for posting this is to point out that not only SD cards are affected by this problem -- hard disks are as well.


[And why didn't the CPU being 100% busy prevent this suspend ?]

comment:55 in reply to: ↑ 54 Changed 6 years ago by frankprindle

Replying to mikus:

Was running 708 on my G1G1, with '/etc/ohm/inhibit-idle-suspend' present ... the XO had suspended.
At the time, I had two removable USB storage devices plugged in, in addition to my "permanent" SD card.

Yikes, I hope none of them had its partition table fried.

[And why didn't the CPU being 100% busy prevent this suspend ?]

Indeed... and with inhibit-idle-suspend. Wow!

comment:56 in reply to: ↑ 54 Changed 6 years ago by gnu

When the XO suspends, all USB devices lose power and drop off the USB bus. (Other than the WiFi chips, which only drop off the bus.)

If your USB disk drive was bus-powered, it would spin down. If it was separately powered, it might still go into an odd state when the USB bus is powered off.

On both USB and SD, the installed device may be doing internal processing even when the USB bus is idle (e.g. the device's controller chip may be erasing flash blocks for later use, or writing buffered data onto permanent flash storage). There's a protocol for when the host wants to power down or remove USB or SD devices; I don't think our suspend code follows these protocols, so we might power down these devices at a very inconvenient time for them, resulting in errors on resume.

The OLPC's generic USB suspend code also needs the same kind of special-case checks that look to see if the same devices are plugged in, and if so, resume those devices cleanly without re-mounting. Currently that code only seems to work for the WiFi chip (and maybe now SD).

Tickets #1423 (closed as if fixed), #2432 (closed as if fixed), #3767 ("jg: needs TLC after FRS"), #4876 are related. But there appears to be no straightforward, "USB disks die after suspend" bug filed yet; I suggest filing one, and copying these entries over, if you can reproduce this.

comment:57 Changed 6 years ago by gregorio

  • Cc gregorio kim added

Is this bug closed? Do you have an estimated ETA or scoping of what it will take?

Can you also include a brief test description? I think Kim wants to make this a blocker for 8.1.2...

Thanks,

Greg S

comment:58 Changed 6 years ago by mstone

  • Cc kimquirk added; kim removed

Greg,

The ticket can't possibly be closed because we haven't

  • fixed the bug in any of our releases
  • pushed our changes upstream into the kernel (so far as I know)
  • discovered the root cause of the problem
  • fixed any of the related potential suspend-corrupts-my-peripheral bugs that Mikus and Frank are concerned about.

That being said, Deepak did great work in finding a tiny workaround that seems to prevent corruption of the partition tables on SD cards. Does that answer your question?

comment:59 Changed 6 years ago by gregorio

Hi Michael,

I did see http://lists.laptop.org/pipermail/devel/2008-July/016278.html and I don't see this one in any of the listed states, but we should probably talk about how to interpret bug status when we follow up from our Trac meeting.

The main point on this one is that it's a must-have blocker for 8.1.2. Can you mark it as such and let me know how to track that?

Can we also send out Deepak's workaround to devel and get one round of review on it?

If it's not acceptable, do we have a plan B?

Thanks a lot Deepak for making a big difference on this hot one!

Thanks,

Greg S

comment:60 follow-up: Changed 6 years ago by mstone

  • Action Needed changed from never set to communicate
  • Keywords 8.1.2:? blocks:8.1.2 added

Deepak's work has already received some wide-spread testing. Deepak - is one of your new kernels a patched 708 kernel?

comment:61 in reply to: ↑ 60 ; follow-ups: Changed 6 years ago by dsaxena

Replying to mstone:

Deepak's work has already received some wide-spread testing. Deepak - is one of your new kernels a patched 708 kernel?

Yes.

comment:62 in reply to: ↑ 61 Changed 6 years ago by hhardy

Replying to dsaxena:

Replying to mstone:

Deepak's work has already received some wide-spread testing. Deepak - is one of your new kernels a patched 708 kernel?

Yes.

What build would you like this fix tested on?

I have had problems with corruption of ext3, swap, and vfat sd partitions previously.

--HH.

comment:63 Changed 6 years ago by djones

It appears that whatever is causing this is also biting the OpenMoko people. (Either that or it's a huge coincidence.) Exact same symptoms: block 0 getting wiped, mmcblk0 coming back as mmcblk1 on resume, etc.

Andy Green on the Openmoko Community list:

"This is ultimately a resume race of some kind, the VFS layer corruption
and taking a whiz on block 0 (noticeable as it is) is downstream of
whatever is truly responsible..."

See:

http://lists.openmoko.org/pipermail/community/2008-July/023787.html

and the rest of that thread.
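For anyone trying to reproduce the "block 0 getting wiped" symptom, a quick check of the first sector before and after a suspend/resume cycle can help. A minimal Python sketch, assuming the card carries an MBR-style partition table (whose sector ends in the bytes 0x55 0xAA) and assuming that "wiped" shows up as an all-zero sector, which may not match every failure. `classify_block0` and `read_block0` are hypothetical names.

```python
SECTOR_SIZE = 512
MBR_SIGNATURE = b"\x55\xaa"  # last two bytes of a valid MBR sector


def classify_block0(sector):
    """Classify the first sector of a disk image or device."""
    if len(sector) < SECTOR_SIZE:
        return "short"
    sector = sector[:SECTOR_SIZE]
    if sector == bytes(SECTOR_SIZE):
        return "wiped"  # every byte is zero
    if sector[510:512] == MBR_SIGNATURE:
        return "mbr-present"
    return "no-mbr-signature"


def read_block0(path):
    """Read the first sector of a device or image file (needs read access)."""
    with open(path, "rb") as f:
        return f.read(SECTOR_SIZE)
```

Calling `classify_block0(read_block0("/dev/mmcblk0"))` before suspend and again after resume would show whether the sector changed; a filesystem made directly on the raw device without a partition table may legitimately report `no-mbr-signature`.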

They are working very hard on this problem, "intensely" is how someone described it. Although most XO users currently aren't using SD cards and many people have considered the problem as somewhat incidental to OLPC's core mission, the OpenMoko community considers the SD card as essential. Also, OpenMoko isn't a non-profit, and will shift every engineer it can find onto this kind of blocker. They are in a real hurry to fix this.

They are now aware of OLPC ticket #6532 and are reading every word. It's now a race to see who understands the problem first, and who can find a good fix first.

Cross-fertilization between these two FOSS projects would be a really good idea.

comment:64 Changed 6 years ago by Andrew Burgess

I meant to post this a week ago. For me the SD card corruption is 100% fixed now. I run swap on the first SD card partition, and I could guarantee a partition wipe by turning off power or shutting down with swap on. I could work around it 100% by running swapoff before power down. I never enabled suspend. Now everything works. It suspends and resumes at will with swap running. Shutdown or mash the power button, partition table is fine. Perhaps this was fixed upstream? Openmoko is still on kernel 2.6.24.
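The swapoff workaround can be scripted. A rough Python sketch, assuming SD card partitions appear as `/dev/mmcblk*` and that the script runs as root; `sd_swap_areas`, `swapoff_commands` and `disable_sd_swap` are hypothetical names:

```python
import subprocess


def sd_swap_areas(proc_swaps_text, prefix="/dev/mmcblk"):
    """Return active swap devices whose path starts with `prefix`.

    `proc_swaps_text` is the content of /proc/swaps: one header row,
    then one row per swap area with the device path in the first column.
    """
    devices = []
    for line in proc_swaps_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[0].startswith(prefix):
            devices.append(fields[0])
    return devices


def swapoff_commands(devices):
    """Build one `swapoff <device>` command per swap area."""
    return [["swapoff", dev] for dev in devices]


def disable_sd_swap():
    """Turn off every swap area that lives on an SD card (requires root)."""
    with open("/proc/swaps") as f:
        devices = sd_swap_areas(f.read())
    for cmd in swapoff_commands(devices):
        subprocess.run(cmd, check=True)
    return devices
```

Running `disable_sd_swap()` from a shutdown hook would reproduce the manual workaround; where such a hook belongs on a given build is not covered here.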

I vote to mark this FIXED.

I'm running joyride from a week ago, circa Jul 20th.

comment:65 Changed 6 years ago by cjb

  • Blocking 6893 removed

comment:66 in reply to: ↑ 61 ; follow-up: Changed 6 years ago by adin

Replying to dsaxena:

Replying to mstone:

Deepak's work has already received some wide-spread testing. Deepak - is one of your new kernels a patched 708 kernel?

Yes.

Sorry if I'm being a bit slow, but should I assume that Deepak's patch (dsaxena, right?) made it into the build 708/8.1.1 release? Or is it just in the joyride?

comment:67 in reply to: ↑ 66 Changed 6 years ago by dsaxena

Replying to adin:

Sorry if I'm being a bit slow, but should I assume that Deepak's patch (dsaxena, right?) made it into the build 708/8.1.1 release? Or is it just in the joyride?

It is currently only in joyride (which is running the 2.6.25 kernel) and in build 710. If you feel comfortable just updating a kernel RPM (see instructions in wiki), you can grab the 710 kernel from http://dev.laptop.org/~dsaxena/kernels/8.1/kernel-2.6.22-20080808.olpc1.4c233ce8ed8f9cb.i586.rpm.

comment:68 Changed 6 years ago by gregorio

Hi Adin,

This fix is not in 708. See the 8.1.1 release notes.
http://wiki.laptop.org/go/Release_Notes

Thanks,

Greg S

comment:69 Changed 6 years ago by genesee

Help? My SD card has been flaky since 703 (now running joyride-2401). It looks mounted okay, but I cannot copy or drag and drop to it in the Journal.

comment:70 Changed 6 years ago by hhardy

Using build 711:

32 GB

Made an ext3 partition and filesystem on the raw device.

This worked till I rebooted, then the filesystem was trashed and it looked like an unformatted device.

So I made a mountpoint at /mnt/sdhc.

Umount didn't work, so I used the journal to unmount the sdhc.

Then I made a valid partition table and ext3 filesystem. Then I mounted my partition. The partition table seemed to be deleted right away (possibly after entering power saving dull screen mode).

Upshot is I have a 32 GB sdhc card which cannot be used.

comment:71 Changed 6 years ago by pgf

having talked through this with henry, we think it's at least possible that some of his symptoms might have been caused by the modern magic of multiple mountpoints -- i.e., if sugar had his SD card mounted at the time that he created his filesystem(s), this could explain a lot. but this isn't definitively what happened.
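A quick way to rule out the multiple-mountpoints case before partitioning or running mkfs is to confirm that nothing on the card is mounted. A minimal Python sketch that parses `/proc/mounts`; `mounted_under` and `safe_to_format` are hypothetical names, and the check only covers mounts, not other users of the device such as swap:

```python
import re


def mounted_under(device, proc_mounts_text):
    """Return (source, mountpoint) pairs for `device` and its partitions.

    Matches the device itself (e.g. /dev/mmcblk0) and partition names
    such as /dev/mmcblk0p1 or /dev/sdb1.
    """
    pattern = re.compile(re.escape(device) + r"(p?\d+)?")
    hits = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and pattern.fullmatch(fields[0]):
            hits.append((fields[0], fields[1]))
    return hits


def safe_to_format(device, proc_mounts_path="/proc/mounts"):
    """True if neither the device nor any of its partitions is mounted."""
    with open(proc_mounts_path) as f:
        return not mounted_under(device, f.read())
```

If `safe_to_format("/dev/mmcblk0")` returns False, unmounting every listed mountpoint first avoids writing a new filesystem underneath a live mount.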

comment:72 Changed 6 years ago by Rmyers

Are we sure this is not part of larger SD card issues? See #6154, particularly my recent comment. I've seen some of this since the supposed fix of the partition table issue. My partition table no longer goes away, but I still don't trust using an SD card.

comment:73 follow-up: Changed 6 years ago by frankprindle

  • spec_reviewed set to 0
  • spec_stage set to unknown

I just installed 8.2.0 (build 767 downloaded from http://download.laptop.org/xo-1/os/official/latest/ext3/xo-1-olpc-stream-8.2-build-767-20081001_1633-devel_ext3.img.bz2 ) on a 4GB SD card, as well as 8.2.0 (build 767 bundled with activities downloaded from http://download.laptop.org/xo-1/custom/g1g1/gg-767-4/gg-767-4.img ) on the internal NAND flash.

I was somewhat dismayed to find that apparently this ticket is NOT fixed in the official 8.2.0 release, as evidenced by the fact that suspend is inhibited when booted from the SD card. (Thus the work tracked in http://dev.laptop.org/ticket/6893 is not finished, even though that ticket is shown as closed; the inhibit is still in effect.)

Two other strange symptoms when running 8.2.0 from the SD card (which may or may not be related):

1) In the sugar control panel, both the Serial Number and Firmware version come up as Not Available.

2) The led indicator that's shaped like this: (o) does not blink with wireless network activity (it never lights up at all), even though the wireless works fine.

With the same version (I can only assume it is) booted from NAND, both of these work as expected. So something is still funny in SD land.

comment:74 in reply to: ↑ 73 Changed 6 years ago by dsaxena

  • Resolution set to fixed
  • Status changed from reopened to closed

Frank,

I just reproduced what you saw, and I am going to open two new bugs and close this one out. #6532 was specifically to track the corruption of partition tables on suspend/resume. That has been fixed, and I don't want to overload this ticket with every SD-related issue.

comment:75 Changed 6 years ago by frankprindle

OK, but if this corruption issue is fixed, why is suspend still inhibited when booting from an SD card?
