Opened 7 years ago

Closed 6 years ago

#4013 closed defect (fixed)

sd goes unusable after a suspend cycle

Reported by: PierreOssman Owned by: dsaxena
Priority: blocker Milestone: 8.2.0 (was Update.2)
Component: kernel Version: Development build as of this date
Keywords: killjoy? Cc: aab@…, dilinger, dwmw2, axboe, tomeu, marco, eeksock, kronenpj, mtd, dsaxena
Blocked By: Blocking: #6893
Deployments affected: Action Needed: never set
Verified: no

Description

Something wedges in the block layer for the mmc block driver so that sd cards go unusable after a resume.

Cards are properly detected and added to relevant buses, but it seems the pre-suspend block device (mmcblk0) is still around somewhere, blocking creation of a new one.

Attachments (1)

logs.CSN749007C4.2008-01-27.19-01-05.tar.bz2 (28.6 KB) - added by eeksock 7 years ago.
Logs for eeksock's test case

Download all attachments as: .zip

Change History (41)

comment:1 Changed 7 years ago by jg

  • Milestone changed from Never Assigned to First Deployment, V1.0
  • Priority changed from normal to blocker

comment:2 Changed 7 years ago by gnu

This may be a duplicate of #1339.

comment:3 Changed 7 years ago by PierreOssman

The problem is indeed that the old block device is still around. The python script datastore-service is keeping some files open on the old device, preventing it from being properly released. Killing the service allows the new block device to function correctly.

Three possible solutions:

  1. Figure out why the block layer is wedging a new queue because of an old one lingering around.
  1. Get in the new fancy "revoke" patches that kill off open fd:s.
  1. Get datastore-service to not keep files open when it isn't actively using them.

comment:4 Changed 7 years ago by jg

  • Cc dilinger dwmw2 added
  • Component changed from kernel to datastore
  • Owner changed from dilinger to bcsaller

comment:5 Changed 7 years ago by kimquirk

  • Keywords killjoy? added

comment:6 Changed 7 years ago by tomeu

  • Owner changed from tomeu to jg

Don't think anything can be done in the DS for Update-1 about this.

This ticket should probably be assigned to some other module and owner.

comment:7 Changed 7 years ago by jg

  • Cc axboe bcsaller added
  • Component changed from datastore to kernel
  • Owner changed from jg to dwmw2

Also causing bug #4582. We need to fix the underlying bug here, even if long term it would be nice if DS were more robust against error.

Jens, any help/advice you can give would be greatly appreciated. Dave's taking a look at this.

comment:8 Changed 7 years ago by dwmw2

This stops mmc_block from freeing up the minor number while it's still actually in use by the system (even though the card is gone). So if you still have /dev/mmcblk0 open (or mounted) when it goes away and comes back again, it'll appear as /dev/mmcblk1 instead. This matches the behaviour of other removable media.

http://david.woodhou.se/mmcblock-reuse.patch

A scheme where the existing /dev/mmcblk0 device can be 'reactivated' when the card comes back would be nice, but there are issues there if the contents of the device have been modified externally. We may need a solution for that in the CAFÉ suspend case though.

comment:9 Changed 7 years ago by cjb

  • Milestone changed from Update.2 to Retriage, Please!

We're going to need a solution to this before Update.1, lest we get a whole bunch of "OHM did an autosuspend and now my SD card doesn't work" support requests.

A final fallback, I suppose, is to have OHM refuse to suspend at all if there is a mounted SD card.

comment:10 follow-up: Changed 7 years ago by dlang

what I'm seeing is that the card is still accessable from the command line, just not from the journal, not that it's completely inacessable.

comment:11 in reply to: ↑ 10 ; follow-up: Changed 7 years ago by cjb

Replying to dlang:

what I'm seeing is that the card is still accessable from the command line, just not from the journal, not that it's completely inacessable.

That's interesting. Did you have to remount it or anything?

comment:12 Changed 7 years ago by jg

  • Cc tomeu marco added; bcsaller removed

Dave? Marco? Tomeu?

comment:13 Changed 7 years ago by dwmw2

It will have been automatically remounted, won't it? I fixed the bug which made it lock up and never work again. It doesn't surprise me that it works again from the command line. It's just that the existing file descriptors to the previously-mounted instance of the same card won't work, and the journal keeps the files open permanently.

comment:14 Changed 7 years ago by jg

  • Milestone changed from Retriage, Please! to Update.1

comment:15 Changed 7 years ago by tomeu

Ok, just tried in an update.1 build and things seem to be working surprisingly well.

After resume, HAL tells the journal that the old device disappeared. The journal then does the same we do when a kid yanks an usb stick, request the DS to unmount the xapian DB in that mount point. This fails as the files are not there anymore, but the journal copes with that and removes the icon from the devices bar.

Then when HAL announces that the new SD card device has appeared, the journal mounts it and notifies the DS to open the xapian index in there. The journal looks now as before suspending.

The only issue I see is that if not all changes had been flushed to the index when we suspended, those changes will be lost. This same problem arises if a kid yanks a usb stick.

This last issue is mitigated by the fact that we flush changes every 20 changes or every minute since the last change.

So perhaps we can live with that possibility of data loss, or we could increase the flush frequency for removable devices, or we could flush at every change. Suggestions?

comment:16 follow-up: Changed 7 years ago by dwmw2

Don't we sync before suspend?

comment:17 in reply to: ↑ 16 Changed 7 years ago by tomeu

Replying to dwmw2:

Don't we sync before suspend?

Which sync are you referring to? The DS currently doesn't know we are going to suspend, so it does nothing. The flush I was talking above is committing the pending changes from memory to disk.

If you are referring to sync in the filesystem level, then I don't know.

comment:18 in reply to: ↑ 11 Changed 7 years ago by dlang

Replying to cjb:

Replying to dlang:

what I'm seeing is that the card is still accessable from the command line, just not from the journal, not that it's completely inacessable.

That's interesting. Did you have to remount it or anything?

no, I am not remounting anything

if I manually unmount the SD card, remove the card and re-insert it, I could sometimes get it to be recognised (I've been updating frequently, so I have seen different results)

comment:19 Changed 7 years ago by eeksock

I'm actually seeing this all the time now in the Update.1-690, presumably because of the much more frequent suspension.

  1. Insert an SD Card, open a Terminal.
  1. Launch Read on a PDF from the Journal.
  1. Wait until Read suspends.
  1. Go back to the Journal.

--> SD Card has disappeared (or repeat as necessary)

  1. Go to the terminal, and enter "df" to verify that the card failed to remount
  1. The last few lines from my dmesg look like this:

[ ... ] ac97 0-0:AD1888: resuming
[ ... ] olpc-dcon: The CPU has control
[ ... ] Restarting tasks ... done.
[ ... ] PM: Adding info for mmc:mmc0:b368
[ ... ] mmcblk0: mmc0:b368 SD 1960448KiB
[ ... ] mmcblk0: p1

Manually mounting /dev/mmcblk0p1 from the Terminal works, but Journal doesn't pick it up.

comment:20 Changed 7 years ago by eeksock

Correction to my last point... Journal does pick it up the mount, it just takes a minute or two. :-)

comment:21 Changed 7 years ago by eeksock

  • Cc eeksock added

comment:22 Changed 7 years ago by tomeu

Ok, so it's working as expected, only than in your case the remount takes too much time.

Could you please attach debug logs and tell how many files you have on your SD card?

http://wiki.laptop.org/go/Attaching_Sugar_Logs_to_Tickets

Thanks!

comment:23 Changed 7 years ago by eeksock

Sorry for the confusion -- it is NOT working as expected. After a suspend from within the Read activity, I nearly always need to manually mount by opening Terminal, and entering "mount /dev/mmcblk01p1 /mount/MYCARD". Only after that does it appear in the mount table (of course) and in Journal (after about 90 seconds).

I will reproduce and attach debug logs.

Changed 7 years ago by eeksock

Logs for eeksock's test case

comment:24 Changed 7 years ago by eeksock

My steps:

  1. Restart machine.
  2. Start Terminal Activity, enter df to verify that "/mount/CRUCIAL2G" (my SD card) appears.
  3. Open Journal, Click on a PDF to launch the Read activity.
  4. Move to a page in the middle, Zoom to Width, and wait until machine Suspends (the power light blinks green).
  5. Alt-Tab to Terminal Activity, entered df multiple times to verify that "/mount/CRUCIAL2G" is gone and doesn't reappear. I did this for several minutes to make sure. I also went to Journal and waited a few minutes, to also make sure it doesn't appear there. In my past experience with Update.1-690, I have waited for up to an hour and the card never reappears. Only if I manually re-mount it does it appear.

I see errors in the Journal log file that look like this:

Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/dbus/connection.py", line 214, in maybe_handle_message
    self._handler(*args, **kwargs)
  File "/usr/share/activities/Journal.activity/volumesmanager.py", line 90, in _hal_device_added_cb
    self._add_hal_device(udi)
  File "/usr/share/activities/Journal.activity/volumesmanager.py", line 159, in _add_hal_device
    volume.Mount(mount_point, fs_type, options)
  File "/usr/lib/python2.5/site-packages/dbus/proxies.py", line 136, in __call__
    **keywords)
  File "/usr/lib/python2.5/site-packages/dbus/connection.py", line 607, in call_blocking
    message, timeout)
DBusException: org.freedesktop.Hal.Device.Volume.MountPointNotAvailable: The mount point '/media/CRUCIAL2G' is already occupied
1201460296.917376 ERROR dbus.proxies: Introspect error on :1.5:/org/freedesktop/Hal/devices/volume_uuid_3E30_F2D0: dbus.exceptions.DBusException: org.freedesktop.Hal.NoSuchDevice: No device with id /org/freedesktop/Hal/devices/volume_uuid_3E30_F2D0

Seems related to the problem?

comment:25 Changed 7 years ago by eeksock

Above, I mean "/media/CRUCIAL2G" instead of "/mount/CRUCIAL2G".

comment:26 Changed 7 years ago by tomeu

I think you have found #6029. If you agree, please try again with the next update.1 build, that should fix it.

comment:27 follow-up: Changed 7 years ago by kronenpj

I'm seeing that the partition table on the SD card is getting corrupted. Do we think this is the same bug or should I open a new one? I've seen this behavior after an idle (suspend?) / resume cycle on both candidate-691 and joyride-1713.

comment:28 Changed 7 years ago by kronenpj

I should probably also mention that I'm using a swapfile located on the SD card.
Also Candidate-691 completely corrupted the EXT3 filesystem on the card while joyride-1713 only damaged it a little bit. Both nuked the partition table though.

comment:29 in reply to: ↑ 27 Changed 7 years ago by Andrew Burgess

  • Cc aab@… added

Replying to kronenpj:

I'm seeing that the partition table on the SD card is getting corrupted. Do we think this is the same bug or should I open a new one? I've seen this behavior after an idle (suspend?) / resume cycle on both candidate-691 and joyride-1713.

Are you using swap to the sd card? I used to see corrupt
partition tables alot until I stumbled upon the workaround
of swapoff before powerdown. It happened with last weeks joyride
(1700-ish) when I forgot to swapoff so it hasn't been
fixed yet.

comment:30 Changed 7 years ago by kronenpj

  • Cc kronenpj added

I haven't tried that yet. Hopefully I'll get a chance to work on it over the next few days.

comment:31 Changed 7 years ago by karmaflux

When resuming, the device that went into suspend as /dev/mmcblk01 is rediscovered as /dev/mmcblk02. This means you can't suspend if you booted off your SD card, since you'll lose your root filesystem during resume.

Is this still actually slated for repair in update.1?

comment:32 Changed 7 years ago by helmingstay

I hope to use an SD card for swap sometime in the future, and I'm really interested in hearing about developments in this bug.

comment:33 Changed 7 years ago by frankprindle

Due to ticket #6532 (q.v.), I could never get far enough to find out if this bug still exists in 703 now that auto-suspend has been disabled; it still may be a problem with manual suspend, but no SD card of mine is going back into an XO running build 703 or later until #6532 has been reliably repaired. Hopefully in the process of fixing that (IMHO more critical) bug, this one will get repaired along the way.

Yes, Update.1 should neither be mysteriously unmounting nor corrupting SD cards; to do so would be a serious regression from Ship2.2, which I can reliably boot from an SD card (with the OFW upgraded to Q2D13 or Q2D14) and have it run for days, even weeks without a problem.

comment:34 Changed 7 years ago by cjb

  • Blocking 6893 added

comment:35 Changed 7 years ago by mtd

  • Cc mtd added

comment:36 Changed 7 years ago by cscott

Are all the reporters swapping to SD? From my reading of this bug, it seems they are. If so, this seems to be a relnote bug, not a blocker. Don't swap to SD and suspend, kids.

comment:37 Changed 7 years ago by eeksock

No swapping to SD here, and I still get it with the latest Update.1 candidate if I re-enable suspending the SD card (by default it is now turned off).

All I need to do is:

  1. open a Terminal
  2. mount the SD card and cd to /mnt/sd (I am not currently auto-mounting via Sugar, and am running under Fluxbox)
  3. wait for the SD card to be suspended
  4. wake up

If I wait for more than a couple of seconds, then at this point either when I enter "ls", the SD card contents now shows up empty OR I get scrambled, random characters for the directory contents.

comment:38 Changed 7 years ago by dsaxena

  • Cc dsaxena added

comment:39 Changed 6 years ago by dsaxena

  • Owner changed from dwmw2 to dsaxena
  • Status changed from new to assigned

comment:40 Changed 6 years ago by dsaxena

  • Action Needed set to never set
  • Resolution set to fixed
  • Status changed from assigned to closed

I am marking this as fixed as we have the patch for #6532 in our Joyride builds and in all our kernel branches.

Note: See TracTickets for help on using tickets.