Opened 7 years ago

Closed 7 years ago

Last modified 5 years ago

#5744 closed defect (fixed)

"Resuming" a large file from USB copies it into NAND (filling NAND)

Reported by: gnu Owned by: tomeu
Priority: high Milestone: Update.1
Component: sugar-datastore Version: Build 653
Keywords: relnote Cc: jg, kimquirk, legutierr, mstone, krstic, walter
Blocked By: Blocking:
Deployments affected: Action Needed:
Verified: no

Description

(From rt# 2968) User was browsing media from a USB drive, using Watch & Listen. Their internal flash filled up, and it made their system unbootable.

I tried to reproduce it tonight. Easy.

I installed Watch&Listen from the wiki Activities page, put a 300+MB mp3 file on a USB memory stick, inserted it into the XO, went to the journal, clicked on the USB logo, found the file (despite the Journal tearing off half of its file name), clicked it to get details, hovered over the upper right icon, and resumed it with Watch & Listen. I had less than 300MB free in my internal NAND flash.

Something started copying the file into internal flash. From previous experience (#5719) I looked at the datastore process. It had 21 file descriptors open (in /proc/1618/fd), and 20 and 21 were:

20 -> /media/COMPUSA/raiders_of_old_california.mp3

21 -> /home/olpc/.sugar/default/data/raiders_of_old_california.mp3

The file was copied until the NAND filesystem filled up. (9 MB free). A truncated copy of the file was left in .sugar/default/data.

After I closed the "detail" page from the Journal, a second copy of the file started to get made in the data/ directory!!! It didn't get very far, since there was minimal space left.

The Journal was unable to unmount the USB stick, because "umount" failed with a file system full error.

Earlier, I tried the same thing with a 15MB mp3 file. The file also got copied to the datastore. It was there while the file was played by the media player. It was removed after the media player terminated.

Then I rebooted and tried the same thing with a 125MB Ogg Theora file from the Ubuntu Month of Screencasts (http://screencasts.ubuntu.com/videos/20070911_updating_and_upgrading_theora_400k_vorbis_1280x720.ogg). The Watch&Listen activity started up, even while the USB light was blinking as the file was copied -- but the activity could not access the file. The Play button was useless; the timeline didn't show the length of the video file. Not until the entire file was copied -- more than a minute. Ultimately it never was able to play the file at all -- I don't know why. But TWO copies of it were made in default/data -- not two links to the same file, but two separate copies. I don't know why. I terminated the program, yet the file copies remained! When I tried to resume the same file with EToys, it made a THIRD copy, which filled the file system, causing all kinds of mischief including making X crash and restart every 20 seconds or so.

After deleting the files to make the X crashes stop, I then went back and tried resuming this file with eToys. eToys started up, said "Opening journal entry 20070911_updating_and_upgrading_theora_400k_vorbis_1280x720", then shortly afterward, said "Error: dbus send timed out". Apparently it was copying the file yet again, and etoys was too impatient to wait. Yep, even after it timed out, the USB access light is flashing, and now another copy is appearing.

There's something deeply broken about how Sugar, the Journal, and the Datastore deal with files -- particularly on removable or networked media.

The current design precludes an activity from working with any file larger than the available NAND space (1GB maximum). No activity is ever going to play a DVD off a USB DVD drive. No activity is going to be able to record more than 1GB of video from the onboard video camera, no matter if 300GB of hard-drive space is available (either plugged in on USB, or over the network in a school server).

Even if the current policy of copy-everything was defensible, it's incredibly inefficient. The activity can't use the file until it has been fully copied (and automatically compressed!). Then the copy has to be deleted. Both operations are expensive and useless, when all that is really needed is an open().

Instead, like everything else on Linux, activities should be reading the file(s) they operate on directly from whatever filesystem they are on -- whether that's local or remote, removable or not.

Change History (21)

comment:1 Changed 7 years ago by gnu

  • Cc jg kimquirk added

comment:2 Changed 7 years ago by legutierr

  • Cc legutierr added

comment:3 Changed 7 years ago by jg

  • Milestone changed from Never Assigned to Update.1

yuck... Filling nand with even a temporary file sucks. We have to be able to play in place....

comment:4 Changed 7 years ago by marco

  • Cc mstone krstic added

comment:5 Changed 7 years ago by tomeu

We chose to not allow direct access to files in usb sticks because we didn't find a good solution from the security point of view that we felt we were able to implement in time for update.1. That area in the DS is already quite fragile and we lack resources to safely mess in there.

Do we really need this feature for update.1? Cannot release note it?

Should the DS instead do something to limit the amount of space that it can take on NAND?

comment:6 Changed 7 years ago by jg

  • Keywords relnote added; security? removed
  • Milestone changed from Update.1 to Update.2

We must fix this for Update.2. We need *something* to avoid this failure mode in update.1....

It really isn't acceptable to fill the NAND, thereby inducing multiple failures....

comment:7 Changed 7 years ago by jg

  • Cc walter added

Worst of all, the more full NAND becomes, the less space you have and the more likely you are to want to use USB/SD....

We may need to temporarily sacrifice security goals on this issue. I'd like to hear from Walter and Ivan on this.

comment:8 Changed 7 years ago by walter

Obviously we need to fix this, but for what release? It is not clear from the thread how invasive a change this is to the Datastore/Journal. If it requires more than a small patch, I don't think we can do it for Update.1.

comment:9 Changed 7 years ago by jg

It sounds from Tomeu's mail that the easy solutions require relaxing security.

Tomeu, can you give us more insight into the implementation tradeoffs?

comment:10 Changed 7 years ago by jg

I meant to say, Tomeu's reply...

comment:11 follow-up: Changed 7 years ago by walter

While it was clear from Tomeu's reply that the motivation for file copy was a security concern, it wasn't clear that there was an easy fix, regardless of whether or not these concerns were relaxed. Also, it is not clear what security concern is addressed by the file copy. Please do give us more insight into the problem.

comment:12 in reply to: ↑ 11 Changed 7 years ago by gnu

Replying to walter:

Also, it is not clear what security concern is addressed by the file copy.

If the datastore handed over a file descriptor, opened for reading only, to the activity, it's not clear what mischief the activity could do. fchmod perhaps? How much do we care?

(Note: Unix domain sockets can pass file descriptors from process to process. Also, if we used this capability, it would be yet another way that Sugar activities would break when run like any other X client (over a network).]

...it wasn't clear that there was an easy fix...

Yes. it appears that an API change is needed, which will require changes in all activities. The current API requires the activity to open, use, and then *delete* the file whose name it was handed by the datastore!

comment:13 Changed 7 years ago by krstic

Passing fds is how I originally proposed this is implemented. A saner API will be introduced during the datastore rewrite starting in January; you should consider the current datastore as a (bad) proof of concept.

comment:14 follow-ups: Changed 7 years ago by tomeu

This seems to make the trick. Any better idea?

diff --git a/src/olpc/datastore/backingstore.py b/src/olpc/datastore/backingstore.py
index e900649..1083ba1 100644
--- a/src/olpc/datastore/backingstore.py
+++ b/src/olpc/datastore/backingstore.py
@@ -410,14 +410,13 @@ class FileBackingStore(BackingStore):
             if e.errno != errno.EPERM:
                 raise
 
-        # Try to link from the original file to the targetpath. This can fail if
-        # the file is in a different filesystem. Do a copy instead.
         try:
             os.link(path, targetpath)
         except OSError, e:
             if e.errno == errno.EXDEV:
-                shutil.copy(path, targetpath)
-                os.chmod(targetpath, 0604)
+                # The hard link failed because the file is in another file system.
+                # Return the original file path instead of a link.
+                targetpath = path
             else:
                 raise
             

comment:15 in reply to: ↑ 14 Changed 7 years ago by tomeu

Replying to tomeu:

This seems to make the trick. Any better idea?

Just wanted to warn that this changes the semantics of the API, as until now the returned file path pointed to a temp file that the caller was responsible for cleaning up. This means in practice that activities will need to be aware that the file is in a removable device and that must not delete the file once it has finished working with the object.

If activities don't adapt to this API change, files will be deleted from the usb stick.

comment:16 in reply to: ↑ 14 ; follow-up: Changed 7 years ago by gnu

Replying to tomeu:

This seems to make the trick. Any better idea?

Changing the semantics of Activity<->Datastore communication seems like a terrible way to fix this soon.

Have you considered making a symlink from the Datastore directory to the actual file on another filesystem, when a hard link fails? Then the activity can delete the symlink without deleting the actual file.

comment:17 in reply to: ↑ 16 Changed 7 years ago by tomeu

Replying to gnu:

Have you considered making a symlink from the Datastore directory to the actual file on another filesystem, when a hard link fails? Then the activity can delete the symlink without deleting the actual file.

Yeah, if there's no better solution that's what I would implement. Should be equally easy.

The only problem I see is that usb sticks will need to be readable for all activities. Michael, is that ok?

comment:18 Changed 7 years ago by tomeu

Simple patch for doing a symlink instead of a copy:

diff --git a/src/olpc/datastore/backingstore.py b/src/olpc/datastore/backingstore.py
index e900649..1af0088 100644
--- a/src/olpc/datastore/backingstore.py
+++ b/src/olpc/datastore/backingstore.py
@@ -410,14 +410,13 @@ class FileBackingStore(BackingStore):
             if e.errno != errno.EPERM:
                 raise
 
-        # Try to link from the original file to the targetpath. This can fail if
-        # the file is in a different filesystem. Do a copy instead.
+        # Try to hardlink from the original file to the targetpath. This can
+        # fail if the file is in a different filesystem. Do a symlink instead.
         try:
             os.link(path, targetpath)
         except OSError, e:
             if e.errno == errno.EXDEV:
-                shutil.copy(path, targetpath)
-                os.chmod(targetpath, 0604)
+                os.symlink(path, targetpath)
             else:
                 raise
             

Anybody sees any problem with this approach?

comment:19 Changed 7 years ago by jg

  • Milestone changed from Update.2 to Update.1

This is a pretty small patch, and the problem pretty huge.

Unless there are objections, I'm going to target to Update.1, presuming it passes review.

comment:20 Changed 7 years ago by mstone

I spoke with Ivan a few minutes ago and we both agree that the best we can do for the duration of Update.1 is to give activities free reign to access the USB stick with the caveat that we'd like to revisit this decision in the future to look for a more satisfying solution.

If providing symlinks seems like the lowest-cost way of preserving the existing datastore/activity API then let's try it out and see what happens.

comment:21 Changed 7 years ago by tomeu

  • Resolution set to fixed
  • Status changed from new to closed

Verified fixed in joyride 1522. See #5920 for the approval request.

Note: See TracTickets for help on using tickets.