Ticket #9623 (closed defect: fixed)

Opened 5 years ago

Last modified 4 years ago

test/fix disk full behaviour

Reported by: dsd Owned by: dsd
Priority: normal Milestone: 10.1.2
Component: distro Version: 1.5 Software Build os206 aka 10.1.1
Keywords: Cc:
Action Needed: no action Verified: no
Deployments affected: Blocked By:
Blocking:

Description

In 8.2 we took steps towards putting various files (e.g. Xauthority) in /tmp so that X could still launch on disk full.

Probably need to do the same things again.

Attachments

0001-ignore-logging-cleanup-fail-sugarlabs.org-1720-lapto.patch (1.0 kB) - added by Quozl 4 years ago.
Backport of SL 1720 fix to Sugar 0.84
0001-fix-ignore-logging-cleanup-fail-laptop.org-9623.patch (1.1 kB) - added by Quozl 4 years ago.
Tested os303 with full disk, failed with warning about exception and the same error. Previous patch in this ticket was incorrect; the call to cleanup_logs() was indeed protected by an exception handler, but then it was called again. This patch removes the extra call.

Change History

  Changed 5 years ago by Quozl

  • next_action changed from never set to reproduce

triage. agreed.

on 802 the user olpc has XAUTHORITY set by olpc-dm to /var/log/olpc-auth/.Xauthority, and /var/log is on tmpfs.

on 802 we also did work to reduce writes to filesystem by placing frequently changed files. see /etc/rwtab, /etc/statetab and /proc/mounts.

other possibilities:

  • partition user storage separate to operating system storage, can do more easily now that we have a flash translation layer,
  • disk quotas.

follow-up: ↓ 5   Changed 5 years ago by cjb

I don't think we need to work on reducing writes for its own sake anymore, since that was a pathology specific to jffs2, but I agree that we should make sure our disk full behavior is at least as good as 802.

  Changed 5 years ago by Quozl

On os34 and os36 as root:

  • cat < /dev/zero > /big-file until error is generated showing disk is full,
  • sync and wait for completion (several minutes),
  • Ctrl-Alt-Erase and X server loops attempting to restart,
  • on os34 power cycle results in the multilingual text console DISKFULL warning, a Journal Full warning, and eventually a functioning system,
  • on os36 power cycle results in hang in boot animation, no obvious way to recover, used fs-update to recover.

Note that this is an unfair test, as the protection afforded by the reserved blocks feature of ext3 was not employed.

  Changed 5 years ago by Quozl

Did some more testing. Conclusions:

  • filling disk as user may cause a long reboot delay, but otherwise is a recoverable situation,
  • ext3 write rate is much lower than read rate,
  • ext3 fsck on full disk takes almost three minutes.

On os34 as olpc:

  • time (dd if=/dev/zero of=big bs=1024; sync)
  • total data written 2403237888 (2.4Gb),
  • total running time 12min 57sec,
  • calculated data rate 2.9MByte/sec,
  • reading back the file took 2min 21sec, or 16.25MByte/sec,
  • Ctrl-Alt-Erase and X server restarts successfully, Sugar prompts with "Your Journal is full" and Terminal can be started,
  • forced power cycle, results in hang in boot animation, didn't diagnose further.

On os36 as olpc:

  • same use of dd,
  • total data written 2321461248 (2.3Gb),
  • total running time 12min 38 sec,
  • Ctrl-Alt-Erase and X server restarts repeatedly,
  • forced power cycle, results in hang in boot animation, single dot, repeating using tick key on boot ... the last line logged shows it hung doing an ext3 fsck ... dmesg showed it took 296 seconds to fsck, then the system operated normally,
  • a normal reboot results in the DISKFULL warning, the deletion of several activities, and the Journal Full warning.

in reply to: ↑ 2   Changed 5 years ago by mstone

Replying to cjb:

I don't think we need to work on reducing writes for its own sake anymore, since that was a pathology specific to jffs2, but I agree that we should make sure our disk full behavior is at least as good as 802.

For the record, the primary reason for introducing all these tmpfsen was to speed up olpc-update by enabling use of the faster "pristine-tree" update algorithm as opposed to the more expensive "dirty-tree" update algorithm.

follow-up: ↓ 7   Changed 5 years ago by dsd

I would think that any reason to reduce writes on jffs2 still applies because at the end of the day there is something similar running on the SD card microcontroller.

and yes, pristine update is something that we're interested in

in reply to: ↑ 6   Changed 5 years ago by cjb

Replying to dsd:

I would think that any reason to reduce writes on jffs2 still applies because at the end of the day there is something similar running on the SD card microcontroller.

It depends what the reason to reduce writes was. :)

My understanding is that each short write created a new jffs2 node, which made the tree-building stage of mounting jffs2 take longer each boot. So, purely a problem of software data structures related to the SD microcontroller.

  Changed 5 years ago by cjb

.. *un*related to the SD microcontroller, sorry.

  Changed 5 years ago by triagebot

  • milestone changed from 1.5-software to 1.5-software-final

changed by irc user Quozl:

  Changed 5 years ago by Quozl

  • milestone changed from 1.5-software-final to 1.5-software-update

Ticket moved out of 1.5-software-final to 1.5-software-update as a result of a software manufacturing release triage meeting. Per ed, dsd, cjb, reuben, quozl.

  Changed 5 years ago by Quozl

Tested on os110-2g on XO-1.5 C1:

  • same use of dd as above,
  • total data written 269582336 bytes (270MB),
  • total running time 1min 49 sec,
  • Ctrl-Alt-Erase and X server restarts repeatedly, until respawning too fast occurs in /var/log/messages, see below for cause,
  • forced power cycle, results in a short hang in boot animation, single dot, (probably ext3 fsck), then the system operated normally,
  • the DISKFULL warning appeared, followed by the deletion of Wikipedia activity, but not the Journal Full warning.

/tmp/olpc-dm-client.error.log showed sugar-session failure on X server restart was due to requiring free space to create a log cleanup directory:

Traceback (most recent call last):
  File "/usr/bin/sugar-session", line 185, in <module>
    main()
  File "/usr/bin/sugar-session", line 141, in main
    cleanup_logs()
  File "/usr/bin/sugar-session", line 80, in cleanup_logs
    os.mkdir(backup_dir)
OSError: [Errno 28] No queda espacio en el dispositivo: '/home/olpc/.sugar/default/logs/1265854382'

  Changed 5 years ago by cjb

It's hard to know what to do about this. We do want logs/ to be on-disk so that events leading up to a system crash can be examined after reboot. The disk full script kicked in appropriately on reboot.

So, barring a decision to move logs/ to RAM, I think the best we could do is to catch errors from sugar's attempt to write to its logfiles, and ignore them or switch to logging on tmpfs instead.

  Changed 5 years ago by Quozl

logs/ being on disk is secondary to system usability, when this choice must be made.

Sugar did have this code protected by an exception catcher but it was removed in commit 662dcc82e, during a redesign, where logging was moved to a service ... then it was left out when log cleanup was brought back into sugar-session (b75ec16b1). I'll suggest a patch.

This also slightly relates to #9924 (logs opened O_SYNC) ... choice between user performance and useful logs for infrequent post-event analysis.

  Changed 5 years ago by Quozl

Ticket sugarlabs.org #1720 raised for the failure to start Sugar when disk is full or read-only.

  Changed 4 years ago by Quozl

  • next_action changed from reproduce to review
  • version changed from not specified to 1.5 Software Build os206 aka 10.1.1
  • milestone changed from 1.5-software-later to 10.1.2

Ticket sugarlabs.org #1720 was closed after a patch was merged to HEAD. This didn't fix it for 0.84. I'll rebase the patch for 0.84.

Changed 4 years ago by Quozl

Backport of SL 1720 fix to Sugar 0.84

  Changed 4 years ago by dsd

  • next_action changed from review to test in build

test sugar-0.84.19 in next build

Changed 4 years ago by Quozl

Tested os303 with full disk, failed with warning about exception and the same error. Previous patch in this ticket was incorrect; the call to cleanup_logs() was indeed protected by an exception handler, but then it was called again. This patch removes the extra call.

  Changed 4 years ago by Quozl

  • next_action changed from test in build to review

XO-1.5 10.1.2 os303 (4Gb image) took 14 minutes to fill disk, 5 minutes to read the file back (2Gb). With the fix above, Sugar displayed the journal full dialog after Ctrl/Alt/Erase. On reboot the "Not enough disk space." message appeared, and Wikipedia activity was deleted.

  Changed 4 years ago by cjb

Sounds good to me; go ahead and close if you're comfortable with the behavior now.

  Changed 4 years ago by Quozl

  • next_action changed from review to package

Failed to set ticket next action correctly, sorry. I've committed a fix to git, it needs to be picked up in the next point release of 0.84.

http://git.sugarlabs.org/projects/sugar/repos/mainline/commits/9b593b8bf176738ba0dceba78d4d2149c17db18d

  Changed 4 years ago by dsd

  • next_action changed from package to test in build

sugar-0.84.22 queued for next build

  Changed 4 years ago by Quozl

  • status changed from new to closed
  • next_action changed from test in build to no action
  • resolution set to fixed

Tested in os850 on XO-1.5 using above reproducers. Your journal is full correctly appears and Sugar does not take an exception on startup. The code change is also present.

Note: See TracTickets for help on using tickets.