Ticket #5954 (new defect)

Opened 6 years ago

Last modified 6 years ago

oom-killer during olpc-update

Reported by: DanKrejsa Owned by: cscott
Priority: high Milestone: 8.2.0 (was Update.2)
Component: upgrade utility Version:
Keywords: oom-killer olpc-update blocks-:8.2.0 Cc: kimquirk, chihyu, rsmith, ridderman, mtd, joe, gregorio
Action Needed: test in release Verified: yes
Deployments affected: Blocked By:
Blocking:

Description

I just tried olpc update from joyride-1500 to joyride-1525, on a G1G1 XO. I ran

sudo olpc-update joyride-1500

from the terminal activity, and sometime during (or after) the irsync-dirty part of the update, X restarted. I thought the laptop was rebooting, but examining /var/log/messages later I saw that 'NetworkManager invoked oom-killer'.

I've put the log at http://pastebin.ca/849111 . (Also attached.)

I had no other activities than terminal (& Journal) running at the time, and wasn't touching the laptop. I suspect there may be something random to this that may make it hard to reproduce.

I later succeeded doing the update from a usb key using

olpc-update --full --usb

That could be a workaround if this happens frequently.

I would guess there are two aspects to this -- the general handling of out-of-memory conditions by the XO, and the particular usage of memory during olpc-update that made the system run into the limit.

Attachments

messages (59.9 kB) - added by DanKrejsa 6 years ago.
/var/log/messages after oom-killer during olpc-update
pre_update_ps (6.5 kB) - added by DanKrejsa 6 years ago.
'ps auxww' output just before olpc-update attempt.
oom.log (112.5 kB) - added by chihyu 6 years ago.
oom2.log (120.9 kB) - added by chihyu 6 years ago.
dmesg.gz (16.9 kB) - added by Quozl 6 years ago.
messages.gz (16.2 kB) - added by Quozl 6 years ago.
ridderman_messages (72.1 kB) - added by ridderman 6 years ago.
/var/log/messages
ridderman_sugar_logs_1218222905.tar (80.5 kB) - added by ridderman 6 years ago.
logger2.tar.gz (18.0 kB) - added by ridderman 6 years ago.
ridderman_logger4.tar.gz (28.7 kB) - added by ridderman 6 years ago.

Change History

Changed 6 years ago by DanKrejsa

/var/log/messages after oom-killer during olpc-update

  Changed 6 years ago by jg

  • owner changed from jg to cscott
  • priority changed from normal to high
  • component changed from distro to upgrade utility
  • milestone changed from Never Assigned to Update.1

Thanks for the report.

Had your system been running for an extended period before the update?

It would be very helpful if you could do a ps aux and grab it before doing updates, so we can see if there is a memory leak in some process doing us in before you even start...

  Changed 6 years ago by DanKrejsa

Hi, My XO had not been running for long before I started the update. The next time I try an update, though, I'll capture the ps output before I start it.

  Changed 6 years ago by DanKrejsa

Hi, Last night I tried upgrading from joyride-1525 to joyride-1536. This time, the upgrade process itself apparently went flawlessly; in fact, this is the first time I've seen the irsync-dirty phase succeed without getting an ugly error and having to fall back upon the full rsync. It may be of limited value since this upgrade succeeded, but I'll attach the requested 'ps auxww' output, as well as a log of the olpc-update output. I figure this is either intermittent or fixed. Is there any other info that it would help to collect either before an upgrade attempt, or after the problem recurs (if it does)? If not, I'll follow my nose.

Changed 6 years ago by DanKrejsa

'ps auxww' output just before olpc-update attempt.

  Changed 6 years ago by DanKrejsa

Somehow lost the olpc-update output, but it probably wasn't very interesting anyhow. (BTW, trying to upload the attachment to this ticket from the XO's Browse activity was quite awkward, given that the file wasn't in the Journal. Sigh. Wasn't helped that with the current image, my SDHC card seems to be dying when the XO goes to sleep. Sigh Sigh. ;)

  Changed 6 years ago by cscott

  • priority changed from high to normal
  • milestone changed from Update.1 to Retriage, Please!

This seems to be a problem with memory usage on the system and the oom killer in general, rather than in olpc-update -- although I'd be very interested in hearing whether olpc-update is using "abnormal" amounts of memory. Not sure what the next step on this bug would be.

follow-ups: ↓ 7 ↓ 10   Changed 6 years ago by jg

  • cc kimquirk, chihyu added

The next step is much more extensive update testing, to see if we see it much.

I'm going to retarget to Update.2 for now: we'll monitor this during testing. Kim, Chihyuh, please keep your eye out for instances of this.

in reply to: ↑ 6 ; follow-up: ↓ 9   Changed 6 years ago by chihyu

I cannot reproduce this by olpc-update joyride-1541 from joyride-1537 with firmware Q2D08, but I will keep paying attention to this every time I upgrade.

Replying to jg:

The next step is much more extensive update testing, to see if we see it much. I'm going to retarget to Update.2 for now: we'll monitor this during testing. Kim, Chihyuh, please keep your eye out for instances of this.

  Changed 6 years ago by jg

  • milestone changed from Retriage, Please! to Update.2

in reply to: ↑ 7   Changed 6 years ago by chihyu

I have been informed that a "memory error" I encountered (#5805) should be related to this issue.

Replying to chihyu:

I cannot reproduce this by olpc-update joyride-1541 from joyride-1537 with firmware Q2D08, but I will keep paying attention to this every time I upgrade. Replying to jg:

The next step is much more extensive update testing, to see if we see it much. I'm going to retarget to Update.2 for now: we'll monitor this during testing. Kim, Chihyuh, please keep your eye out for instances of this.

in reply to: ↑ 6   Changed 6 years ago by chihyu

The same issue occurred again on two MP machines, with joyride-1535, firmware q2d08a.

  • Hit Ctrl-Alt-Mesh
  • (login as root) Run olpc-update update.1-690
  • System returns "out of memory"

Please see attachment for details.

Replying to jg:

The next step is much more extensive update testing, to see if we see it much. I'm going to retarget to Update.2 for now: we'll monitor this during testing. Kim, Chihyuh, please keep your eye out for instances of this.

Changed 6 years ago by chihyu

Changed 6 years ago by chihyu

  Changed 6 years ago by chihyu

  • cc rsmith added

  Changed 6 years ago by Quozl

  • next_action set to never set
  • verified set

Reproduced on Joyride 2269. New logs to be attached. Circumstances were olpc-update within Terminal.

Changed 6 years ago by Quozl

Changed 6 years ago by Quozl

  Changed 6 years ago by ridderman

  • cc ridderman added
  • keywords blocks?:8.2.0 added
  • priority changed from normal to high

I've been experiencing this on multiple occasions for months, and just recently found this bug. About 50% of the time that I try olpc-update, Sugar restarts. I used to get MemoryError tracebacks on the console, but more recently it fails with a Sugar restart due to the oom condition. I'm guessing that this is not happeninng to many people, or else it would have been noticed and fixed by now. Here is my setup:

G1G1, upgraded to several Update1 candidates, then many joyride images. When I run olpc-update, I su -l in the Terminal. I make sure that all activities are closed besides Terminal. I used to not know any better and would have one or two activities open. I think, anecdotaly, that if I've opened and closed other activites prior to the Terminal, the oom is more likely to occur. This might be because more memory is reserved for the system cache. Also, irsync_pristine fails every time on the file "localtime". irsync_dirty usually either works or the oom happens here. A few times, it's failed with a traceback and rsync has worked. I'll attach my /var/log/messages and sugar logs.

If this is happening to all G1G1, then I would consider it an 8.2.0 blocker. I've added the blocks?:8.2.0 keyword to get others' take on the situation.

Changed 6 years ago by ridderman

/var/log/messages

Changed 6 years ago by ridderman

  Changed 6 years ago by mtd

  • cc mtd added

Sorry for the spam, but as people seem to be finidng this ticket it may help them to know about compcache from #28: http://dev.laptop.org/ticket/28#comment:19

  Changed 6 years ago by cscott

Can I get an actual memory usage profile from a machine showing this problem, just before it crashes? I can't tell if the root cause is that Browse (say) is leaking, and sometimes olpc-update pushes the whole system over the edge, or if olpc-update really is taking an abnormal amount of memory itself (I don't think that it is).

  Changed 6 years ago by cscott

Oh, also: knowing exactly *where* the memory error is happening would help a lot as well. Bug #5805 seems to indicate a memory failure when we read() an entire file into memory in order to compute its hash; this probably has bad behavior if there are large files in the system (like, for instance, if the user has manually installed additional packages which include some very large files). That's probably a bug I can fix easily.

  Changed 6 years ago by ridderman

I will run olpc-update with -v next time and pipe the output to a file. This should provide enough info about where the crash occurs. I also wrote up a script to log top to a file every 5 seconds. Does this sound like enough memory info?

Also, I can tell you that free generally reports ~40 MB of free memory before I run olpc-update.

Changed 6 years ago by ridderman

  Changed 6 years ago by ridderman

I've attached logs of olpc-update output as well as running top every 10 seconds in a python script. I think I had ~ 20 MB of available memory before the update (fresh reboot), and it ended up restarting Sugar. I was trying to update from 2346 to 2363. It seems that the system should have more available memory, and that olpc-update should be more tolerant of it and fail gracefully.

  Changed 6 years ago by ridderman

I've attached another set of logs. This time I sucessfully upgraded from 2346 to 2373, however the dirty irsync failed with a Python Memory error. Because Python caught it, the irsync failed gracefully and the rsync worked. Note that I had much more available memory this time before the update (~45 MB). I had used the laptop quite a bit since the last restart but my available memory actually increased over time. I did not open Browse and did not acquire an internet connection - not sure if this is related.

The memory error is the same one that I've seen in the past:

File "/usr/lib/python2.5/site-packages/bitfrost/contents/utils.py", line 100, in mkdirobject

entryh? = mkhashes(file(fullname).read())

MemoryError

This happens is in the "Creating contents for existing tree" step. See the logs for more detail. This step is also where I've seen the OOM failures.

Now that I've upgraded to 2373, I have the new olpc-update which has a fix for #8190. I'll post my results the next time I get a chance to update my laptop.

Changed 6 years ago by ridderman

  Changed 6 years ago by cscott

There were fixes to the MemoryError and dirty /etc/localtime problems landed in joyride-2369; I believe this will fix your OOM. Please test.

  Changed 6 years ago by mstone

  • next_action changed from never set to add to build

Scott - please push this into the 8.2 release build when you are satisfied with your patches. I look forward to seeing your ChangeLog entry.

  Changed 6 years ago by mstone

  • keywords blocks-:8.2.0 added; blocks?:8.2.0 removed

This will not block the release. However, since Scott already has changes in the pipeline for it, intend to try to accept them.

  Changed 6 years ago by cscott

  • next_action changed from add to build to test in release

Package involved: olpc-contents 2.5-1

Added to stable stream; should be in release 758 and following.

  Changed 6 years ago by ridderman

I upgraded from joyride-2373 to joyride-2386 and all worked well! irsync_pristine worked and I did not have any memory errors. I can't test the memory errors because they would not occur until the irsync_dirty stage.

FYI, I had about 20 MB of free memory before the update.

  Changed 6 years ago by joe

  • cc joe, gregorio added

While updating to 8.2-759 (olpc-update), I noticed that (just as with previous updates) sometimes Sugar restarts automatically immediately after an update is completed. In this case the Control Panel still shows the presence of the previous build, which gets fixed after the machine is rebooted.

This usually happens with 10% - 15% of updated machines.

Note: See TracTickets for help on using tickets.