Ticket #6797 (closed defect: fixed)

Opened 6 years ago

Last modified 6 years ago

rainbow-0.7.11 races against X startup

Reported by: mstone
Owned by: bobby
Priority: high
Milestone: 8.2.0 (was Update.2)
Component: security
Version:
Keywords: 8.2.0:+
Cc: tomeu, Blaketh, bemasc, mtd, dgilmore
Action Needed: no action
Verified: no
Deployments affected:
Blocked By:
Blocking: #7395

Description

preload_common_modules() is called eagerly. This means that we eagerly import gtk which will fail if X has not yet started (as is the case on B2s).
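
To illustrate the failure mode described above, here is a minimal sketch of a preload step that tolerates a module failing to import instead of aborting. It is not rainbow's actual preload_common_modules(); the function name preload_modules and its return shape are hypothetical. It catches Exception rather than only ImportError on the assumption that a failed X connection may surface as some other exception type.

```python
import importlib


def preload_modules(names):
    """Try to import each named module; never raise.

    Returns (loaded, failed): a list of module names that imported
    cleanly and a dict mapping each failing name to its exception.
    Hypothetical helper, for illustration only.
    """
    loaded = []
    failed = {}
    for name in names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # e.g. gtk failing because X is not up yet
            failed[name] = exc
        else:
            loaded.append(name)
    return loaded, failed
```

A caller could retry the names in failed later, once X is known to be running, rather than giving up on preloading for the whole session.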

Attachments

rainbow (288 bytes) - added by bobby 6 years ago.
upstart job for rainbow

Change History

  Changed 6 years ago by mstone

  • cc Blaketh added

  Changed 6 years ago by Blaketh

Temporary workaround: renaming /etc/init.d/S50rainbow to /etc/init.d/S98rainbow works around the problem on my B2.

  Changed 6 years ago by bemasc

This problem is also present in Joyride-1896 when running in an Arabic locale. It appears that the additional work of loading/rendering Arabic fonts causes X startup to take longer, which triggers the race condition.

  Changed 6 years ago by bemasc

  • cc bemasc added

Clarification: running joyride-1896 (latest joyride) on B4 hardware, this race condition occurs reliably.

  Changed 6 years ago by tomeu

In Ubuntu, pygtk's failure to open a connection to X will cause a warning, instead of an exception.

Does anybody know why Fedora doesn't do the same?

  Changed 6 years ago by mtd

  • cc mtd added

METHREE (joyride-1897). MP/C2/G1G1 XO. A lot of times rainbow starts ok :). But sometimes not. Ctrl-Alt-Esc / init 3 ; sleep 5 ; init 5 also of course trigger the problem.

  Changed 6 years ago by mstone

The patches beneath http://lists.laptop.org/pipermail/code-review/2008-April/000000.html contain my best idea so far of how to work around this problem. I'll try to throw them into a build soon, but feel free to race me.

  Changed 6 years ago by tomeu

  • next_action set to never set

Looks like upstream pygtk would just log a warning in that case, but Fedora carries a patch that causes it to throw an exception:

pygtk-nodisplay-exception.patch

That patch looks to have been added because of this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=208608

That I cannot access :/

  Changed 6 years ago by mstone

  • cc dgilmore added
  • keywords 8.2.0:? added
  • next_action changed from never set to communicate
  • milestone changed from 8.1.1 (was Update1.1) to 8.2.0 (was Update.2)

Tomeu - please retest with a pygtk built without this patch. If the race still matters, then we should probably figure out how to make rainbow start after X (and before sugar).

  Changed 6 years ago by mstone

  • blocking 7395 added

(In #7395) Faster X startup means that Rainbow is always turning off its preloading ability as a result of its inability to import pygtk. We need to make all its imports succeed even if X is not running or we need to delay loading modules until X is running.
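
The second option (delaying module loading until X is running) could be approached by polling for a sign that the X server is up. This is a minimal sketch, not the fix that was shipped: wait_for_path and its parameters are hypothetical, and the usage note assumes the server for display :0 creates the conventional Unix socket /tmp/.X11-unix/X0.

```python
import os
import time


def wait_for_path(path, timeout=30.0, interval=0.25):
    """Poll until `path` exists or `timeout` seconds have elapsed.

    Returns True if the path appeared in time, False otherwise.
    Hypothetical helper, for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A daemon could call wait_for_path("/tmp/.X11-unix/X0") before importing gtk and skip preloading if it returns False. Polling is crude next to having the init system order the jobs, which is the direction the later comments take.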

  Changed 6 years ago by bobby

  • keywords 8.2.0:+ added; 8.2.0:? removed
  • owner changed from mstone to bobby
  • next_action changed from communicate to review
  • status changed from new to assigned

I've got preloading working again, using mstone's rainbow-job as a starting point. To reproduce, remove S98rainbow from /etc/rc.d/rc3.d/, /etc/rc.d/rc4.d/, and /etc/rc.d/rc5.d/

and add the (attached) rainbow file to /etc/event.d/

It would be GREAT if someone with more knowledge of upstart could look over the job. I removed the respawn line Michael suggested because it hangs X (or maybe just slows it to a crawl).

Also, I changed the keyword for this bug, because I think it's 'within reach' for 8.2 (as per the trac conventions wiki page). If I'm wrong to do this, please let me know.

Changed 6 years ago by bobby

upstart job for rainbow

  Changed 6 years ago by bobby

also, should I have left the owner as mstone? I haven't contributed on Trac much before, but assumed since I was the main person dealing with this right now, I was the 'owner'.

  Changed 6 years ago by bobby

rpms are available here:

http://dev.laptop.org/~bobbyp/rainbow/

and source is available here:

http://dev.laptop.org/git?p=users/bobbyp/security

testcase (on a machine with security disabled):

  • boot XO
  • time from when you click on chat activity to when it opens
    • (repeat a couple times to get a decent average)
  • install updated rainbow rpm from Terminal activity
    • wget dev.laptop.org/~bobbyp/rainbow/rainbow-0.7.18-1.olpc3.noarch.rpm
    • sudo rpm -Uvh rainbow-0.7.18-1.olpc3.noarch.rpm
  • reboot
  • make sure that init doesn't show a message about starting the rainbow service
  • time from when you click on chat activity to when it opens
    • should take around half as long as before
  • rejoice!

  Changed 6 years ago by erikos

@bobby: for the testcase please use the format described here: http://sugarlabs.org/go/DevelopmentTeam/CodeReview#Patch_submission

  Changed 6 years ago by bobby

|TestCase|

Power on an XO running a recent joyride with security disabled (so that there is no pretty boot). Once the home view and journal finish loading, time how long it takes from when you click on the chat activity to when it opens. Install the updated Rainbow RPM from the terminal (see below) and reboot. Make sure init does not have a message about starting rainbow. From the home view, time how long it takes to open chat - it should open about twice as fast.

To install the updated Rainbow from terminal:

wget dev.laptop.org/~bobbyp/rainbow/rainbow-0.7.18-1.olpc3.noarch.rpm

sudo rpm -Uvh rainbow-0.7.18-1.olpc3.noarch.rpm

  Changed 6 years ago by mtd

Installed the rpm. Chat started superfast (3-5s).

  Changed 6 years ago by mstone

  • next_action changed from review to add to build

Built as rainbow-0.7.18-1.fc9 and tagged into dist-olpc3. As you can see by reviewing the rainbow-0.7.18 tag in the security repo, I made small modifications to the packaging; mainly removing the superfluous rainbow initscript. Hopefully, no errors were introduced. We'll find out soon enough.

  Changed 6 years ago by mtd

Wait; I think I found a problem: every time I restart X I get a new rainbow-daemon. I only noticed this today when I was restarting X a lot (don't ask). Eventually I ran out of memory (and swap) and figured out it was because I had about 15 rainbow-daemons (I run compcache so have "swap" which allowed me to last this long).

Could it just be some upstart config issue?

Example:

-bash-3.2# ps axuwww | grep rainbow
root      547  0.0  9.0  38328 21228 ?        Ss   12:43   0:01 python /usr/sbin/rainbow-daemon --daemon
root     2141  0.0  8.9  38376 21208 ?        Ss   12:49   0:01 python /usr/sbin/rainbow-daemon --daemon
root     3537  0.0  0.2   2048   564 pts/3    S+   22:46   0:00 grep rainbow
-bash-3.2# init 3
-bash-3.2# init 5
-bash-3.2# !ps
ps axuwww | grep rainbow
root      319  0.0  0.2   2044   520 pts/3    S+   22:48   0:00 grep rainbow
root      547  0.0  9.0  38328 21228 ?        Ss   12:43   0:01 python /usr/sbin/rainbow-daemon --daemon
root     2141  0.0  8.9  38376 21208 ?        Ss   12:49   0:01 python /usr/sbin/rainbow-daemon --daemon
root     4040 20.3  8.9  38376 21208 ?        Ss   22:48   0:01 python /usr/sbin/rainbow-daemon --daemon
-bash-3.2# 

follow-up: ↓ 22   Changed 6 years ago by bobby

mtd: yes, I've been looking at this all afternoon. It happens because we're starting rainbow with --daemon. If we do this, upstart doesn't realise that rainbow started correctly and doesn't keep track of it. If we don't do this, upstart keeps track of it fine, but /sbin/start rainbow doesn't exit properly from the command line (although it doesn't seem like an issue on startup). I've built a new rpm that should stop multiple rainbows from spawning, and includes the fix for #7763:

http://dev.laptop.org/~bobbyp/rainbow/rainbow-0.7.18-2.olpc3.noarch.rpm
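
For readers unfamiliar with why a self-daemonizing process can escape its supervisor, here is a small sketch. It assumes --daemon performs a conventional double fork, which this ticket does not confirm; launch_daemonized is a hypothetical name. The PID a supervisor records is that of the first child, which exits at once, while the surviving process has a different PID.

```python
import os


def launch_daemonized():
    """Double-fork and return (supervised_pid, daemon_pid).

    supervised_pid is the PID a supervisor would have recorded;
    daemon_pid is the PID of the process that would keep running.
    Illustration only: the "daemon" here reports its PID and exits.
    """
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        # First child: the process the supervisor believes it started.
        os.close(read_fd)
        os.setsid()
        if os.fork() == 0:
            # Grandchild: the real daemon. Report our PID and finish.
            os.write(write_fd, str(os.getpid()).encode())
            os._exit(0)
        os._exit(0)  # first child exits immediately
    # Parent, playing the supervisor's role.
    os.close(write_fd)
    os.waitpid(child, 0)  # the tracked process is already gone
    with os.fdopen(read_fd) as pipe:
        daemon_pid = int(pipe.read())
    return child, daemon_pid
```

Because the two PIDs differ, a supervisor that only watches the first one has no record of the daemon, which would be consistent with a fresh rainbow-daemon appearing on every X restart.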

follow-up: ↓ 21   Changed 6 years ago by mstone

Just in case we can't figure out how to make upstart dance to our tune, Albert suggested a couple of Unixy techniques for working out the issue:

1. Have rainbow start X via 'clone(SIGKILL); exec(X)', then wait for XAUTHORITY to appear, then run normally. Rainbow will die when X does.

2. Have xinit start Rainbow (need to be careful of xinit vs. its children) and use prctl(PR_SET_PDEATHSIG, SIGKILL, ...) to die when X does.

3. Have rainbow become an X client via Xlib (libX11) which kills X clients when the X server dies.

P.S. - The current experiment is with 'stop on stopping prefdm' or 'stop on stopped prefdm' and similar.
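
Technique 2 above relies on the Linux-only prctl(2) call. The following is a rough sketch of issuing it from Python through ctypes, since the standard library has no direct wrapper. The helper names are hypothetical and this is not code from rainbow; the constants PR_SET_PDEATHSIG (1) and PR_GET_PDEATHSIG (2) are taken from linux/prctl.h.

```python
import ctypes
import os
import signal

PR_SET_PDEATHSIG = 1  # from <linux/prctl.h>
PR_GET_PDEATHSIG = 2  # from <linux/prctl.h>

# Load the C library symbols already linked into the interpreter (Linux).
_libc = ctypes.CDLL(None, use_errno=True)


def set_parent_death_signal(sig=signal.SIGKILL):
    """Ask the kernel to send `sig` to this process when its parent dies.

    Passing 0 clears the setting.
    """
    rc = _libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(sig)),
                     ctypes.c_ulong(0), ctypes.c_ulong(0), ctypes.c_ulong(0))
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def get_parent_death_signal():
    """Return the currently configured parent-death signal (0 if none)."""
    value = ctypes.c_int(0)
    rc = _libc.prctl(PR_GET_PDEATHSIG, ctypes.byref(value),
                     ctypes.c_ulong(0), ctypes.c_ulong(0), ctypes.c_ulong(0))
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return value.value
```

In the scenario described, the daemon would call set_parent_death_signal() early in startup, after being launched by xinit. The setting applies to the calling process only, so it has to be made in the daemon itself rather than in a wrapper that then forks.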

in reply to: ↑ 20   Changed 6 years ago by marco

Replying to mstone:

3. Have rainbow become an X client via Xlib (libX11) which kills X clients when the X server dies.

Shouldn't this be already the case, since you import gtk?

in reply to: ↑ 19   Changed 6 years ago by mtd

Replying to bobby:

I've built a new rpm that should stop multiple rainbows from spawning, and includes the fix for #7763: http://dev.laptop.org/~bobbyp/rainbow/rainbow-0.7.18-2.olpc3.noarch.rpm

Thanks - after rpm -Uvh'ing that and rebooting, I no longer see multiple rainbows.

rainbow doesn't restart when X does, though - is that the way it's supposed to happen?

-bash-3.2# rpm -q rainbow
rainbow-0.7.18-2.olpc3.noarch
-bash-3.2# ps axuwww  | grep rainbow
root      743  0.0  0.2   2044   520 pts/3    S+   12:12   0:00 grep rainbow
root     1267  0.5  9.7  38388 22984 ?        Ss   11:53   0:06 python /usr/sbin/rainbow-daemon
-bash-3.2# init 3
-bash-3.2# ps axuwww  | grep rainbow
root     1020  0.0  0.2   2044   524 pts/3    S+   12:13   0:00 grep rainbow
root     1267  0.5  9.7  38388 22984 ?        Ss   11:53   0:06 python /usr/sbin/rainbow-daemon
-bash-3.2# init 5
-bash-3.2# ps axuwww  | grep rainbow
root     1267  0.5  9.7  38388 22984 ?        Ss   11:53   0:06 python /usr/sbin/rainbow-daemon
root     1446  0.0  0.2   2048   564 pts/3    S+   12:14   0:00 grep rainbow
-bash-3.2#

PS - I'm running olpc-dm with cscott's/my patch to #5705, just in case that makes a difference.

  Changed 6 years ago by bobby

http://koji.fedoraproject.org/koji/taskinfo?taskID=756196

has rpms where rainbow restarts along with X, but I am not sure that the uninstall works correctly (those 0.7.18-2 ones don't uninstall).

  Changed 6 years ago by mstone

  • status changed from assigned to closed
  • next_action changed from add to build to no action
  • resolution set to fixed

No further action is needed on this ticket at this time.
