Ticket #801 (closed defect: fixed)

Opened 22 months ago

Last modified 20 months ago

Sugar is crashing repetitively.

Reported by: jg
Owned by: J5
Priority: high
Milestone: Trial-1
Component: distro
Version:
Keywords:
Cc: dcbw@…
Action Needed:
Verified: yes
Deployments affected:
Blocked By:
Blocking:

Description

On a PreB test machine, using build 239, Sugar is terminating repeatedly before doing anything useful.

I reinstalled 239 on this hardware and it does the same thing: I did get as far as being asked for my nickname, and then it started the same cyclical crashing.

Attachments

presenceservice.log (0.8 kB) - added by jg 22 months ago.
shell.log (2.1 kB) - added by jg 22 months ago.
Xorg.0.log (28.0 kB) - added by jg 22 months ago.
avahi.txt (2.2 kB) - added by blizzard 20 months ago.
stack trace of avahi abort on startup

Change History

Changed 22 months ago by jg

  • priority changed from normal to high

Note this appears different than #793; I checked for the existence of /var/lib/stateless/writeable and it existed.

Also note that the wireless module might be bricked, as the board is one that Dave has had his sticky fingers on during the wireless update problems.

Changed 22 months ago by cjb

Also note that the wireless module might be bricked

That sounds likely; the error is as if avahi isn't running, and perhaps avahi refuses to run when there's no interface for it to bind to. Can you check whether the wireless is bricked using iwconfig/dhclient?

Changed 22 months ago by jg

Avahi is running: it says:

avahi-daemon: registering [localhost.local] avahi-daemon: chroot-helper

Changed 22 months ago by marco

I think avahi handles this fine. We used to have problems in sugar if no network interface was present, though; I thought they were fixed, but it's possible they regressed. It would be good to check whether the wireless is actually bricked. It would also be good to attach the logs from ~/.sugar/default/logs.

Changed 22 months ago by jg

I added all the logs I could lay my hands on, and attached them.

Did I miss one? If so, what do I need to get you?

Changed 22 months ago by marco

Oh sorry, I missed the logs at the top!

It looks like the presence service is failing to start. It's not clear why from the logs. Dan might know, will talk with him.

Changed 22 months ago by dcbw

Can you reinstall 239 and re-verify? We've never seen this in testing at Cambridge...

Changed 22 months ago by cjb

Jim, could you also check that you aren't out of disk space on /? I've just seen this failure mode on a machine that didn't have space to write sugar's log files.
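For anyone who wants to script the disk-space check, a minimal sketch follows. It assumes a Python 3 interpreter is available (shutil.disk_usage needs 3.3 or later); the function name and the 5 MiB threshold are made up for illustration and are not taken from Sugar.

```python
import shutil


def free_space_report(path="/", min_free_bytes=5 * 1024 * 1024):
    """Return (free_bytes, ok) for the filesystem that contains path.

    min_free_bytes is an arbitrary illustrative threshold, not a value
    that Sugar itself uses.
    """
    usage = shutil.disk_usage(path)
    return usage.free, usage.free >= min_free_bytes


if __name__ == "__main__":
    free, ok = free_space_report("/")
    status = "ok" if ok else "LOW"
    print(f"free on /: {free // 1024} KiB ({status})")
```

Running `df -h /` by hand gives the same information.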

Changed 21 months ago by jg

  • milestone changed from BTest-2 to BTest-3

We've seen this on several systems. Reinstalls may fix it.

Next time we see it, please put the machine aside so it can be debugged.

Changed 21 months ago by wad

It is not hard to find a machine that is exhibiting this behavior; I see it on about 20% of first boots.

The situation is this:

A laptop is unwrapped from Quanta and powered up.

It prompts for the nickname, and the user enters one.

It then goes into a cycle of X going away (login prompt displayed), X starting up, then crashing again.

A reboot or power cycle at this point clears the problem up (it does not appear on subsequent boots.)

Deleting the /home/olpc/.sugar/default/config file and restarting may trigger the bug again (even on machines that didn't exhibit the behavior on first boot!)

Changed 21 months ago by wad

Responding to earlier comments:

This is not related to disk space; there is plenty.

This is also not related to network interfaces not coming up, as that is the bug I'm trying to reproduce... ifconfig shows the wireless interface up and DHCP'd.

Changed 20 months ago by jg

Do we still see this in 303 or later?

Changed 20 months ago by marco

  • milestone changed from BTest-3 to Trial-1

Changed 20 months ago by marco

Maybe we should just disable the presence service. It's not like the mesh view is doing anything useful at the moment. It would also save a bit of memory.

Dan? I can just rip it out tomorrow morning if that works.

Changed 20 months ago by jg

Chris Ball and I have seen what appears to be the same problem (sugar dying before doing anything) on builds 340 and 343. Reboot seems to clear it.

Changed 20 months ago by cjb

The problem is that avahi is exiting (with exit code 0) before sugar comes up, and we don't know why. Reboot *sometimes* works; my first two boots of 343 failed in this way.

Changed 20 months ago by marco

I disabled the presence service in git. It was just wasting memory anyway... I'll get this on the next image.

Changed 20 months ago by dcbw

  • owner changed from marco to J5

Sounds good; we should have pilgrim 'chkconfig --level 123456 avahi-daemon off' as well.

Changed 20 months ago by jg

OK, please open an Avahi bug, both here and upstream in Avahi's system. Mark it blocker for B3.

Also, make sure Trent Lloyd and Lennart Poettering know we're having serious trouble. I know for sure Trent has a B1 system (he got it on stage at LCA) and we can call in a chip or two. I know we've been intending to get Lennart a system if he's interested too.

Changed 20 months ago by blizzard

stack trace of avahi abort on startup

Changed 20 months ago by J5

OK, looking at the backtrace and sources, it looks like this should be fairly easy to track down, since g->register_time_event (the variable we are asserting on) is only ever used in avahi-core/entry.c.

This assumes we aren't getting stack corruption somewhere, with the assert being just a side effect.

Looking at the code shows that we only assign register_time_event a value at the point directly after we assert, and everywhere else we free it and NULL it out. This means we are calling avahi_s_entry_group_commit a second time before it can be cleared. Throwing in a custom avahi with print statements where we create and remove the register_time_event will most likely give us a good idea of why we are asserting.
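To make the reasoning above concrete, here is a toy Python model of the pattern as described: assert that the event is unset, then set it, and clear it everywhere else. This is not avahi's code; the class and method names are hypothetical, and None stands in for NULL.

```python
class EntryGroup:
    """Toy stand-in for the entry group described above (not avahi's type)."""

    def __init__(self):
        self.register_time_event = None  # None plays the role of NULL

    def commit(self):
        # Same order as described in the comment: check first, then assign.
        if self.register_time_event is not None:
            raise AssertionError("commit called while an event is pending")
        self.register_time_event = object()

    def event_fired(self):
        # Every other code path frees the event and resets it to NULL.
        self.register_time_event = None


def double_commit_trips_assert():
    """Return True if a second commit before clearing raises the assertion."""
    group = EntryGroup()
    group.commit()
    try:
        group.commit()  # second commit before the event was cleared
    except AssertionError:
        return True
    return False
```

In this model, a commit, a clear, and another commit succeed, while two commits in a row fail. That matches the theory that the second commit arrives before the event has been cleared.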

Changed 20 months ago by lathiat

I believe the nature of the problem here is that sugar is registering services while avahi is in the COLLISION state. This should not happen: you should always wait until Avahi is in the RUNNING state, and if it is not, you should queue your object creation until it reaches that state.
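A rough sketch of the queueing idea, in Python since that is what Sugar is written in. It does not use the real Avahi D-Bus API: the class, the method names, and the register callable are hypothetical, and only the RUNNING and COLLISION state names come from the description above.

```python
class DeferredRegistrar:
    """Holds service registrations back until the server reports RUNNING."""

    RUNNING = "RUNNING"
    COLLISION = "COLLISION"

    def __init__(self, register):
        self._register = register  # callable that does the real registration
        self._state = None         # unknown until the first state change
        self._pending = []         # names queued while not RUNNING

    def add_service(self, name):
        if self._state == self.RUNNING:
            self._register(name)
        else:
            self._pending.append(name)

    def on_state_changed(self, state):
        self._state = state
        if state == self.RUNNING:
            # Flush everything queued while the server was not ready.
            pending, self._pending = self._pending, []
            for name in pending:
                self._register(name)
```

The caller would hook on_state_changed up to whatever state-change notification the server provides, so that nothing is registered during COLLISION.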

This only happens on startup because it's in the middle of trying to resolve conflicts on the 'localhost' hostname.

Of course, secondary to this, there is a bug in Avahi where it trips an assert() in this case; this shouldn't happen and will also be fixed.

I had a look at the Sugar code and based on the way it's done it seems relatively trivial to fix this, so I plan to cook up a patch for it tonight when I'm back from work.

Changed 20 months ago by lathiat

It turns out the above was not true: while that was broken, this issue was unrelated to it.

Fix for avahi here: http://www.avahi.org/changeset/1400

Changed 20 months ago by marco

  • status changed from new to closed
  • resolution set to fixed

The patch is in and I think dcbw tested it.
