Ticket #5494 (closed defect: fixed)

Opened 7 years ago

Last modified 7 years ago

Uruguay laptops fail to launch activities; datastore corruption

Reported by: kimquirk
Owned by: tomeu
Priority: blocker
Milestone: Update.1
Component: sugar-datastore
Version:
Keywords:
Cc: jg, krstic
Action Needed:
Verified: no
Deployments affected:
Blocked By:
Blocking:

Description

This was reported by Ivan from the technical people in Uruguay. About 200 laptops that had been working fine stopped being able to launch activities, and when users clicked on the journal, there was nothing in it.

This problem can only be resolved by re-imaging the OS.

Change History

  Changed 7 years ago by kimquirk

Ivan found the problem (although not the trigger) and has created a patch for Uruguay. This same patch needs to get into a build and out to all laptops.

  Changed 7 years ago by krstic

Note there are two patches: one to the datastore, one to sugar.

  Changed 7 years ago by krstic

Note also that reimaging was used to work around this until my patches became available. UY will be pushing out those patches to existing machines without causing an upgrade or a reflash, the end state being 642+fixes, which doesn't exist as a real build.

  Changed 7 years ago by tomeu

  • cc krstic added

We have both fixes in a ship.2 build. Should we close this now or after some testing is done?

http://xs-dev.laptop.org/~cscott/olpc/streams/ship.2/build651/

  Changed 7 years ago by marco

We should make sure the fixes are in Update.1 before closing.

  Changed 7 years ago by kimquirk

Here is the testing that Ivan outlined:

  • create journal items
  • open a console:

echo -n > /home/olpc/.sugar/default/datastore/store/index/config

  • This should cause serious problems for the datastore
  • reboot sugar
  • open journal and see that the items are still there

Another form of corruption can be created with:

dd if=/dev/random bs=1M of=/home/olpc/.sugar/default/datastore/store/index/config count=1

  • reboot sugar
  • in this case the journal should have nothing in it; but activities can load and you can continue. The corrupt datastore has been moved out of the way.
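
The two kinds of damage described above can also be produced without touching a real journal. The following is a minimal sketch, not part of the patches: it builds a stand-in directory tree under a temporary directory and then damages the config file in roughly the same two ways as the echo -n and dd commands. Only the store/index/config layout comes from the path in the test steps; the function names and the placeholder file contents are hypothetical.

```python
import os
import tempfile

# Relative layout copied from the path used in the test steps.
CONFIG_RELPATH = os.path.join("store", "index", "config")


def make_fake_datastore(root):
    """Create a stand-in datastore tree with a small, made-up config file."""
    config = os.path.join(root, CONFIG_RELPATH)
    os.makedirs(os.path.dirname(config))
    with open(config, "w") as f:
        f.write("placeholder index config\n")  # not the real file format
    return config


def truncate_config(config):
    """Like `echo -n > config`: leave a zero-length file behind."""
    with open(config, "w"):
        pass


def randomize_config(config, size=1024 * 1024):
    """Like the dd command: replace the file with `size` random bytes."""
    with open(config, "wb") as f:
        f.write(os.urandom(size))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        config = make_fake_datastore(root)
        truncate_config(config)
        print("after truncate:", os.path.getsize(config), "bytes")
        randomize_config(config)
        print("after random overwrite:", os.path.getsize(config), "bytes")
```

Pointing these helpers at a copy of a real datastore directory, rather than the live one, would allow the restart behaviour to be checked repeatedly without re-imaging.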

  Changed 7 years ago by kimquirk

Note the beginning of the dd line:

dd if=/dev/random bs...

I created this corruption on a 650 laptop; neither the journal nor activities would load. I upgraded (olpc-update) to 653, and the journal still does not load.

follow-up: ↓ 9   Changed 7 years ago by marco

I can confirm. Journal starts working again only if you restart Sugar another time.

I haven't investigated this deeply but it seems like the second mount (after the move) tries to reuse a dead connection (since the corrupted config made the datastore crash).

in reply to: ↑ 8   Changed 7 years ago by tomeu

Replying to marco:

I haven't investigated this deeply but it seems like the second mount (after the move) tries to reuse a dead connection (since the corrupted config made the datastore crash).

I still have to look into this, but the corrupted config shouldn't make the datastore crash, just raise an exception and abort the Mount() call. It seems more probable to me that something is not cleaned up after the exception.
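
To make the hypothesis above concrete, here is a small sketch of a mount call that raises on a corrupt config and releases what it had already acquired before propagating the error. It is an illustration only: FakeBackend, CorruptIndexError, mount() and the "version=" validity rule are all hypothetical and are not taken from the datastore code.

```python
class CorruptIndexError(Exception):
    """Raised when the (hypothetical) index config cannot be parsed."""


class FakeBackend:
    """Stand-in for whatever resources a mount holds open."""

    def __init__(self, config_text):
        self.config_text = config_text
        self.connected = False

    def connect(self):
        self.connected = True

    def load_index(self):
        # Made-up validity rule: a usable config starts with "version=".
        if not self.config_text.startswith("version="):
            raise CorruptIndexError("index config is unreadable")

    def close(self):
        self.connected = False


def mount(backend, mounted):
    """Mount a backend, releasing it again if loading the index fails."""
    backend.connect()
    try:
        backend.load_index()
    except CorruptIndexError:
        # Without this cleanup, a later mount could find a stale connection.
        backend.close()
        raise
    mounted.append(backend)
    return backend
```

If the close() call in the except branch were missing, the failed backend would stay marked as connected, which is the kind of leftover state that could confuse a second mount attempt.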

  Changed 7 years ago by krstic

  • status changed from new to closed
  • resolution set to fixed

Not fixing the situation without a Sugar restart is expected behavior for the patch and should not be fixed. Closing this bug, as the patches are now a part of ship.2-653.
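
The outcomes the test steps expect after a restart (items kept when the config is merely empty, a damaged store moved out of the way when the config is garbage) can be pictured roughly as follows. This is a sketch under assumptions, not the code in the patches: the "version=" marker, the default config contents, the .corrupt-<timestamp> naming and both function names are invented for illustration.

```python
import os
import time

VALID_PREFIX = b"version="      # made-up marker; the real config format is not shown here
DEFAULT_CONFIG = b"version=1\n"  # made-up default contents


def classify_config(config_path):
    """Return 'ok', 'empty' or 'garbage' for the config file (hypothetical rules)."""
    try:
        with open(config_path, "rb") as f:
            head = f.read(len(VALID_PREFIX))
    except OSError:
        return "empty"  # a missing file is treated like an empty one
    if head == VALID_PREFIX:
        return "ok"
    return "empty" if head == b"" else "garbage"


def prepare_store(store_dir):
    """Make store_dir usable; return the path of the set-aside copy, or None."""
    config_path = os.path.join(store_dir, "index", "config")
    aside = None
    if os.path.isdir(store_dir) and classify_config(config_path) == "garbage":
        aside = "%s.corrupt-%d" % (store_dir, int(time.time()))
        os.rename(store_dir, aside)  # keep the damaged data for later inspection
    if classify_config(config_path) != "ok":
        # Either the config was empty or the whole store was just moved away.
        os.makedirs(os.path.dirname(config_path), exist_ok=True)
        with open(config_path, "wb") as f:
            f.write(DEFAULT_CONFIG)
    return aside
```

Because the check runs once at start-up in this sketch, a store that becomes corrupt while the session is running is only dealt with at the next start, which matches the restart requirement described above.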

follow-up: ↓ 12   Changed 7 years ago by tomeu

From this report:

http://lists.laptop.org/pipermail/devel/2008-January/009313.html

looks like the config file is left empty when force-killing the datastore process. In the field I expect this to be caused by the laptop running out of battery.
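
As an illustration of how a forced kill or power loss could leave a file empty (offered only as one possible mechanism, not a confirmed cause): opening a file for writing truncates it immediately, so a process that dies between the open and the write leaves zero bytes behind. The sketch below measures that window and then shows the common write-to-a-temporary-file-and-rename pattern that avoids it. All names are hypothetical; nothing here is taken from the datastore code.

```python
import os
import tempfile


def size_during_rewrite(path, data):
    """Rewrite path in place and report its size between open() and write()."""
    with open(path, "wb") as f:  # the file is truncated right here
        size_in_window = os.path.getsize(path)
        f.write(data)
    return size_in_window


def atomic_rewrite(path, data):
    """Write data to a temporary file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the new bytes to disk before the rename
        os.replace(tmp_path, path)  # the old contents stay in place until this call
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With the in-place rewrite, a kill inside the window leaves an empty file; with the rename approach, an interrupted write leaves the previous contents untouched and at worst a stray temporary file.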

in reply to: ↑ 11   Changed 7 years ago by krstic

Replying to tomeu:

looks like the config file is left empty when force-killing the datastore process. In the field I expect this to be caused by the laptop running out of battery.

This was my first suspicion, and the first thing I tested when originally debugging the problem. Neither a test of force-killing the DS nor a code inspection allowed me to reproduce it, though. Did you have different results? I'd love to track this down.
