Ticket #3978 (closed defect: fixed)

Opened 7 years ago

Last modified 6 years ago

Journal getting blank

Reported by: amitgogna Owned by: AlexL
Priority: blocker Milestone: Update.1
Component: sugar-datastore Version: Development build as of this date
Keywords: Journal Blank Out Cc: carla@…, tomeu, bcsaller, jg, kimquirk, arjs
Action Needed: Verified: no
Deployments affected: Blocked By:
Blocking:

Description

The Journal was seen getting blanked out on some XOs, tried on the following builds:

Builds: 608, 601

Firmware: Q2C26

However, the Journal entries could still be seen when inserting an image from the Write activity.

Attachments

Journal_blank.JPG (46.5 kB) - added by Carla 7 years ago.
blank Journal
Inserting_Abiword.JPG (52.5 kB) - added by Carla 7 years ago.
Inserting Abiword
datastore.log.zip (10.9 kB) - added by Carla 7 years ago.
sugar.tgz (17.7 MB) - added by Carla 7 years ago.

Change History

  Changed 7 years ago by jg

  • version changed from BTest-4 Hardware to Development build as of this date

What do you mean "blanked out"?

  Changed 7 years ago by Carla

Hi, today we suddenly got 4 XOs with the blank Journal; we still haven't found out how it happens. I'm attaching two photos: 1) the Journal (totally blank), and 2) the same XO inserting an image from Abiword, showing that there are files on the XO. We have been using them at Khairat school. Three of them have the 601 image and mine has 608.

Changed 7 years ago by Carla

blank Journal

Changed 7 years ago by Carla

Inserting Abiword

  Changed 7 years ago by jg

  • cc tomeu, bcsaller added
  • owner changed from wad to benzea
  • priority changed from high to blocker
  • verified unset
  • milestone changed from Never Assigned to Trial-3

  Changed 7 years ago by jg

  • cc jg, kimquirk added

follow-up: ↓ 9   Changed 7 years ago by tomeu

  • owner changed from benzea to tomeu

Carla, can you attach the datastore and journal logs?

  Changed 7 years ago by jg

  • status changed from new to closed
  • resolution set to duplicate

This looks like a duplicate of #3372.

  Changed 7 years ago by marco

Doesn't look like a dup to me. I think either the DS is failing or backtracing on the query. With the logs it should be easy to figure out what is going on.

  Changed 7 years ago by marco

  • status changed from closed to reopened
  • resolution deleted

in reply to: ↑ 5   Changed 7 years ago by Carla

I'm attaching all the related logs that I found. (The ones that start with t0 were saved before the whole second batch.)

Changed 7 years ago by Carla

follow-up: ↓ 11   Changed 7 years ago by tomeu

Carla, I suspect you are hitting #3876.

Could you please try again without any usb stick plugged in?

in reply to: ↑ 10   Changed 7 years ago by Carla

Replying to tomeu:

Carla, I suspect you are hitting #3876. Could you please try again without any usb stick plugged in?

I don't have wireless Internet access to send you the files from the XO. So, after turning the XO on with no USB sticks plugged into it, I mounted the USB stick to copy the log files and unmounted it afterwards.

I was re-doing the whole process again, and after unmounting the USB key it showed me the bash prompt (as if it was unmounted), then I got error messages that it was not unmounted properly. I turned it off and on again and voila! the Journal works,

[I can say that you are really good! now you are fixing remote bugs through telepathy]

but I don't know how to replicate the solution in the other XOs that have the blank Journal.

follow-up: ↓ 14   Changed 7 years ago by Carla

Bad news: after restarting it several times, I got the blank Journal again. To copy the files I plugged in the USB again, got the same error messages, the files on the USB are corrupted, and the Journal remained blank.

follow-up: ↓ 17   Changed 7 years ago by tomeu

I don't know yet what's going on, but please test the following: on a machine where this is still happening, click somewhere on the blank area and see if the screen shows the detailed view of the entry.

If it doesn't happen on the first attempt, try several times in other parts of the white area.

in reply to: ↑ 12 ; follow-up: ↓ 18   Changed 7 years ago by tomeu

Replying to Carla:

Bad news: after restarting it several times, I got the blank Journal again. To copy the files I plugged in the USB again, got the same error messages, the files on the USB are corrupted, and the Journal remained blank.

What do you mean when you say that the files at the usb are corrupted?

follow-up: ↓ 19   Changed 7 years ago by marco

Carla, are these B4 or B2? How much memory do they have?

The /home/olpc/Journal stuff you have on one of them must come from a very early image. If they were B2, the problem might be explained by low memory...

  Changed 7 years ago by tomeu

Carla, as we are quite clueless about what's going on there, I'm going to ask you to dump the jffs2 image of one of the laptops that shows the problem.

You can do so by following the instructions here: http://wiki.laptop.org/go/Customizing_NAND_images

This will create a quite big file that we'll be able to flash onto one of our laptops to inspect what's happening there.

When the file has been created on a USB stick, please upload it somewhere and add the link here (or send it by email).

Ask in #olpc or #sugar if you need anything.

in reply to: ↑ 13   Changed 7 years ago by Carla

Replying to tomeu:

I don't know yet what's going on, but please test the following: on a machine where this is still happening, click somewhere on the blank area and see if the screen shows the detailed view of the entry. If it doesn't happen on the first attempt, try several times in other parts of the white area.

I clicked all over and nothing happened.

in reply to: ↑ 14   Changed 7 years ago by Carla

Replying to tomeu:

What do you mean when you say that the files at the usb are corrupted?

I tried to open them on my machine, as I did with the previous ones I sent you, and I got messages saying that the file was corrupted and wouldn't open.

in reply to: ↑ 15 ; follow-up: ↓ 22   Changed 7 years ago by Carla

Replying to marco:

Carla, are these B4 or B2? How much memory do they have? The /home/olpc/Journal stuff you have on one of them must come from a very early image. If they were B2, the problem might be explained by low memory...

As Amit mentioned when he opened the ticket, they are all B4's (CL1) with Q2C26 firmware. Mine had 608, the other three 601. Mine had like 4 files total, and the children had been with the laptops for two and a half days by the time we started getting blank Journals. 256 MiB of memory.

I don't think that the memory is full. In any case, we should not get blank Journals when there's no more space on the laptop.

follow-up: ↓ 21   Changed 7 years ago by marco

Looks like this piece of code is taking between 25 and 50 seconds to complete:

  ds = DataStore()
  ds.registerBackend(backingstore.FileBackingStore)
  ds.registerBackend(backingstore.InplaceFileBackingStore)
  ds.mount(repo_dir)
  ds.complete_indexing()

Now I don't know if this has to be blamed on the DS, on a lower-level software issue, or even on the hardware. I think this is usually immediate.

The DS should not be doing anything that can take time before entering the main loop, but I still think we need to track down why it's taking so long; there might be a serious issue somewhere.
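
For reference, a minimal timing sketch of the block above (the import paths and the repo_dir value are assumptions based on the sugar-datastore layout discussed in this ticket; adjust them to the installed package):

  # Time the datastore startup sequence quoted above.
  # Assumption: the olpc.datastore module paths and repo_dir below follow
  # the layout mentioned in this ticket; adjust as needed.
  import time
  from olpc.datastore.datastore import DataStore
  from olpc.datastore import backingstore

  repo_dir = '/home/olpc/.sugar/default/datastore'  # assumed mount point

  start = time.time()
  ds = DataStore()
  ds.registerBackend(backingstore.FileBackingStore)
  ds.registerBackend(backingstore.InplaceFileBackingStore)
  ds.mount(repo_dir)
  ds.complete_indexing()
  print('startup + indexing took %.1f seconds' % (time.time() - start))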

in reply to: ↑ 20   Changed 7 years ago by Carla

I'll be sending the nand.img as soon as it finishes copying.

in reply to: ↑ 19   Changed 7 years ago by marco

Replying to Carla:

I don't think that the memory is full. In any case, we should not get blank Journals when there's no more space on the laptop.

Sure, I was asking for that information because it is useful for tracking down the issue, not because I thought it would be the right behavior.

  Changed 7 years ago by marco

The time the block I pasted above takes seems to depend on the number of entries in the datastore: ~0.1 seconds for 0 entries, 1.8 seconds for 160. I'm not sure if it grows linearly, but if it does we have a problem there, independently of what caused Carla's problem (HoboPrimate's issue might have been caused just by having a bunch of entries in the DS, for example).

follow-up: ↓ 25   Changed 7 years ago by tomeu

<carlagm> tomeu: it finished copying it but it may not be useful ....after a long time saving I got "Dumping to disk:\nand.img" "7ffc0 -" "Done" "Flushbuf error" "ok" I turned off after and check the files on the usb the NAND.IMG is 0 KB!!! the other files under OLPC~1.STO are in total 2.5KB....what should I do?

Yeah, I think that the USB stick is full. You would need to restart the process using a USB stick with more than 512 MB of free space (perhaps much more; I don't know the details of the process).

Changed 7 years ago by Carla

in reply to: ↑ 24   Changed 7 years ago by Carla

Please let me know if the .sugar is enough or if you need us to try the ssh for the nand.img tomorrow morning (Mumbai time)...

  Changed 7 years ago by marco

The .sugar seems to work fine on my XO.

Yeah, we will need the nand.img. If even that can't reproduce the issue, we are going to be in trouble.

  Changed 7 years ago by Carla

OK, in about 9-10 hours from now I'll connect again to the IRC channel... but if there's a procedure that we have to follow, just let us know in advance... talk to you soon.

  Changed 7 years ago by marco

Image is here: https://dev.laptop.org/~marco/nand.img

Carla, I reproduced the issue. As a temporary workaround you can press "ctrl+alt+del" once the journal has started. Sugar will restart and the journal will work.

  Changed 7 years ago by marco

  • component changed from journal-activity to kernel

It looks like a jffs2-related issue. For approximately 25 seconds after boot it's not possible to access the /home/olpc/.sugar/default/datastore/store/index directory; even just running ls on that dir blocks.

The datastore will need to handle this situation better in the future, but something seems wrong at the kernel level, so I'm reassigning it. I have a machine with the image installed if you want me to test stuff or provide further information.

If the filesystem behavior is expected, or if you think it would be safer to fix the datastore to deal with the situation better for Trial-3, please reassign back to me.
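
For anyone trying to confirm the same symptom, a minimal sketch that times the first access to the index directory right after boot (it assumes the path quoted in the comment above):

  # Time the first listing of the datastore index directory after boot;
  # on an affected machine this call blocked for roughly 25 seconds.
  import os
  import time

  index_dir = '/home/olpc/.sugar/default/datastore/store/index'

  start = time.time()
  entries = os.listdir(index_dir)
  print('%d entries, first listdir took %.1f seconds'
        % (len(entries), time.time() - start))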

  Changed 7 years ago by jg

  • owner changed from tomeu to dwmw2
  • status changed from reopened to new

follow-ups: ↓ 32 ↓ 33   Changed 7 years ago by dwmw2

It looks like these database files are triggering pessimal behaviour from JFFS2. I'll see what I can do to improve it, but I think it might be a lot better if we can make better use of _separate_ files for data storage, rather than large files which get a lot of internal 'churn'.

Is there a description of the storage format somewhere?

in reply to: ↑ 31   Changed 7 years ago by tomeu

Replying to dwmw2:

It looks like these database files are triggering pessimal behaviour from JFFS2. I'll see what I can do to improve it, but I think it might be a lot better if we can make better use of _separate_ files for data storage, rather than large files which get a lot of internal 'churn'.

Right. We were doing some very stupid things and we have just pushed fixes for them. During this weekend we'll test those. If performance is good enough for Trial3 (we hope so), we can start from there in order to improve performance further for FRS and later.

The data that is indexed, searched, and can be used for filtering and sorting should be in those btree files. We are also storing the previews for each entry, but as payload-only data; those are stored in the record.DB file. We could store those in separate files, but we would prefer to do so only if there is a measurable improvement.

Is there a description of the storage format somewhere?

http://www.xapian.org/docs/admin_notes.html http://wiki.xapian.org/FlintBackend
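
For context, a small sketch (assuming the python-xapian bindings are installed and that the index lives in the store/index directory mentioned earlier in this ticket) that opens the flint database read-only and reports how many documents it holds:

  # Open the journal's Xapian index read-only and print basic statistics.
  # The index path is an assumption taken from elsewhere in this ticket.
  import xapian

  index_dir = '/home/olpc/.sugar/default/datastore/store/index'

  db = xapian.Database(index_dir)   # read-only open
  print('documents indexed: %d' % db.get_doccount())
  print('last docid:        %d' % db.get_lastdocid())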

in reply to: ↑ 31 ; follow-up: ↓ 34   Changed 7 years ago by tomeu

Replying to dwmw2:

One more thing: on those machines where the jffs2 gc thread uses all the cpu for 25 seconds, rebooting after the thread has returned to sleep won't make that issue go away.

After rebooting the gc thread will get busy for the same amount of time. I thought that once the thread has finished gc'ing, the next boot would be faster. Is that right?

in reply to: ↑ 33   Changed 7 years ago by dwmw2

Replying to tomeu:

Replying to dwmw2: One more thing: on those machines where the jffs2 gc thread uses all the cpu for 25 seconds, rebooting after the thread has returned to sleep won't make that issue go away. After rebooting the gc thread will get busy for the same amount of time. I thought that once the thread has finished gc'ing, the next boot would be faster. Is that right?

When the GC thread runs after boot, it's not actually performing garbage collection. It's just finishing the mount process, by checking the CRC on all data nodes and building the full 'fragment tree' for each file so that we know for sure which nodes are valid, and which are obsolete. Once upon a time we used to do this _during_ the mount, and the mount itself was much slower -- but we realised we could defer it and do it in the background after mounting, while read-only operation (and even writes, if there's enough actually free space) can continue before it's finished.

On NAND flash, it's not possible to 'scribble' on obsolete nodes as it used to be on NOR flash. So when we first open a file, we have to consider each of the old nodes. Once the file has been opened once (by actual userspace activity or by the GC thread touching it in its scan after mount), we'll _remember_ that most of the nodes are obsolete -- and the _next_ time the file is opened it'll be a lot quicker.

One of the options for improving JFFS2 behaviour in this situation is to make it more aggressive about garbage-collecting the eraseblocks containing the offending obsolete nodes. Currently, we tend to pick eraseblocks with the most obsolete space -- but perhaps the _number_ of nodes in the eraseblock should also be a criterion.

  Changed 7 years ago by dwmw2

In the offending image, the file 'postlist.DB' has 189092 nodes (log entries), of which 184065 are obsolete. When opening it for the first time, we spend quite a lot of time working out that in fact we can throw away most of the nodes.

  Changed 7 years ago by tomeu

Dave, how can we measure how well we are playing with jffs2?

By measuring the number of obsolete blocks for a given file? Perhaps other metrics? How can we gather them?

  Changed 7 years ago by Carla

  • cc arjs added

  Changed 7 years ago by dwmw2

The only way to gather such metrics at the moment is to hack the kernel code to emit them. I'll see if I can come up with something better. It would probably be too intrusive to make the kernel report such information -- but a userspace tool which reads the flash directly through the /dev/mtd0 device might be quite feasible. I'll take a look.
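
As a rough illustration of what such a userspace tool could look like, here is a sketch (not dwmw2's tool) that simply counts candidate JFFS2 node headers in a raw read of /dev/mtd0; it assumes the standard node magic 0x1985 and makes no attempt to distinguish valid from obsolete nodes:

  # Count candidate JFFS2 node headers by scanning the raw flash device.
  # A real tool would parse full headers and CRCs to classify valid vs
  # obsolete nodes; this only looks for the 0x1985 magic at 4-byte alignment.
  import struct

  MTD_DEVICE = '/dev/mtd0'   # assumption: the NAND partition with the root fs
  JFFS2_MAGIC = 0x1985

  count = 0
  with open(MTD_DEVICE, 'rb') as flash:
      while True:
          chunk = flash.read(1024 * 1024)
          if not chunk:
              break
          for offset in range(0, len(chunk) - 1, 4):
              if struct.unpack_from('<H', chunk, offset)[0] == JFFS2_MAGIC:
                  count += 1
  print('candidate JFFS2 node headers: %d' % count)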

  Changed 7 years ago by dwmw2

This appears to be mostly fixed by the commit at http://git.infradead.org/?p=mtd-2.6.git;a=commitdiff;h=8fb870df5a1f261294b833dd807bcba3bacface6

It triggers garbage collection not only when we actually need the space, but also when there are a large number of flash eraseblocks which contain a _majority_ of obsolete data.

With regard to metrics -- on the whole, I'd prefer that userspace didn't _have_ to care. As a general-purpose file system, JFFS2 should cope relatively gracefully with just about anything you care to throw at it.

  Changed 7 years ago by marco

In #4073 this got a lot worse.

  Changed 7 years ago by dilinger

This fix made it into build 616; can someone please verify that it's fixed for them?

  Changed 6 years ago by kimquirk

  • component changed from kernel to datastore
  • milestone changed from Update.2 to Update.1

  Changed 6 years ago by kimquirk

  • owner changed from dwmw2 to marco

I believe this was fixed by marco, tomeu, ivan... marco - can you confirm? Is this the bug that occurs when there are many, many items in the journal?

If you think this is fixed and should get some specific testing, can you put in some notes about testing and re-assign to AlexL?

follow-up: ↓ 46   Changed 6 years ago by marco

  • owner changed from marco to AlexL

The datastore is now being a lot nicer to the filesystem, so we think it should not trigger the jffs2 problem anymore. Also, I think Dave's jffs2 patch landed in the kernel. Finally, if the problem were to appear again it would not cause a blank journal but just slow down sugar startup a lot.

So yeah, we think this is fixed. It would be good to get some testing but I'm not sure how to try to reproduce it. This was reproduced consistently in the field, with kids apparently using the machine and the journal intensively (they had ~200 items in the journal after a couple days of use).

Alex, you could try something like this:

  • Get the latest joyride build and flash it with no backups.
  • Time sugar startup (from X startup to when the journal icon stops blinking).
  • Open a bunch of activities.
  • Take a lot of photos in Record.
  • When you have at least 200-300 entries in the journal (you can count files in the DS dir; see the sketch below), do a reboot.
  • See if sugar startup has slowed down consistently (it was ~25 seconds slower when the problem was reproduced in the field).
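
A small sketch for the entry-counting step above; the store path and the assumption that each entry sits as its own file or directory under store/ next to index/ are guesses based on this ticket, so adjust to the layout of the build under test:

  # Count datastore entries by listing the store directory, skipping the
  # index directory and the .DB files. Layout assumptions noted above.
  import os

  store_dir = '/home/olpc/.sugar/default/datastore/store'

  entries = [name for name in os.listdir(store_dir)
             if name != 'index' and not name.endswith('.DB')]
  print('datastore entries: %d' % len(entries))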

  Changed 6 years ago by arjs

When this was happening in the field there was a mix of entries, with Etoys projects dominating and Record entries being the next most numerous; I don't know how helpful that'd be.

The first observable effects of the Journal going blank were seen a few days after initially flashing the image; perhaps that is when the datastore had reached a few hundred entries.

in reply to: ↑ 44   Changed 6 years ago by AlexL

Replying to marco:

The datastore is now being a lot nicer to the filesystem, so we think it should not trigger the jffs2 problem anymore. Also, I think Dave's jffs2 patch landed in the kernel. Finally, if the problem were to appear again it would not cause a blank journal but just slow down sugar startup a lot. So yeah, we think this is fixed. It would be good to get some testing but I'm not sure how to try to reproduce it. This was reproduced consistently in the field, with kids apparently using the machine and the journal intensively (they had ~200 items in the journal after a couple days of use). Alex, you could try something like this:

  • Get the latest joyride build and flash it with no backups.
  • Time sugar startup (from X startup to when the journal icon stops blinking).
  • Open a bunch of activities.
  • Take a lot of photos in Record.
  • When you have at least 200-300 entries in the journal (you can count files in the DS dir), do a reboot.
  • See if sugar startup has slowed down consistently (it was ~25 seconds slower when the problem was reproduced in the field).

Unfortunately, Record does not currently work in Joyride (currently 184), so this test will need to wait for that bug (#4422) to be fixed. Also, restarting just X using ctrl+alt+backspace doesn't work right now (#4561), so timing it is sort of hard. What should I use as a marker to begin timing?

  Changed 6 years ago by AlexL

Build Joyride 184 q2d03

I filled the journal with 201 items, most of which were clipboard image objects. The startup time was ~26 seconds before this and ~28 after, so there was no real change. I will continue to add items to this laptop and test again when I'm past 300 or 400 items.

follow-up: ↓ 49   Changed 6 years ago by AlexL

Is it possible that the index or metadata in datastore/store were corrupted or got messed up in some way? If they could view the journal entries from the Write activity, the datastore/store probably still contained all their files, but for some reason the journal wasn't displaying them.

in reply to: ↑ 48   Changed 6 years ago by tomeu

Replying to AlexL:

Is it possible that the index or metadata in datastore/store were corrupted or got messed up in some way? If they could view the journal entries from the Write activity, the datastore/store probably still contained all their files, but for some reason the journal wasn't displaying them.

The files that stored the index info were so messed up that they took a long time to read: so long that the Journal's connection to the DS timed out, and it wouldn't recover from that timeout.

Activities started afterwards could connect to the DS faster, so they worked fine.

I think we fully understand what happened and this can be closed, as we have already taken the needed actions (launching the DS with higher precedence, making the filesystem more robust, and making the writes to the DS index saner).

  Changed 6 years ago by marco

  • status changed from new to closed
  • resolution set to fixed

I agree. We tracked down all the issues and solved all of them at several levels. Alex did some testing and was not able to reproduce it. So I think we can close this one.

Note: See TracTickets for help on using tickets.