Ticket #9972 (closed defect: fixed)

Opened 5 years ago

Last modified 4 years ago

openfirmware writes fail on an ext2 filesystem with large_file feature set

Reported by: Quozl
Owned by: wmb@…
Priority: normal
Milestone: 10.1.1
Component: ofw - open firmware
Version: Development build as of this date
Keywords:
Cc:
Action Needed: add to build
Verified: no
Deployments affected:
Blocked By:
Blocking:

Description

Context: OpenFirmware Q3A25 Q3A26, with an ext2 filesystem 1050000 sectors in size, using the Forth verbs copy and to-file .

Symptom: the verbs fail and display a Flushbuf error. With show-aborts on the extended output is:

<buffer@ff9cf498>:53: Flushbuf error

Analysis: if the size of the filesystem is reduced to 1040000 sectors, the symptom does not occur. Output from dumpe2fs shows a different structure is chosen by mke2fs at this size threshold.

Logs: serial console logs from the good and fail case, as well as the dumpe2fs output, have been mailed to Mitch. The test scripts that generate the filesystem with the boot/olpc.fth file on it can be found at:

git clone http://dev.laptop.org/~quozl/6210/.git/
browser: http://dev.laptop.org/~quozl/6210/

Change History

  Changed 5 years ago by pgf

i've also gotten the Flushbuf error on a 1G USB stick (see comment 6 on #9957). here's the mkfs output for that filesystem:

==chalk,pgf(1)>> .s mke2fs /dev/sdf1
mke2fs 1.41.4 (27-Jan-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
61056 inodes, 244087 blocks
12204 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=251658240
8 block groups
32768 blocks per group, 32768 fragments per group
7632 inodes per group
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376

Writing inode tables: done                            
Writing superblocks and filesystem accounting information: done


  Changed 5 years ago by pgf

more info about the above filesystem:

==chalk,root(1)>> /sbin/parted /dev/sdf "unit s print"
Model: Kingston DataTraveler 2.0 (scsi)
Disk /dev/sdf: 1952768s
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start  End       Size      Type     File system  Flags
 1      34s    1952734s  1952701s  primary  ext2         boot 

==chalk,root(1)>> grep sdf /proc/partitions
   8       80     976384 sdf
   8       81     976350 sdf1

  Changed 5 years ago by wmb@…

  • status changed from new to assigned

The problem is EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER

In "classic" ext2, every block group contained a backup copy of the superblock and the group descriptors. With SPARSE enabled, only groups 0, 1, and powers of 3, 5, and 7 have backup copies.

OFW was trying to write backups to every group, and in doing so, overwrote inode bitmaps for those groups that should not have backups.
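
The sparse_super rule described above can be sketched in a few lines. This is only an illustration, not the OFW code from svn 1657; the function names are made up for the example. The geometry used in the final call (8 block groups, 32768 blocks per group, first data block 0) is taken from the mke2fs output earlier in this ticket, and the result agrees with the "Superblock backups stored on blocks" line there.

```python
def is_power_of(n, base):
    """Return True if n == base**k for some integer k >= 1."""
    if n < base:
        return False
    while n % base == 0:
        n //= base
    return n == 1


def has_superblock_backup(group, sparse_super=True):
    """Return True if block group `group` carries a superblock copy."""
    if not sparse_super:
        return True  # classic ext2: every group has a copy
    if group in (0, 1):
        return True
    return any(is_power_of(group, base) for base in (3, 5, 7))


def backup_superblock_blocks(group_count, blocks_per_group, first_data_block=0):
    """Block numbers of the backup copies (group 0 holds the primary)."""
    return [
        first_data_block + g * blocks_per_group
        for g in range(1, group_count)
        if has_superblock_backup(g)
    ]


# Geometry of the 1G USB stick reported above: 8 groups of 32768 blocks.
print(backup_superblock_blocks(8, 32768))
```

Groups 2, 4 and 6 are skipped, so a writer that treats every group as having a backup would overwrite whatever the filesystem placed at the start of those groups.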

  Changed 5 years ago by wmb@…

  • next_action changed from never set to test in build

Fixed by svn 1657. Test in http://dev.laptop.org/~wmb/q3a26d.rom

  Changed 5 years ago by Quozl

  • version changed from not specified to Development build as of this date
  • milestone changed from Not Triaged to 1.5-software-later

  Changed 5 years ago by Quozl

  • milestone changed from 1.5-software-later to 1.5-software-update

  Changed 4 years ago by wmb@…

Deployed in q3a29.

  Changed 4 years ago by Quozl

Tested in Q3A29, symptom still occurs.

Simplified Reproducer:

  • on build os108 insert a USB stick of at least 1Gb capacity, with no content of value,
  • wipe all partitions and create one of 1050000 sectors, with an ext2 filesystem,
    parted --script /dev/sdZ mklabel msdos
    parted --script /dev/sdZ mkpart primary ext2 0 1050000s
    mke2fs -q /dev/sdZ1
    
  • insert into a laptop running Q3A29 and power up, hitting ESC to obtain ok,
  • obtain a directory listing and use to-file to create a file.

Output:

ok dir u:\
ext2-file-system
[... lost+found listed ...]
ok to-file u:\test cr
Flushbuf error
ok 

follow-up: ↓ 11   Changed 4 years ago by Quozl

Identical symptom occurs with an external SD card of 4Gb capacity that is partitioned and prepared in the same way.

ok to-file ext:\test cr
Flushbuf error

  Changed 4 years ago by Quozl

Retested on Q3A29d ... the above symptoms persist.

A possibly related symptom occurs on a 4GB USB stick without restricting the partition size:

OLPC D1, 1 GiB memory installed, S/N SHC93701192
OpenFirmware  CL1   Q3A29d Q3A   EC Firmware Ver:1.9.21

Type 'help' for more information.

ok p2
USB2 devices:
/pci/usb@10,4/scsi@1,0
/pci/usb@10,4/scsi@1,0/disk
USB1 devices:
ok .partitions u
Partition  Region   Boot  Format    Size (MB)

    1      Primary  No    ext2          3836
ok to-file u:\test cr
Can't open file
ok 

Different ticket?

in reply to: ↑ 9   Changed 4 years ago by Quozl

Retested on Q3A31, identical symptom occurs with an external SD card of 4Gb capacity with one partition filling the entire device.

ok to-file ext:\test cr
Flushbuf error

  Changed 4 years ago by Quozl

Progress update. Reproduced excluding to-file; the close-file word in the following throws the Flushbuf error:

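0 value fd   \ holds the open file handle

: test   ( -- )
  s" ext:\bar" w/o create-file throw to fd   \ create the file write-only
  s" data" fd write-file throw               \ write four bytes
  fd close-file throw                        \ this close throws Flushbuf error
;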

  Changed 4 years ago by Quozl

  • summary changed from openfirmware writes fail on an ext2 filesystem of certain size range to openfirmware writes fail on an ext2 filesystem with large_file feature set

Progress update.

unknown-extensions? returns non-zero for the test case. The cause is that the ext2 filesystem superblock struct member s_feature_ro_compat has bit EXT2_FEATURE_RO_COMPAT_LARGE_FILE set.

This corresponds to large_file as shown by dumpe2fs or tune2fs, but this bit cannot be set or cleared from Linux using mke2fs or tune2fs.

RH BZ 258381 suggests this was to be changed.

With incompatible filesystem features present, OpenFirmware uses nullwrite silently. Theory: this manifests later as Flushbuf error.
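
For checking a test filesystem from Linux, the feature check can be approximated as below. This is a rough sketch, not a port of unknown-extensions?. The field offsets (magic at byte 56, s_feature_ro_compat at byte 100 of the superblock, which itself starts 1024 bytes into the partition) and the three flag values are from the ext2 on-disk format. Which bits OFW accepted before svn 1720 is not stated in this ticket, so the `supported` argument is a hypothetical parameter.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # primary superblock starts 1024 bytes into the volume
SUPERBLOCK_SIZE = 1024
EXT2_MAGIC = 0xEF53

RO_COMPAT_NAMES = {
    0x0001: "sparse_super",
    0x0002: "large_file",
    0x0004: "btree_dir",
}


def load_superblock(path):
    """Read the raw primary superblock from a partition or image file."""
    with open(path, "rb") as f:
        f.seek(SUPERBLOCK_OFFSET)
        return f.read(SUPERBLOCK_SIZE)


def read_ro_compat(superblock):
    """Return s_feature_ro_compat from raw superblock bytes."""
    (magic,) = struct.unpack_from("<H", superblock, 56)
    if magic != EXT2_MAGIC:
        raise ValueError("not an ext2 superblock")
    (ro_compat,) = struct.unpack_from("<I", superblock, 100)
    return ro_compat


def describe(ro_compat):
    """Names of the known ro_compat bits that are set."""
    return [name for bit, name in sorted(RO_COMPAT_NAMES.items()) if ro_compat & bit]


def unknown_ro_compat(ro_compat, supported):
    """Bits that are set but not in `supported` (hypothetical writer capability mask)."""
    return ro_compat & ~supported
```

For example, `describe(read_ro_compat(load_superblock("/dev/sdZ1")))` lists the read-only-compatible features on the reproducer partition; a non-zero result from `unknown_ro_compat` means a writer with that capability mask should refuse to write.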

  Changed 4 years ago by wmb@…

  • next_action changed from test in build to add to build

Fixed by svn 1720.

follow-up: ↓ 17   Changed 4 years ago by Quozl

Tested with svn 1727.

ok 0 value fd
ok s" u:\test.txt" w/o create-file throw to fd
Can't open file

Yet to dig deeper.

  Changed 4 years ago by wmb@…

The create-file problem is not directly correlated to the large_file extension; I used tune2fs to set the large_file option on the boot partition on the internal SD, and the create-file command shown above worked correctly.

in reply to: ↑ 15   Changed 4 years ago by Quozl

Replying to Quozl:

Yet to dig deeper.

Symptom no longer reproduces. Can open files on a 4Gb external SD and 4Gb USB stick fine.

So this ticket is awaiting a release of svn 1720 or later.

  Changed 4 years ago by wmb@…

I may have fixed the problem with copy on 4G USB stick - try scp dev.laptop.org:~wmb/q3a33a.rom (do not go directly to 33b or 33c - they are tests for a different thing and do not have the copy fix).

The problem - I think - is that copy has the USB disk driver open twice simultaneously - once for the source file and once for the destination file. Each instance has a separate instance of the deblocker package open, and the deblocker caches some disk blocks for the purpose of block size conversion. If the two deblocker instances are both caching the same data, they can get out of sync, leading to bad results.

I'm not entirely sure the fix in q3a33a.rom will be effective, but it has a fighting chance.
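
The suspected failure mode can be shown with a toy model. This is not the deblocker package; `Disk` and `CachingInstance` are made-up stand-ins that only capture the idea of two independent single-block caches sitting over one device.

```python
class Disk:
    """Toy block device."""

    def __init__(self, nblocks, block_size=512):
        self.blocks = [bytes(block_size) for _ in range(nblocks)]

    def read(self, n):
        return self.blocks[n]

    def write(self, n, data):
        self.blocks[n] = bytes(data)


class CachingInstance:
    """Stand-in for one open instance: keeps a private copy of one block."""

    def __init__(self, disk):
        self.disk = disk
        self.cached_no = None
        self.cached = None
        self.dirty = False

    def _load(self, n):
        if self.cached_no != n:
            self.flush()
            self.cached = bytearray(self.disk.read(n))
            self.cached_no = n

    def read_byte(self, n, offset):
        self._load(n)
        return self.cached[offset]

    def write_byte(self, n, offset, value):
        self._load(n)
        self.cached[offset] = value
        self.dirty = True

    def flush(self):
        if self.dirty:
            self.disk.write(self.cached_no, self.cached)
            self.dirty = False


def demo():
    disk = Disk(4)
    a = CachingInstance(disk)
    b = CachingInstance(disk)
    a.read_byte(0, 0)            # a now caches block 0
    b.write_byte(0, 0, 0xAA)     # b changes block 0 ...
    b.flush()                    # ... and writes it to the disk
    stale = a.read_byte(0, 0)    # a still sees its old copy
    a.write_byte(0, 1, 0xBB)     # a modifies its stale copy
    a.flush()                    # and overwrites b's update
    return stale, disk.read(0)[0], disk.read(0)[1]


print(demo())
```

In the model, instance a never notices b's write and later writes its stale copy back, so b's byte is lost. Sharing one cache, or invalidating on every open, would avoid this in the toy; whether either matches the change in q3a33a.rom is not stated here.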

  Changed 4 years ago by wmb@…

Actually, try scp dev.laptop.org:~wmb/q3a33d.rom .

I found another problem related to miscalculation of the block numbers for backup group descriptors, and fixed it in 33d - which also has the multiple-open fix.

  Changed 4 years ago by Quozl

I tried with 33d and got directory content corruption then the abort on second copy:

ok dir u:\
ext2-file-system
---drwxr-xr-x      4096  2010-02-07 21:38:13  .
---drwxr-xr-x      4096  2010-02-07 21:38:13  ..
---drwx------     16384  2010-02-07 21:37:33  lost+found
---drwxr-xr-x      4096  2010-02-07 21:38:13  boot
----rw-r--r--         0  2010-02-07 21:38:13  touched
----rw-r--r--         6  2010-02-07 21:38:13  hello
----rw-r--r--        12  2010-02-07 21:38:13  hello-world
----rw-r--r--        29  2010-02-07 21:38:13  date
---lrwxrwxrwx         5  2010-02-07 21:38:13  hello-link -> hello
---drwxr-xr-x      4096  2010-02-07 21:38:13  directory
---lrwxrwxrwx         9  2010-02-07 21:38:13  directory-link -> directory
---?rw-r--r--         0  2010-02-07 21:38:13  fifo
---?rw-r--r--         0  2010-02-07 21:38:13  node
ok copy u:\hello u:\copy
ok dir u:\
ext2-file-system
---drwxr-xr-x      4096  2010-02-07 21:38:13  .
---drwxr-xr-x      4096  2010-02-07 21:38:13  ..
---drwx------     16384  2010-02-07 21:37:33  lost+found
---drwxr-xr-x      4096  2010-02-07 21:38:13  boot
----rw-r--r--         0  2010-02-07 21:38:13  touched
----rw-r--r--         6  2010-02-07 21:38:13  hello
----rw-r--r--        12  2010-02-07 21:38:13  hello-world
----rw-r--r--        29  2010-02-07 21:38:13  date
---lrwxrwxrwx         5  2010-02-07 21:38:13  hello-link -> hello
---drwxr-xr-x      4096  2010-02-07 21:38:13  directory
---?r-x-w-r-x        81  1973-09-21 06:21:52  directory-link
---?rwxrwxrwx 4294967295  2106-02-06 06:28:15  fifo
---lrwxrwxrwx         9  2010-02-07 21:38:13  node
---?rw-r--r--         0  2010-02-07 21:38:13  copy
ok copy u:\hello u:\copy
Overwrite u:\copy? [y/n]? y
1 attempt to destroy file system
1 attempt to corrupt superblock or group descriptor
ok banner
OLPC D3, 512 MiB memory installed, S/N SHC9500001C
OpenFirmware  CL1   Q3A33d Q3A   EC Firmware Ver:1.9.22

ok 

  Changed 4 years ago by wmb@…

Detailed analysis of the block data traffic on that USB stick showed several cases where the data read from a block disagreed with the previous write to that block.

We tried two strategies to correct the problem, both to no avail:

  • Changed the partition map so the first partition started at 4 MiB (was sector 1), so subsequent multi-block writes were better aligned with respect to internal NAND pages
  • Reduced "max-transfer" in the USB disk driver node to 4 KiB (from 16 KiB)
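
The arithmetic behind the two attempts can be sketched as follows. It assumes 512-byte sectors (as parted reported for the other stick earlier in this ticket); the 4 KiB boundary used in the first check is a hypothetical NAND page size, since the real page and erase-block sizes of the stick are not known here.

```python
SECTOR_SIZE = 512          # assumption: 512-byte logical sectors
KIB = 1024
MIB = 1024 * KIB


def is_aligned(start_sector, boundary_bytes, sector_size=SECTOR_SIZE):
    """True if the partition's first byte falls on a multiple of boundary_bytes."""
    return (start_sector * sector_size) % boundary_bytes == 0


def transfer_count(total_bytes, max_transfer):
    """Number of device transfers needed when each is capped at max_transfer bytes."""
    return -(-total_bytes // max_transfer)   # ceiling division


# Old layout: partition starts at sector 1, i.e. byte offset 512.
print(is_aligned(1, 4 * KIB))
# New layout: partition starts at 4 MiB, i.e. sector 8192.
print(is_aligned(4 * MIB // SECTOR_SIZE, 4 * MIB))
# A 16 KiB write is split into four transfers once max-transfer is 4 KiB.
print(transfer_count(16 * KIB, 4 * KIB))
```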

  Changed 4 years ago by wmb@…

  • status changed from assigned to closed
  • resolution set to fixed

I think this "USB data read back is not what was written" is the same problem as #10067, so I'm closing this one. This ticket has already tracked several different issues and is getting confusing. The title no longer conveys the remaining problem.
