Ticket #216 (closed defect: fixed)

Opened 7 years ago

Last modified 7 years ago

Kernel doesn't work w/some USB HDDs

Reported by: mfoster Owned by: davidz
Priority: high Milestone: BTest-1
Component: kernel Version:
Keywords: Cc: cjb
Action Needed: Verified: yes
Deployments affected: Blocked By:
Blocking:

Description

Somewhere along the line in the kernel, someone got impatient, and apparently reduced the delays after resetting USB peripherals midway through the boot process. As a consequence, the kernel can't find the root on such devices, since /dev/sda1 isn't attached until several seconds *after* the kernel gives up and burrows into the initrd shell. Specifically, this was noted on the 1.8" Seagate ST650211U-RK 5.0GB drive. Ugh.

FYI! MarkF

Attachments

console log.txt (9.3 kB) - added by mfoster 7 years ago.

Change History

follow-up: ↓ 2   Changed 7 years ago by jg

  • owner changed from jg to davidz

Actually, something is funny here; the new system Dave Zeuthen built is supposed to wait arbitrary amounts of time, as I understand it, rather than relying on a fixed arbitrary timeout.

in reply to: ↑ 1   Changed 7 years ago by davidz

  • status changed from new to assigned

Replying to jg:

Actually, something is funny here; the new system Dave Zeuthen built is supposed to wait arbitrary amounts of time, as I understand it, rather than relying on a fixed arbitrary timeout.

Try removing the 'quiet' option from /boot/olpc-boot.sh and paste the output? Also, what is the size of the initramfs, e.g. something like

# ls -l /boot/initrd*

Thanks!

  Changed 7 years ago by mfoster

In addition to the serial output (attached above), the tail of the screen output is:

starting udevd
creating devices
waiting for system to settle
no root yet, udev rule will write symlink...

ls /dev/root: No such file or directory
Bug in initramfs /init detected. Dropping to a shell. Good luck!

bash: no job control in this shell
bash-3.1# scsi 0:0:0:0: Direct-Access SEAGATE ST650211USB 4.02 PQ: 0 ANSI: 2
SCSI device sda: 9757520 512-byte hdwr sectors (5001 MB)
sda: Write Protect is off
sda: assuming drive cache: write through
SCSI device sda: 9757520 512-byte hdwr sectors (5001 MB)
sda: Write Protect is off
sda: assuming drive cache: write through
 sda: sda1
sd 0:0:0:0: Attached scsi removable disk sda

ls -l /boot/initrd* on another system yields:

[root@localhost boot]# ls -l initrd*
-rw-r--r-- 1 root root 2507462 Oct 20 13:42 initrd-2.6.18-1.2711.olpc1.img
lrwxrwxrwx 1 root root 30 Oct 21 20:56 initrd.img -> initrd-2.6.18-1.2711.olpc1.img
[root@localhost boot]#

Cheers! MarkF

  Changed 7 years ago by dilinger

Verified here with build130. Note that LB is what's screwing up here:

OLPC ROM rev_a_20060926-1
Build timestamp: 20061024 19:52:27
GIT: 41b1593a261c15f70390631b3c285e401aa974c4+local
Starting bootmenu.
Please press ESC for the menu..usb 1-1: new full speed USB device using ohci_hcd and address 2
usb 1-1: configuration #1 chosen from 1 choice
scsi0 : SCSI emulation for USB Mass Storage devices
usb 1-2: new low speed USB device using ohci_hcd and address 3
usb 1-2: configuration #1 chosen from 1 choice
input: HID 1241:1203 as /class/input/input0
input: USB HID v1.11 Keyboard [HID 1241:1203] on usb-0000:00:0f.4-2
input: HID 1241:1203 as /class/input/input1
input: USB HID v1.11 Device [HID 1241:1203] on usb-0000:00:0f.4-2
.usb 1-4: new full speed USB device using ohci_hcd and address 4
usb 1-4: configuration #1 chosen from 1 choice
...timeout.
NOTICE: Booting default
scsi 0:0:0:0: Direct-Access     SEAGATE  ST650211USB      4.02 PQ: 0 ANSI: 2
SCSI device sda: 9767520 512-byte hdwr sectors (5001 MB)
sda: Write Protect is off
sda: assuming drive cache: write through
SCSI device sda: 9767520 512-byte hdwr sectors (5001 MB)
sda: Write Protect is off
sda: assuming drive cache: write through
 sda: sda1
sd 0:0:0:0: Attached scsi removable disk sda
argc is 3, argv[1] /flash/boot/vmlinuz, argv[2] ro quiet root=mtd0 rootfstype=jffs2 console=ttyS0,115200 console=tty0 fbcon=font:SUN12x22 pci=nobios video=gxfb:1024x768-16, argv[3] (null)
Starting new kernel
Linux version 2.6.18-1.2711.olpc1 (brewbuilder@hs20-bc1-7.build.redhat.com) (gcc version 4.1.1 20060926 (Red Hat 4.1.1-26)) #1 Thu Sep 28 16:15:38 EDT 2006
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 0000000000000734 (reserved)
 BIOS-e820: 0000000000000734 - 00000000000a0000 (usable)
 BIOS-e820: 0000000000100000 - 0000000007700000 (usable)
119MB LOWMEM available.

Note that by the time LB launches the kernel, it's already determined that it should be booting off nand flash.

  Changed 7 years ago by dilinger

Ok, the problem I'm seeing is LB preferring nand flash over external USB devices, combined w/ it booting off cafe nand flash (I had no idea LB even supported the cafe's nand..). Mark, if that's the same problem you have, then we should change this bug to something like "LB should prefer external USB devices to internal flash". The easiest way to check whether it works for you is to manually select a USB device from LB when it's booting.

  Changed 7 years ago by jg

  • status changed from assigned to closed
  • resolution set to fixed

Also seems to be fixed.

  Changed 7 years ago by blizzard

  • status changed from closed to reopened
  • resolution deleted

Re-opening.

  Changed 7 years ago by blizzard

David has a link to the olpc start script. For reference purposes:

http://david.woodhou.se/olpc-init.txt

He's added this sleep:

/sbin/udevsettle --timeout=30
echo Sleeping for 10 seconds after udevsettle...
/bin/sleep 10
echo slept.

But other than that it's the same as the original script.

  Changed 7 years ago by JordanCrouse

The LAB issue that dilinger posted is due to stupidity in the logic of LAB - probably not related to the original problem (FYI).

AMD always recommends an arbitrary 10s timeout for mounting USB keys as root - USB storage hardware takes a long time to come up anyway, and the Linux kernel doesn't do us any favors in this regard. I appreciate the attempt to have a more intelligent mount process, but sometimes the easiest approach is also the best.

That said, I agree with Chris Blizzard's assertion that we really need to understand what's going on under the covers with this. I do believe that this is fully the fault of badly behaving software.

follow-up: ↓ 12   Changed 7 years ago by blizzard

Mark has said this happens once in a while over a period of hundreds of reboots. Do we have the equipment to test that here in Cambridge so we can debug as well? We'll need to instrument both the kernel and the hardware at some point.

  Changed 7 years ago by jg

  • cc cjb@… added

We can easily set up a tinderbox to repetitively reboot a drive.

Key, however, is having a device (type) believed to exhibit the problem.

We have 6gig seagate drives, but I don't know if we have any of the identical 5 gig seagates that Mark saw this on.
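
For the repeated-reboot testing discussed here, a minimal sketch of tallying failed boots from captured console logs might look like the following. It assumes each boot's serial output is saved as a separate .log file in one directory (a hypothetical layout, not something described in this ticket) and searches for the failure string from Mark's screen output. The function name count_failed_boots is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of any OLPC build): count the console
# logs in a directory that contain the initramfs failure message.
# Assumes one captured log per boot, named *.log.
count_failed_boots() {
    logdir=$1
    failed=0
    for log in "$logdir"/*.log; do
        # Skip the unexpanded glob when the directory holds no logs.
        [ -f "$log" ] || continue
        if grep -q "Bug in initramfs /init detected" "$log"; then
            failed=$((failed + 1))
        fi
    done
    echo "$failed"
}
```

Since the failure is reported as intermittent, a count like this over an overnight reboot run would show how often a given drive type actually hits it.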

in reply to: ↑ 10   Changed 7 years ago by cjb

Replying to blizzard:

Mark has said this happens once in a while over a period of hundreds of reboots. Do we have the equipment to test that here in Cambridge so we can debug as well? We'll need to instrument both the kernel and the hardware at some point.

We have the infrastructure, but we have done and are doing hundreds of reboots, and haven't seen this bug yet. We regularly do reboot cycles overnight.

  Changed 7 years ago by blizzard

Some notes from Mark:

21:14 < MarkF> USB on this machine involves a complex sequence of software.  
               Instead of having a nice single USB master, we have firmware, 
               O.S., drivers, and enumerators all talking to USB.  Each of 
               these codebases wants to reset the USB bus.  When the bus is 
               reset, the core problem that seems to be appearing, verified by 
               USB protocol analyzer, is that the devices aren't ready when the 
               code wants them to be.
21:15 < MarkF> Following the USB reset, even though it seems that devices like 
               USB Flash keys ought to be ready instantly, they are not.  They 
               sometimes take more than 5 seconds to reset. It's hard to see 
               this, because the timing obviously depends on the internal state 
               of their filesystem.
21:15 < MarkF> The major symptoms of these "drive not ready" problems include: 
               * OFW missing the USB key and booting NAND Flash instead
21:16 < MarkF> * The kernel not being able to mount root, and dropping into an 
               initrd shell saying "No job control for this shell"
21:16 < MarkF> * Kernel panics (we've seen several types)
21:17 < MarkF> * Seagate HDDs that sometimes don't work
21:17 < MarkF> * Hangs at "Starting New Kernel"

  Changed 7 years ago by blizzard

Last night Mark sent us a set of traces that helped us track down what we believe the problem is. The script that starts up the machine is set up to handle waiting for devices to get ready for up to 60 seconds after the device appears on the bus. However, it appears that there was a bug in that code that would actually cause the script to terminate before the device appeared.

It's a one line fix and we're making a build now to test our hypothesis.
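
The intended behaviour described above (keep waiting for the root device up to a limit, instead of giving up before it shows up) can be sketched as a simple polling loop. This is only an illustration and not the actual one-line fix, which is not shown in this ticket. It also simplifies the real logic by waiting for a path to exist rather than timing from the moment the device appears on the bus. The function name wait_for_path, the one-second polling interval, and the default timeout are assumptions.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the OLPC init script): poll until
# a path exists or the timeout expires.
#   $1 = path to wait for (e.g. the root device symlink)
#   $2 = timeout in seconds (assumed default: 60)
wait_for_path() {
    path=$1
    timeout=${2:-60}
    waited=0
    while [ ! -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1    # gave up: the path never appeared
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0            # the path exists
}
```

A script could then call something like wait_for_path /dev/root 60 and only drop to a shell if that returns non-zero.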

  Changed 7 years ago by blizzard

OK, try build141. It contains this fix.

  Changed 7 years ago by blizzard

  • status changed from reopened to closed
  • resolution set to fixed

David Woodhouse reports this is fixed with build141. Thanks for all the hard work, guys!

  Changed 7 years ago by cjb

  • cc cjb added; cjb@… removed