Ticket #9100 (closed defect: fixed)

Opened 6 years ago

Last modified 4 years ago

2.6.27+ early boot crashes on XO1

Reported by: dsd Owned by: wmb@…
Priority: blocker Milestone: 10.1.2
Component: ofw - open firmware Version: 1.5 Software Build os206 aka 10.1.1
Keywords: Cc: mtd, wmb@…, dsaxena, glowcohen, hal.murray, bernie
Action Needed: no action Verified: no
Deployments affected: Blocked By:
Blocking:

Description

After upgrading from 8.2 to joyride, I see boot failures quite often. Reproduced on 2 XOs.

It hangs on the final stage of OFW output, where it talks about "loading ramdisk" etc. The frame key does not invoke OFW debugger, so I think this means we have entered and crashed in kernel land.

I don't have serial console kit at the moment.

Attachments

olpc.fth (2.3 kB) - added by dsd 4 years ago.
olpc.fth that adds "Leaping" message boot hook

Change History

  Changed 6 years ago by dsaxena

  • milestone changed from 8.2.1 to 9.1.0

2.6.27 is the kernel for 9.1, not for 8.2.1. Changing milestone.

  Changed 5 years ago by dsd

  • summary changed from 2.6.27 early boot crashes to 2.6.27+ early boot crashes

My findings so far

  • nothing is output from the kernel on the serial console, so it is a very early freeze from the kernel standpoint
  • the serial console does show a newline character *after* the "loading ramdisk" line, which seems to indicate that the ramdisk does load correctly and the crash is triggered by something that happens after
  • the OFW frame key does not cause the debugger to start
  • waiting a few minutes does seem to increase success rate - I have had very little success with on,off,on,off,on,off type cycles, but if you insert a few minutes wait inbetween then it does feel to be more likely to boot
  • once booted, you can reboot indefinitely and it won't hang
  • I discount reports that this is related to external devices, swap partitions, etc. I run my XOs with the standard configuration and no external devices and I frequently see this bug
  • I've seen this ever since we upgraded kernel after 8.2
  • the 8.2 kernels continue to work fine

  Changed 5 years ago by mtd

  • cc mtd added

  Changed 5 years ago by dsd

  • milestone changed from 9.1.0-cancelled to 1.0-software-update

  Changed 5 years ago by dsd

I still get this regularly on 2.6.31

  Changed 5 years ago by bernie

We're also seeing this quite often just after we upgrade to our F11-XO1 release.

On some laptops, the problem persists across reboots. In two cases, I could resolve the issue by pulling the AC and battery, which points at the EC.

I don't have a serial cable here, and anyway these laptops probably don't have the connector soldered on the board.

  Changed 4 years ago by dsd

  • cc wmb@…, dsaxena added
  • summary changed from 2.6.27+ early boot crashes to 2.6.27+ early boot crashes on XO1

Mitch, I know you're busy with other projects, but perhaps you have some suggestions on the next step of debugging this issue?

Summary: all kernel versions newer than the one in the last official release (8.2) crash fairly frequently during boot on XO-1 in a mysterious manner. OFW is still on-screen at the time of the crash, with the line "Loading ramdisk" being the last thing there. Frame key does not invoke OFW debugger. Nothing more is output on serial console.

  Changed 4 years ago by dsaxena

Daniel, one option is to use early_printk() to get output before the full console system is running. You can enable it with earlyprintk=ttyS0 and then hack kernel/printk.c:printk() to just call early_printk(). I haven't tested this yet but I'm fairly certain it will work and let you see how far into the boot we're getting.

  Changed 4 years ago by dsd

Haven't seen this in a week of continued XO-1 hacking, maybe it's gone away. Let's see what community testing of the upcoming build brings out.

  Changed 4 years ago by dsd

There have been 2 independent reports of boot-time hangs, but in both cases pretty boot was enabled so it's hard to say whether it is this bug or another. For the next release we have disabled pretty boot to help diagnosis.

  Changed 4 years ago by dsd

In #10217 this was reproduced on 10.1.2 build 300 with firmware Q2E44.

  Changed 4 years ago by dsd

  • cc glowcohen added

  Changed 4 years ago by Quozl

  • next_action changed from never set to diagnose
  • version changed from not specified to Development build as of this date
  • milestone changed from 1.0-software-update to 10.1.2

  Changed 4 years ago by glowcohen

Installed os301, had the boot hang reoccur. The output was the same as with the game check key on os300, stopped at the same point. Needed to retry multiple times to get successful boot.

  Changed 4 years ago by dsd

  • cc hal.murray added

Hal, glowcohen:

It seems like you can both reproduce this bug with ease (on a daily basis?). I've spent a few hours powering up and down a group of XOs without success, even though I have seen this bug appear many times in the past (when I wasn't hunting it...)

2 next steps in diagnosis:

Please replace the /versions/boot/current/boot/olpc.fth file with the version I am attaching. This will add another line during the bootup sequence that says "Leaping...". When the hang occurs, we need to know if the "Leaping" message appears, or if the last line on screen continues to be "Loading ramdisk"

Secondly, I'd like to reconfirm an earlier theory that this only happens on cold boot, and never on reboot. This can be done by adding a line to the /etc/rc.local file that says:

(sleep 60 && reboot) &

Then reboot manually and leave the machine overnight. It will reboot on a continuous cycle. See if the boot hang ever occurs.

When you want to exit the loop, go to a virtual terminal as the system loads and quickly run "killall sleep" then revert the modification to /etc/rc.local. Or just reflash the system.

thanks!

Changed 4 years ago by dsd

olpc.fth that adds "Leaping" message boot hook

  Changed 4 years ago by dsd

One more thing: Before starting the reboot loop, disable idle-suspend as well.

follow-up: ↓ 18   Changed 4 years ago by hal.murray

What do I do to get the olpc.fth as text without the HTML wrapper?

I haven't been able to reproduce the hangs on booting. (The bug I can easily reproduce is WiFi hanging.)

I've had an XO-1 running os301 rebooting every few minutes for several days without hanging. That was with power saving disabled.

If I leave power saving enabled, the WiFi hangs after a while. I might be able to tweak my setup so that it doesn't break WiFi but does use the WiFi to wakeup occasionally, but I haven't tried working on that.

Part of the time it had a GPS USB serial unit installed.

in reply to: ↑ 17   Changed 4 years ago by pgf

Replying to hal.murray:

What do I do to get the olpc.fth as text without the HTML wrapper?

it's annoying, for sure. click on the attachment, and then at the very bottom of that page, it says "Download in other formats", and under _that_ it says, in pale gray, "Original format". while that looks like it's simply labeling the page you're looking at, it's not -- click on it -- it's actually a link to the format you're looking for.

  Changed 4 years ago by bernie

We have seen this bug today on an XO-1 running F11-0.88 build 300.

Boot kept hanging as described until we pulled the battery to reset the EC.

  Changed 4 years ago by dsd

I ran 3 systems overnight in a reboot loop, no crash.

  Changed 4 years ago by hal.murray

I've been booting from power-off every now and then. I haven't been able to reproduce it yet.

I'm using the power button to turn it off.

Did anything change between os300 and os301 that might have fixed this?

  Changed 4 years ago by glowcohen

I had the problem on both os300 and os301, also on os300py, which I installed after os301 and prefer, (when it boots).

  Changed 4 years ago by dsd

Thanks glowcohen, looks like you're pretty much the only person who can reproduce this frequently. Please help us with the 2 tests above.

  Changed 4 years ago by hal.murray

I just got one.

The last thing on the screen is "Leaping..."

  Changed 4 years ago by dsd

Hal, if possible please leave the machine in this state- don't touch it. We might be able to come up with some debugging ideas. Will get back to you soon.

  Changed 4 years ago by hal.murray

Hal, if possible please leave the machine in this state- don't touch it. We might be able to come up with some debugging ideas. Will get back to you soon.

No problem.

In case it helps: Input power is 8 watts, 0.12 A at 122 V.

  Changed 4 years ago by wmb@…

I like dsaxena's idea of using early_printk(). You can put calls to that in start_kernel() to see how far through that it is getting.

If early_printk() doesn't work early enough, you can do a poor-man's serial character output routine with:

   while ((inb(0x3fd) & 0x20) == 0)
       ;
   outb('A', 0x3f8);
   while ((inb(0x3fd & 0x40) == 0)
       ;

Replace 'A' with different characters, or make a function.

  Changed 4 years ago by glowcohen

Sorry, have been traveling without my XO. Looks like you are making progress, good.

  Changed 4 years ago by hal.murray

Hal, if possible please leave the machine in this state- don't touch it. We might be able to come up with some debugging ideas. Will get back to you soon.

If nobody says anything in a few hours, I'll reboot it.

  Changed 4 years ago by Quozl

I'm on duty, but I've no diagnostic suggestions, sorry. wmb's suggestions above will probably require a kernel replacement and further attempts to reproduce.

  Changed 4 years ago by dsd

Unfortunately the suggestion requires a serial console, and I don't think glowcohen nor Hal have one of these. :(

  Changed 4 years ago by hal.murray

I just got another one.

Again, the last thing on the screen was "Leaping..."

It may be that the trick to provoking this quirk on my system is to leave it turned off for a while. It was was just sitting there, plugged in with a good battery, because I had forgotten to boot it after turning it off cleanly when the WiFi hung in LED off mode.

  Changed 4 years ago by Quozl

We need an instrumented kernel with some way to get output from it, so we can find out where the kernel is failing to proceed. Considering all output devices:

  • serial port ... is the most obvious way, but this would require hal.hurray or glowcohen to add a connector and level converter or swap the laptops for our analysis.
  • the speakers ... probably going to take too long to figure out, but a setup recipe for the sound device is in both the ALSA driver and OpenFirmware.
  • the LEDs ... from memory there's a GPIO to toggle ... and with a series of kernels over several days a kernel developer may find the point of failure. An ugly prospect.
  • USB or wireless ... the software stack required is too large to consider.
  • SD ... my gut feel is that there's another stack too large.
  • the display ... the kernel frame buffer isn't available at this time, but is there a possibility of calling OFW from the kernel, to emit strings on the panel?

  Changed 4 years ago by hal.murray

I think there are two parts to this discussion.

What should we do for this particular bug? What can we do to get ready for similar bugs in the future?

How much code is there before normal screen output works?

* the LEDs ... from memory there's a GPIO to toggle ... and with a series of kernels over several days a kernel developer may find the point of failure. An ugly prospect.

Something like binary search assumes we have a reasonable test case. I'm not sure that's true. I've only hit it twice, and that was probably by accident/luck rather than because I have a good recipe.

How many LEDs are there that are easy to control from the CPU and/or EC? Can we learn enough with simple binary encoding? [It died between step X and step X+1.]

You can send a lot of info through a single LED by encoding a number as blink patterns. The key idea is that you aren't in a hurry so you can transmit slowly and retransmit many times so the observer can double check. Consider something like:

Blink, Blink, Blink, pause, Blink, Blink, long pause, repeat.

I think that approach needs something like an EC to do the work.

  Changed 4 years ago by wmb@…

We can dance around this problem or we can do what I suggested and debug it. we need to send a serial setup to the people who can reproduce the problem .. Perhaps someone who is not ostensibly on vacation could make that happen.

  Changed 4 years ago by dsd

I have pushed a hack to olpc-2.6.31 which outputs various characters on the serial console during very early boot on XO-1. So the next step is for someone to reproduce this crash with a serial console attached, viewing the last characters printed during the bootup sequence.

glowcohen, if you're interested in continuing to help diagnose this, please send your shipping address and a contact phone number to dsd@…. OLPC will send out a serial console kit, which basically allows you to connect the XO motherboard to a USB port on another computer and receive these diagnostic characters.

  Changed 4 years ago by dsd

The email address got hidden. please send the details to dsd @ laptop . org

  Changed 4 years ago by hal.murray

I may be getting closer to being able to semi-reproduce this.

I had 4 instances yesterday: 2 pairs. On one, it had been off (but plugged in) for a while.

A few minutes ago, I had 3 more. Again, it had been off for an hour or two.

If somebody has a good lab setup, it might be interesting to try some cold spray. I'm thinking of cooling the whole board slightly rather than freezing a particular chip. Or maybe put it in a freezer for a while.

This could (easily) be just my unit.

  Changed 4 years ago by martin.langhoff

  • priority changed from normal to blocker

As goal for 10.1.2 is sync release - this is obviously a blocker.

  Changed 4 years ago by cjb

I don't think it's obviously a blocker, although I'd obviously love to understand the bug better. I'm not seeing it with any regularity, and there's a workaround of just rebooting the machine, and the workaround is not hard to figure out; it's what people will try on their own.

  Changed 4 years ago by glowcohen

I appear to be in the minority, but I have to retry 4 times to get a successful boot, all the time.

  Changed 4 years ago by dsd

I don't think its a blocker either. I've tested a lot of laptops now, working for a month on this release, without reproducing this even once. (I'm not living in the mindset that 10.1.2 will be the last release this year, and we can probably also make good progress on this once Hal gets back from his break)

  Changed 4 years ago by Quozl

We need to assess the frequency of the problem.

glowcohen has one laptop that does it four times out of five. hal.murray has two laptops that does it a few times a day (out of how many successful boots?). I've not seen it on four laptops over several weeks. dsd has not seen it on 23 laptops. Who else has not seen it and on how many laptops?

My gut feel is that this is a hardware problem that was handled gracefully on 8.2.1 but ungracefully on our current builds, but we need to analyse it further.

  Changed 4 years ago by hal.murray

hal.murray has two laptops that does it a few times a day (out of how many successful boots?).

I have one XO that can easily reproduce the problem. I don't think I've see it on the other one.

The trick is to let it sit a while turned off. I don't know how long it has to wait.

It can be frustrating to get it to boot. In hindsight, I encountered this problem several months ago. Sorry I didn't report it then. (I was chasing something else and didn't notice the correlation with being turned off for a while.)

My first guess was that something like the CPU had to warm up. Since then, I've discovered that it only takes a few seconds of memtest to make it boot. There is probably more information in there, but I don't have any good ideas.

  Changed 4 years ago by tonyforster

I reflashed to OS373pyg and it crashed on the second and third boots. I haven't been able to replicate with hundreds of boots since. It crashed after the stuff about ramdisk and after the newline but before the screen changes from white background to black background.

Reading the previous posts, the common features are that the XO hadn't been used for a couple of weeks, it was cold(10C) and I had installed a new OS

  Changed 4 years ago by bernie

  • cc bernie added

  Changed 4 years ago by dsd

Hal has received the serial console and we've narrowed down the conditions of the hang. Call trace is i386_start_kernel --> reserve_ebda_region

http://lxr.free-electrons.com/source/arch/x86/kernel/head.c?v=2.6.31

This function does:

#define BIOS_LOWMEM_KILOBYTES 0x413
	lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
	lowmem <<= 10;

At this point, lowmem usually has value 0.

In the case when it crashes, lowmem has value 0x800 (meaning that the memory at 0x413 has value 0x2 instead of 0).

Then:

	ebda_addr = get_bios_ebda();

This consistently returns 0x20000 (in both the hang and non-hang case)

The function then does a few potential tweaks to the lowmem value (based on the values of lowmem and ebda_addr), then calls:

	/* reserve all memory between lowmem and the 1MB mark */
	reserve_early_overlap_ok(lowmem, 0x100000, "BIOS reserved");

and the hang occurs somewhere within that function call.

So, why is the value of BIOS_LOWMEM_KILOBYTES changing? Is this something controlled by OFW?

  Changed 4 years ago by dsd

I think we're onto something here. One key difference between 2.6.25 and the newer kernels is that 2.6.25 did not do the BIOS_LOWMEM_KILOBYTES dance. It just read the EBDA address, and reserved 4kb.

http://lxr.free-electrons.com/source/arch/x86/kernel/setup_32.c?v=2.6.25#L391

This changed in this commit:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=dedd04be71ce

which is evidently what broke things for us.

  Changed 4 years ago by wmb@…

Try http://dev.laptop.org/~wmb/q2e44e.rom . It sets up the BIOS Data Area, including location 0x413. Previous versions of OFW did not bother to set up the BDA for Linux (they did set it up for Windows) because, since time immemorial, Linux did not care about the contents of the BDA.

  Changed 4 years ago by wmb@…

If you have an XO-1.5, you can try http://dev.laptop.org/~wmb/q3a50e.rom . It similarly sets up the BIOS data area.

  Changed 4 years ago by hal.murray

Try http://dev.laptop.org/~wmb/q2e44e.rom . It sets up the BIOS Data Area...

I think that fixed it, at least for my setup. Thanks.

I've tried several times. I'm pretty sure that the old setup would have hung, but it's hard to be sure.

The key step in the make-it-hang recipe, at least for my system, is to leave the XO off "long enough". In the past, an hour has been plenty. I've given it at least an hour several times. It has always worked fine.

I'll keep trying. Maybe I'll switch back to the old firmware to verify that my recipe still works.

  Changed 4 years ago by Quozl

(boot tested q2e44e and q3a50e)

  Changed 4 years ago by wmb@…

  • owner changed from dsaxena to wmb@…
  • status changed from new to assigned
  • next_action changed from diagnose to add to release

svn 1941 instantiates the BIOS Data Area in all cases. Before, the BDA was instantiated only when booting Windows. This is a checkin of the fix from q2e44e/q3a50e.

  Changed 4 years ago by wmb@…

  • component changed from kernel to ofw - open firmware

  Changed 4 years ago by Quozl

  • version changed from Development build as of this date to 1.5 Software Build os206 aka 10.1.1

(reviewed code change.)

  Changed 4 years ago by dsd

Data point for future reference:

Mitch's "poor-man's serial character output routine" (above) was the method we used to debug this, and it was easy once we had a serial console combined with a laptop that reproduces the problem (thanks Hal).

And a tip that would have saved us a few rounds of back-and-forth: next time this happens, modify panic() to send the message through the serial routine, since in this case we were encountering a panic that would have clarified things substantially:

Overlapping early reservations 800-fffff BIOS reserved to 0-fff BIOS data page

(thanks to Quozl for the idea)

  Changed 4 years ago by cjb

  • next_action changed from add to release to test in build

test in os852

  Changed 4 years ago by Quozl

Isn't failing for me, but then it never did. Bernie, Hal, Tony, Glowcohen, please test os852. Bernie, please test q2e45 against your builds. Please report the number of successful boots.

  Changed 4 years ago by hal.murray

Isn't failing for me, but then it never did. Bernie, Hal, Tony, Glowcohen, please test os852. Bernie, please test q2e45 against your builds. Please report the number of successful boots

I haven't been able to make it fail. So far, that's only two tries, but I'm pretty sure they would have failed with the old setup.

  Changed 4 years ago by Quozl

(Withdrawn has been build os852 mentioned above this point in ticket history. 96379866c972367f40ec3e8accbaef49 os852.img)

  Changed 4 years ago by glowcohen

Update: Currently running os300py - I installed firmware q2e45, and so far have not encountered the boot hang problem.

  Changed 4 years ago by wmb@…

  • status changed from assigned to closed
  • resolution set to fixed

Based on the fact that we have root cause, a fix for that, and success reports from the two problem reporters, I'm declaring victory.

  Changed 4 years ago by Quozl

  • next_action changed from test in build to no action

Agreed.

Note: See TracTickets for help on using tickets.