Opened 3 years ago

Closed 3 years ago

#10901 closed defect (fixed)

XO-1.75 A3 fails to boot OFW

Reported by: wad Owned by: wad
Priority: blocker Milestone: 1.75-firmware
Component: hardware Version: 1.75-A3
Keywords: 1.75 Cc:
Blocked By: Blocking:
Deployments affected: Action Needed: diagnose
Verified: no

Description

On two occasions now, my 1.75 A3 prototype (#39, fully packaged into a laptop) has failed to boot from a cold start.

It is running OFW Q4A13h and the EC firmware built by Mitch on 5/19, which staggers the SD card turn-ons.

The first time, the message printed on the screen was:
"Data Abort", followed by the ok prompt (nothing else was printed).
The second time, the message printed was "Software Interrupt<cr>ok Software Interrupt<cr>ok" (where <cr> is a carriage return).

In both cases, power cycling the laptop resulted in a complete boot. The failure cannot be reproduced without letting the laptop sit powered down for many minutes.

One postulate is that this is due to a switching transient on one of the power rails, as devices are first turned on by OFW.

(Can we please get a 1.75-A3 hardware selection added to "Version" ?)

Change History (8)

comment:1 Changed 3 years ago by wad

Checking with a serial port attached, the output looks like:

keypad

xid

Software Interrupt

ok

comment:2 Changed 3 years ago by Quozl

  • Milestone changed from Not Triaged to 1.75-firmware
  • Summary changed from 1.75 A3 fails to boot OFW to XO-1.75 A3 fails to boot OFW
  • Version changed from not specified to 1.75-A3

Triage. Add version as requested. Add milestone for inclusion in http://dev.laptop.org/1.75

comment:3 Changed 3 years ago by wad

One of the A3 motherboards undergoing further testing (#22) showed a clear, reproducible case of #10901: with a serial port attached, it generated an OFW "Data Abort" right after turning on the display. It was running from battery at the time.

As it continued to boot, I let it boot up and ran the full qualification suite (eMMC testing, JPEG decoding, and memory testing) on it for six hours, with no errors!
Except for the problem at cold boot, I would declare this motherboard functional...

comment:4 Changed 3 years ago by wad

Running from battery is not necessary. #22 exhibited the same behavior when powered from an adapter. It only happens for the first couple of boots when completely at room temperature.

comment:5 Changed 3 years ago by wad

Also being tracked as Marvell Simplicity ticket 452826:

https://support.marvell.com/issues/9c0ec56498da7841f6a52e0191768d4a

Hardware information (Vcore traces of both good and failing boots on several boards) is present at: http://dev.laptop.org/~wad/10901/

comment:6 Changed 3 years ago by wad

At this time, I believe this might be two different problems and am debugging it as such. In Cambridge we have two boards which won't boot unless they have elevated Vcore voltages: #5 boots at 1.425V and #32 boots at 1.395V.

We also have three boards which won't boot reliably the first time. After the SoC has been powered for thirty seconds or so, they boot reliably with default Vcore (nominally 1.345V). These are #1 (1.345V), #14 (1.350V), and #22 (1.345V, measured at the SoC).

Cooling the SoC slightly allows this problem to be reliably reproduced. Raising Vcore on these boards makes them boot reliably. #1 boots reliably at 1.425V.

In contrast, a working board (#3) runs with a Vcore range of 1.330V to 1.360V, but doesn't work at 1.380V or 1.40V.

Attempts to read fuse block three from the security processor, to determine the voltage profile of the SoC, return all zeroes for that fuse block, using the same access mechanism which allows us to read fuse blocks 1 and 2. This indicates either that we don't know how to read fuse block three, or that Marvell wasn't correct when they told us these parts were fully tested and voltage profiled.

comment:7 Changed 3 years ago by wad

#22 and #14 boot reliably at 1.415V, even when cold. At 1.40V, they "almost boot", making it into Linux before throwing a kernel panic.
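
For reference, the Vcore figures reported in comments 6 and 7 can be put side by side with a little arithmetic. This is only a sketch: the variable and function names are made up for illustration, the values are the single voltages quoted in the comments (not measured thresholds), and millivolts are used so the subtraction stays exact.

```python
# Vcore observations transcribed from comments 6 and 7, in millivolts.
# All names here are hypothetical; only the numbers come from the ticket.
NOMINAL_MV = 1345  # default Vcore, nominally 1.345V

# Voltage at which each failing board was reported to boot (reliably).
reported_boot_mv = {
    "#5": 1425,
    "#32": 1395,
    "#1": 1425,
    "#14": 1415,
    "#22": 1415,
}

# Voltages at which the working board (#3) was reported not to work.
board3_fail_mv = [1380, 1400]


def margin_over_nominal(mv, nominal=NOMINAL_MV):
    """Extra Vcore, in mV, above the nominal setting."""
    return mv - nominal


margins = {board: margin_over_nominal(mv) for board, mv in reported_boot_mv.items()}

# Boards whose reported boot voltage is at or above the lowest voltage
# at which #3 was reported to fail.
above_board3_limit = sorted(
    board for board, mv in reported_boot_mv.items() if mv >= min(board3_fail_mv)
)

for board, extra in sorted(margins.items(), key=lambda kv: kv[1]):
    print(f"{board}: +{extra} mV over nominal")
```

Every reported boot voltage sits at or above 1.380V, where #3 was reported not to work, so on these numbers a single raised Vcore setting would not suit all boards.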

comment:8 Changed 3 years ago by wad

  • Resolution set to fixed
  • Status changed from new to closed

In the end, all of these problems turned out to be improper initialization of the MMP2 memory controller.

Use of OFW Q4B03 or later should fix the problem.
