Opened 4 years ago

Closed 11 months ago

#10314 closed defect (fixed)

XO-1.5 won't boot

Reported by: wad
Owned by: wad
Priority: high
Milestone: 1.5-hardware-C
Component: hardware
Version: 1.5-C2
Keywords: XO-1.5, boot, solder
Cc: rsmith, wmb@…, gary@…, beckham.chen@…, bert
Blocked By:
Blocking:
Deployments affected:
Action Needed: no action
Verified: no

Description

Several XO-1.5 laptops have stopped booting. The EC boots fine, and the power light turns on when the power button is pressed, but there are no signs of CPU life: 1) the display flickers but never turns on; 2) there is no boot sound; 3) there is no output on the CPU serial port.

The three units I'm currently seeing this on are:

A B2 prototype, serial number SHC93701195, received from Tiago Marques. Report from the user stated:

"My XO-1.5 B2 just entered this state last night. The only difference is that I was using it, writing e-mail, and then it rebooted itself. After that it would no longer boot so I tried to run memtest. It crashed on the second test with a random character on the right side of the screen and then it would no longer boot, the screen just flashes."

A C2 ramp laptop, serial number SHC00500908, received from George Mavrothalassitis. Report from the user stated:

"The XO was OK running OS203. I clean installed os205 with the A/C plugged in. On first boot (AC on) it froze about 4 sec. into the boot process, I _think_ at "...mass storage devices" on the console. Mic or camera lights were _not_ on. On hard reboot (with the power button for 5 sec.) the backlight blinks for a split second but the screen stays dark. The power light coms on but not the processor and wifi lights. No chime either. This remains the case for all consecutive boot-ups."

A C2 ramp laptop, serial number SHC005007AB, used at 1CC for SD card testing. One morning, it decided to stop booting. (This unit was operated without a back panel installed, and may have been damaged through ESD.)

At first glance, these motherboards seem OK. All power supplies are producing the right voltages, and power sequencing to the processor (+1.8V, VCCP, VCORE, along with CPURST and CPUPWRGD) looks OK. The clock to the processor is present and roughly correct (although its exact frequency wasn't checked).

Attachments (7)

mavroth.rom (1.0 MB) - added by wad 4 years ago.
The contents of the SPI Flash ROM on SHC00500908
HDSTB1N_Normal.PNG (7.1 KB) - added by Gary Chiang 4 years ago.
HDSTB1N_fail.PNG (7.4 KB) - added by Gary Chiang 4 years ago.
HD#21_Normal.PNG (6.9 KB) - added by Gary Chiang 4 years ago.
HD#21_fail.PNG (7.0 KB) - added by Gary Chiang 4 years ago.
HD#15_Normal.PNG (6.1 KB) - added by Gary Chiang 4 years ago.
HD#15_fail.PNG (6.5 KB) - added by Gary Chiang 4 years ago.

Change History (38)

comment:1 Changed 4 years ago by wad

The mystery continues. Probing the ADS line on the processor shows that SHC00500908 was trying to boot. Unlike in a normal boot, after an initial brief flurry of accesses no further accesses are made.

The SPI Flash on SHC00500908 was replaced with one containing q3a50. There was no change in behavior. The contents of the SPI Flash that was present are attached to this ticket as mavroth.rom. It should contain q3a39 (os203 and os205 used this), and it does not differ from q3a39 until byte 0x0EFE00 (manufacturing data?).
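
A comparison like the one described above can be reproduced with a short script. This is a minimal sketch, not the tool actually used for this ticket: `diff_ranges` is a hypothetical helper name, and the reference file name `q3a39.rom` is an assumption (only mavroth.rom is attached here).

```python
import sys


def diff_ranges(a: bytes, b: bytes):
    """Return half-open (start, end) byte ranges where two equal-sized images differ."""
    if len(a) != len(b):
        raise ValueError("images differ in size: %d vs %d" % (len(a), len(b)))
    ranges = []
    start = None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            if start is None:
                start = i              # a differing run begins here
        elif start is not None:
            ranges.append((start, i))  # the run ended at the previous byte
            start = None
    if start is not None:
        ranges.append((start, len(a)))  # the run extends to the end of the image
    return ranges


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example: python romdiff.py mavroth.rom q3a39.rom  (second name is hypothetical)
    with open(sys.argv[1], "rb") as f1, open(sys.argv[2], "rb") as f2:
        for start, end in diff_ranges(f1.read(), f2.read()):
            print("0x%06X-0x%06X (%d bytes)" % (start, end - 1, end - start))
```

If the only reported range begins at 0x0EFE00, the firmware code itself matches the reference image and only the per-unit data differs.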

comment:2 Changed 4 years ago by wmb@…

The final sector - after the manufacturing data - in mavroth.rom was intact as well, matching q3a39. So SPI FLASH corruption is doubly ruled out for that unit.

Perhaps connect an EC serial port to see if there is any useful port 80 activity.

comment:3 Changed 4 years ago by martin.langhoff

Got some reports in La Rioja of this. Clears up with a complete power removal.

Behaviour does not match #9803 however -- we do have power LED on.

comment:4 Changed 4 years ago by wad

  • Status changed from new to assigned

No port 80 activity is shown on any of these laptops.

SHC00500908 shows a bus access burst at boot of 5.6 ms.
SHC93701195 shows a bus access burst at boot of 1.2 ms.
SHC005007AB shows a bus access burst at boot of 1.2 ms.

Looks like the CPU is trying to boot but somehow the data is getting garbled...

comment:5 Changed 4 years ago by martin.langhoff

Wad points out -- this bug appears to "brick" the laptop, and is not transient.

The problem experienced in LR is transient (luckily!)

comment:6 Changed 4 years ago by kevix

I received an XO C2 ramp unit (SN# SHC00500826) from holt on 13 JUN 2010. The unit last worked on 21 AUG 2010. On 22 AUG 2010, the unit tried to boot, then rebooted, and was then bricked. The symptoms (with the battery removed and only the power adapter connected) are:
no LEDs at first, then the power LED blinks on and off green, then the battery LED blinks on and off red. When the power button is pressed, the screen backlight flashes for about a tenth of a second as though it were going to light up, then does not. The power LED turns on and stays on. The mic light, the wifi LEDs, and the battery LED are off. No chime sounds. I chatted with cjb, holt and quozl and this seems to be the same issue. I was told to prepare the unit to be shipped to 1CC, attn richard. The unit had the microSD removed, rsynced and put back. I can also report that the microSD worked in a C2 unit with no issues.

comment:7 Changed 4 years ago by wad

Just adding that I've confirmed that this problem is the cause of failure of both SHC00500826 (kevix) and SHC9370111D (mikus) by scoping out signals on the boards.

comment:8 Changed 4 years ago by wmb@…

Any chance of decoding the last access in the burst to determine its address?

One thing to consider: Prior to fetching instructions, the Via chipset first reads a 4-byte "SIP pointer" from ROM offset 0xfff80, then a 32-byte "SIP" data structure from ROM offset 0xfffd0. It contains configuration values that are used to initialize the GTLPHYs in the host bus driver block.

Sometime after that, the CPU reads 5 bytes of instruction from ROM offset 0xffff0, then jumps to offset 0xffc30 to begin the early startup sequence in earnest. The first thing that happens at offset 0xffc30 is a write of 0x01 to port 80.
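
The fetch sequence above can be written down as a small table for cross-checking against bus captures. This is a sketch under stated assumptions: the ROM is taken to be 1 MiB (the size of the attached mavroth.rom) and to be decoded at the top of the 4 GiB address space, which is what places ROM offset 0xffff0 at the standard x86 reset vector 0xFFFFFFF0. The startup entry is taken as the five-digit offset 0xffc30, since a six-digit offset would not fit in a 1 MiB ROM. The names `BOOT_FETCHES` and `physical_address` are hypothetical.

```python
ROM_SIZE = 0x100000                  # 1 MiB, matching the 1.0 MB mavroth.rom attachment
ROM_BASE = 0x100000000 - ROM_SIZE    # ASSUMPTION: ROM decoded at the top of 4 GiB

# (agent, purpose, ROM offset, length in bytes or None if not stated in the ticket)
BOOT_FETCHES = [
    ("VX855", "SIP pointer", 0xFFF80, 4),
    ("VX855", "SIP data structure", 0xFFFD0, 32),
    ("CPU", "first instruction fetch", 0xFFFF0, 5),
    ("CPU", "early startup entry (writes 0x01 to port 80)", 0xFFC30, None),
]


def physical_address(offset: int) -> int:
    """Map a ROM offset to the physical address the chipset or CPU would fetch from."""
    if not 0 <= offset < ROM_SIZE:
        raise ValueError("offset 0x%X is outside the ROM" % offset)
    return ROM_BASE + offset


if __name__ == "__main__":
    for agent, purpose, offset, length in BOOT_FETCHES:
        size = "%d bytes" % length if length is not None else "length not stated"
        print("%-5s 0x%05X -> 0x%08X  %s (%s)"
              % (agent, offset, physical_address(offset), purpose, size))
```

Note that the 32-byte SIP structure at 0xfffd0 ends exactly where the CPU's first fetch at 0xffff0 begins.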

comment:9 Changed 4 years ago by wmb@…

Just to emphasize a subtle point about the sequence described in the previous comment:

The SIP access is done by the VX855 chip directly, not by the CPU. The CPU doesn't get into the act until the 0xffff0 access.

comment:10 Changed 4 years ago by wad

  • Cc beckham.chen@… added

There is additional timing information gleaned from SHC00500826 tonight. The VX855 appears to be making the LPC requests properly for the SIP data, followed by initial boot code. It then quickly stops requesting data.

Scope traces describing this are available at:
http://dev.laptop.org/~wad/xo1.5_boot/10314.html

I will hook up the DLA to the LPC bus tomorrow, but have no way of easily accessing the FSB due to its internal routing and a lack of test pads.

comment:11 Changed 4 years ago by wmb@…

I made a rough estimate of where the code might be by counting LFRAME assertions. (actually I used calipers; I didn't really count them all). There appear to be about 260 LFRAMEs in the CPU group. If my analysis is correct, that puts the death point right in the middle of a sequence of "write MSR" instructions that are clearing Memory Type Range Registers. I don't see anything particularly interesting about that area of the code. I can't think of why that particular sequence would be prone to failure.

Perhaps my analysis is flawed. It will be interesting to see the fetch addresses on the LPC bus.

comment:12 Changed 4 years ago by Gary Chiang

After further investigation of SHC00500908, I found that the "HDSTB1N" pin may have a solder crack on the VX855 side (HDSTB1N connects the CPU and the VX855).
Please see the attached files. "HDSTB1N_Normal.PNG" was measured on a normal board; there the low level after the falling edge only reaches about 400 mV.
In "HDSTB1N_fail.PNG" (measured on SHC00500908) the low level reaches almost 0 V.
This pin has an internal 55 ohm pull-up resistor on both the CPU side and the VX855 side, so this waveform indicates that one of the pull-up resistors is somehow disconnected.
I suspect the solder crack is on the VX855 side, because the CPU does fetch code at first; if the crack were on the CPU side, I should never have had the chance to measure this waveform.
I will pass this board to VIA tomorrow; VIA can help us confirm the root cause.
I am still analyzing SHC005007AB and will update if I find anything.
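
The reasoning about the pull-ups can be illustrated with a simple DC voltage-divider model. This is only a sketch under assumptions: the termination voltage `vtt` and the driver on-resistance `r_on` are not given in this ticket, so the value 1.0 V below is a placeholder, and the function names are hypothetical. Only the 55 ohm pull-up value and the roughly 400 mV normal low level come from the measurements above.

```python
def low_level_volts(vtt, r_on, r_pullup=55.0, terminators=2):
    """DC low level of a line pulled up to vtt by parallel pull-ups and driven low through r_on."""
    if terminators < 1:
        raise ValueError("at least one pull-up must be connected")
    r_parallel = r_pullup / terminators      # equal pull-ups in parallel
    return vtt * r_on / (r_on + r_parallel)  # plain voltage divider


def r_on_from_measurement(vtt, v_low, r_pullup=55.0, terminators=2):
    """Invert the divider: driver resistance implied by a measured low level."""
    r_parallel = r_pullup / terminators
    return v_low * r_parallel / (vtt - v_low)


if __name__ == "__main__":
    VTT = 1.0  # ASSUMPTION: placeholder termination voltage, not taken from the ticket
    r_on = r_on_from_measurement(VTT, 0.400)          # fit to the normal board (400 mV)
    both = low_level_volts(VTT, r_on, terminators=2)  # both pull-ups connected
    one = low_level_volts(VTT, r_on, terminators=1)   # one pull-up disconnected
    print("r_on=%.1f ohm, low level: both=%.3f V, one=%.3f V" % (r_on, both, one))
```

With one pull-up disconnected the modeled low level drops, which matches the direction of the change in the failing waveform. A DC model like this does not account for reflections on a line that has lost its termination, so it should not be expected to reproduce the measured level exactly.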

comment:13 Changed 4 years ago by Gary Chiang

After further analysis, I found that SHC005007AB has a similar problem.
The pin HD#21 may have a solder crack on this board.
This pin has an internal pull-up resistor (55 ohm) on both the CPU and the VX855.
Please see the attached files for the failed (HD#21_fail.PNG) and normal (HD#21_Normal.PNG) waveforms.
It looks like one of the pull-up resistors is somehow disconnected (the same as on SHC00500908).
However, I don't know which side (CPU or VX855) has the solder crack, because this signal is routed only on an inner layer.
I will pass this board to VIA tomorrow; they can reflow or remount the BGA to find the actual root cause.

BTW, about SHC00500908 and the cold solder joint on the "HDSTB1N" pin:
I cut the "HDSTB1N" trace on a GOOD board. The board then fails to boot, and the failure symptoms are the same as in #10314.
After measuring the waveforms on both sides (CPU and VX855) in this test case,
I am now fairly sure the cold solder crack occurred on the CPU side rather than the VX855 side.

comment:14 Changed 4 years ago by Gary Chiang

I received SHC02000296 today; the unit came from Martin.
This unit also has a solder crack on an FSB (HD#15) BGA ball; please refer to the attached waveforms.
"HD#15_Normal.PNG" is the normal waveform, and "HD#15_fail.PNG" was measured on SHC02000296.
The result is consistent with SHC00500908 and SHC005007AB.

comment:15 Changed 4 years ago by wad

  • Action Needed changed from diagnose to test in release
  • Keywords solder added

Two boards were given to Via (maker of the CPU) for failure analysis. They ground down the motherboard to access traces on the inner layers and also the top of the CPU interposer (between the chip and the MB) to access the signals, verified that there was no connection and that the I/O input on the CPU was undamaged. They then pulled off the CPUs, and they tested fine.

This indicates that the root cause of this problem is likely to be cold soldering (or possibly lead-free solder ball cracking?). Quanta will attempt to reheat other returned motherboards to see if it repairs the problem. It is likely that the reflow profile for the motherboard soldering will be modified to minimize this problem.

On XO-1, cracking of solder balls underneath the processor has been a persistent problem reported by repair centers. In those cases, it is believed that mechanical flexing of the motherboard due to pressure on the back of the case was a contributing factor. In XO-1.5, due to the much smaller processor area, we did not expect to see this problem.

comment:16 Changed 4 years ago by Quozl

Identical symptoms in the past few weeks on unit SHC005007AD (tagged C3 MODS ACIN ECO SOLAR QUOZL) returned by a test child. Wad: what to do? Send it back?

comment:17 Changed 4 years ago by Quozl

Ahmed Mansour reports unit SHC9370109E is also experiencing this problem.

comment:18 Changed 4 years ago by Quozl

Dr. Gerald Ardito: SHC02800180 and SHC02800145. Reuben handling.

comment:19 Changed 4 years ago by Quozl

Cherry Withers' machine SHC937011C5 reported affected on #olpc-help; main battery and DC power were also removed to verify. Cherry will contact the Contributors Programme.

comment:20 Changed 4 years ago by greenfeld

SHC020002B0 appears to have this issue; no display output except for an initial flash, no CPU serial data, EC power button and charging behavior seem normal (at least according to the LEDs).

comment:21 Changed 4 years ago by Quozl

SHC005008AC returned by a test child with same symptoms.

comment:22 Changed 4 years ago by bert

  • Cc bert added

My "B-Test" machine (SHC93701133) stopped working with these symptoms. Was working just fine the day before. IIRC then it froze on boot (first dot) so I had to power cycle (long press on power button). Then it would turn on the backlight but not display anything. Now only the Power LED comes on, nothing else. Removed power for half an hour but no change.

comment:23 Changed 4 years ago by dsd

  • Milestone changed from Not Triaged to 1.5-hardware-C

comment:24 Changed 3 years ago by Quozl

SHC93701699 B2 RIP

comment:25 Changed 3 years ago by wmb@…

I just lost a couple of rev Gs. One with the exact symptom described in the initial report, the other won't even turn on the green power light.

comment:26 Changed 3 years ago by Quozl

SHC93701158

comment:27 Changed 2 years ago by carrott

SHC93701143 B2 & SHC005008D1 C2

comment:28 Changed 2 years ago by mavrothal

SHC0160134E had the same problem.
It was resurrected with an oven bake (http://lists.laptop.org/pipermail/devel/2012-October/036177.html).

comment:29 Changed 18 months ago by mavrothal

My oven-baked XO-1.5 board died again after 8 months and a guesstimated 400+ hours of use.

comment:30 Changed 12 months ago by Quozl

SHC016012CC (2010-07-06) Caryl.

comment:31 Changed 11 months ago by Quozl

  • Action Needed changed from test in release to no action
  • Resolution set to fixed
  • Status changed from assigned to closed

Some units may be misdiagnosed with #10314. One such unit, SHC13100C17 (.pk), was found to have an active serial port, with the firmware failing to start up due to a DCON I2C NAK, which points at U19, U17, R201, R202, DCONSMBDATA and DCONSMBCLK.
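
To reduce misdiagnosis, the symptoms recorded in this ticket can be collected into a simple checklist. This is a rough sketch, not an official diagnostic: the field and function names are hypothetical, and the criteria are only those mentioned in this ticket (power LED on, no boot sound, no CPU serial output, no port 80 activity, and no recovery after complete power removal).

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What was observed on a unit that fails to boot (hypothetical field names)."""
    power_led_on: bool
    boot_sound: bool
    cpu_serial_output: bool
    port80_activity: bool
    recovers_after_power_removal: bool


def matches_10314(obs: Observation) -> bool:
    """True if the observations are consistent with the symptoms described in #10314."""
    return (
        obs.power_led_on
        and not obs.boot_sound
        and not obs.cpu_serial_output             # an active serial port points elsewhere
        and not obs.port80_activity
        and not obs.recovers_after_power_removal  # the failure is not transient
    )


if __name__ == "__main__":
    # A unit like SHC13100C17: serial port active, so not this failure.
    unit = Observation(power_led_on=True, boot_sound=False, cpu_serial_output=True,
                       port80_activity=False, recovers_after_power_removal=False)
    print("consistent with #10314:", matches_10314(unit))
```

A match only means the symptoms are consistent with this ticket; confirming a cracked solder joint still requires probing the board as described in the earlier comments.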

However, as XO-1.5 is not in production we can close this ticket now.