#12525 closed defect (fixed)

XO-4 EC communication problem

Reported by: wad Owned by: rsmith
Priority: normal Milestone: 4-firmware
Component: embedded controller Version: Development source as of this date
Keywords: XO-4, EC Cc: dsd, shep
Blocked By: Blocking:
Deployments affected: Action Needed: no action
Verified: no

Description

On an XO-4 B1 (SKU293) laptop, running Q7B14 and EC 3.10, I saw an unrecoverable EC communication failure. The laptop was running runin (build 27) at the time.

The console log is attached.

I will note that this error happened while I was working near the laptop, and might have been due to an ESD event (not that I heard or felt any sparks, but...)

Attachments (6)

os27_1.2.log (90.8 KB) - added by wad 21 months ago.
os27_1.log (90.8 KB) - added by wad 21 months ago.
ec_error.log (33.7 KB) - added by wad 20 months ago.
Chia-Hsiu reports that Quanta QA has duplicated this problem without using suspend/resume. This is the serial console from that failure.
12525_ec.log (5.9 KB) - added by wad 20 months ago.
EC log from a number of failing runs using menu's switch button to trigger the bug
12525_ho.log (123.6 KB) - added by wad 20 months ago.
Console log from a number of failing runs (coincident with 12525_ec.log)
fail_w_S7.log (25.1 KB) - added by rsmith 20 months ago.
Comm fail with EC S7 debugging enabled.

Download all attachments as: .zip

Change History (19)

Changed 21 months ago by wad

comment:1 Changed 21 months ago by dsd

  • Cc dsd added
  • Milestone changed from Not Triaged to 4-firmware

Should this be treated as a production blocker?

comment:2 Changed 21 months ago by pgf

shep sent mail with a similar report:

I've been trying wake-on-lan on this machine, and then just a few
minutes ago I was doing a bunch of

        sudo ping -c 100 -i 0.01 10.0.0.152

at this laptop, and now it is printing this stuff out on the serial
console:



[ 5182.122694] olpc-ec-1.75: EC cmd error: timeout in STATE 2
[ 5182.122717] power_supply olpc-battery: driver failed to report `capacity' pr
operty: 4294967186
[ 5182.137841] olpc-ec-1.75: transmit FIFO not empty, resetting
[ 5186.132785] olpc-ec-1.75: EC cmd error: timeout in STATE 2
[ 5186.138247] power_supply olpc-battery: driver failed to report `status' prop
erty: 4294967186
[ 5186.146842] olpc-ec-1.75: transmit FIFO not empty, resetting
[ 5190.142686] olpc-ec-1.75: EC cmd error: timeout in STATE 2
[ 5190.148204] power_supply olpc-battery: driver failed to report `status' prop
erty: 4294967186
[ 5190.156609] olpc-ec-1.75: transmit FIFO not empty, resetting
[ 5194.142692] olpc-ec-1.75: EC cmd error: timeout in STATE 2
[ 5194.142704] power_supply olpc-battery: driver failed to report `status' prop
erty: 4294967186
[ 5194.159969] olpc-ec-1.75: transmit FIFO not empty, resetting
[ 5198.152689] olpc-ec-1.75: EC cmd error: timeout in STATE 2
[ 5198.152701] power_supply olpc-battery: driver failed to report `status' prop
erty: 4294967186
.....
On the EC console (I think from before this started happening, and a
ways into it):

286540422:suspend_hint --> 0
event mask was 0xffff, is now 0xffff
event mask was 0xffff, is now 0xffff
286638657:suspend_hint --> 1
suspend sync: 00000007
286668534:soc_sleep: fell
286678683:suspend_hint --> 0
event mask was 0xffff, is now 0xffff
288339035:EEPROM write All
event mask was 0xffff, is now 0xffff
288493058:suspend_hint --> 1
suspend sync: 00000008
288496564:soc_sleep: fell
288496797:suspend_hint --> 0
event mask was 0xffff, is now 0xffff
event mask was 0xffff, is now 0xffff
291189311:suspend_hint --> 1
suspend sync: 00000009
291192509:soc_sleep: fell
291192722:suspend_hint --> 0
event mask was 0xffff, is now 0xffff
291368016:SDI switchwait noack
291372001:SDI switchwait noack
291375984:SDI switchwait noack
291379961:SDI switchwait noack
291383948:SDI switchwait noack
291387932:SDI switchwait noack
...

comment:3 Changed 21 months ago by pgf

  • Cc shep added

if/when this happens again:

  • see if the olpc-ec-1.75 line (20:) in /proc/interrupts is incrementing
  • if not, use sdkit to examine SSP interface
  • if we have EC serial connected, we should turn on S7 and make sure we know what's happening.

comment:4 Changed 21 months ago by pgf

just happened to me:

/proc/interrupts shows 1 increment per repetition of

[  724.081141] olpc-ec-1.75: EC cmd error: timeout in STATE 2
[  724.086793] power_supply olpc-battery: driver failed to report `present' property: 4294967186
[  724.095274] olpc-ec-1.75: transmit FIFO not empty, resetting

on the EC, with S67, we get this:

>1782851:SDI switchwait noack
1782853:SDI=3, got_ack (0,213), CMD 0

[ 1 second pause, nothing happens on kernel side]

>1783829:Cmd
1783830:CMD Switch
sent 0001 got 7000
1783834:SDI=5, got_ack (0,213), CMD 1
1783838:nocmd

[when these 5 lines are printed, kernel prints its 3 lines]

[ 3 second pause, repeat ]

comment:5 Changed 21 months ago by pgf

during this, dump_xdata_sfr reports:

X

GPIOFS  fc00: 00 00 40 00 00 00 80 00  00 00 00 00 c7 00 00 00
GPIOOE  fc10: 00 b0 6d 00 0f 40 01 08  30 30 10 1c 00 08 00 00
GPIOD   fc20: 00 a0 28 00 00 40 01 08  30 20 00 15 00 00 00 00
GPIOIN  fc30: df f7 ff ff f0 7f 1f ff  ff ff ff fe df ff 07 fe
GPIOPU  fc40: 00 00 00 00 0f 00 ff 00  00 00 00 00 20 00 00 00
GPIOOD  fc50: 00 00 00 00 00 40 00 00  30 00 00 00 00 00 00 00
GPIOIE  fc60: 20 cc 90 00 0f 80 ff 30  30 cc 00 03 20 00 00 01
GPIOMSC fc70: 64 00 00 00 00 00 00
KBC     fc80: 40 00 00 00 00 00 10 00  00 08 00 00 00 00 00 00
ESB     fc90: 00 00 00 00 00 00 00 00  00 00
IKB     fca0: 00 00 00 00 00 00 00 00  ff 00 f7 ff ff 55 00 04
PECI    fcd0: 00 00 01 00 00 00 00 00  00 00 00 00 00 00 00 00
OWM     fcf0: 00 00 00 00 00 40 40 2d  0a 50 2d 03 14
PWM     fe00: 11 00 fe 00 fe 00 c0 c0  00 00 3f fe 00 00 3f fe
FAN     fe20: 30 00 0f ff 00 00 00 00  00 00 0f fe 0f 00 0f ff
FAN     fe30: 00 00 0f ff 00 00 00 00  0f ff 00 00 0f 00 0f ff
GPT     fe50: 08 04 00 00 00 00 0c cd  00 1f
SDI     fe70: 1f 00 00 00 00 00 00 00  80 00 01 01 00 00 5a 00
WDT     fe80: 80 00 00 00 00 00 00 00  00 00 31 00 0f
XBI     fea0: 00 00 00 0f 00 05 04 00  83 00 00 17 90 08 00 00
XBI     feb0: 00 00 00 00 00 00 a5 96  2f 13
CIR     fec0: 00 00 c0 00 00 00 00 00  00 00 00 44 00 00 00 00
GWG     fed0: 00 00 00 00 00 00
PS2     fee0: 00 00 00 00 00 ef 0e
EC      ff00: c3 00 0f 90 00 00 00 00  00 c0 22 00 2f 1c 00 41
ADCDAC  ff10: 00 00 00 00 00 07 3e 83  35 7f 80 00 00 00 90 10
ADCDAC  ff20: 80 00 80 00 04 05 00 03  00 cc 01 5d 5d
EC      ff1d: 00 90 10 80 00 80 00 04  05 00 03 00 cc 01 5d 5d
GPWUEN  ff30: 00 08 00 00 00 00 00 00  00 44 00 00 00 20 00
GPWUPF  ff40: 20 84 90 00 0f 80 f1 00  30 cc 00 03 01 20 00
GPWUPS  ff50: 00 08 00 00 00 00 00 00  00 00 00 00 00 00 00
GPWUEL  ff60: 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00
GPWUCHG ff70: 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00
SMB0    ff90: 00 00 00 30 06 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
SMB1    ffd0: 00 00 00 30 06 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  02 7a fc 02 87 0d 7e 3f  7f 65 22 02 87 13 7e 49

the ACK bit is bit 5 of the PF[0xd] or EC[0xd] register. it's pending, and enabled, as i read it.
but we're clearly not getting acks, because the debug count isn't going up.

but i just probed it, and we're definitely getting 1us 3.2V positive pulses on ACK. so linux is able to strobe it.

the dump_8051_sfr() output is:

>I
TCON: 0x00, TMOD: 0x10, TL0: 0x00, TL1: 0x00, SCON: 0x50, P2: 0xf4,
IE: 0x98, SBUF: 0x0d, SCON2: 0x01, PCON2: 0x30, SCON3: 0x21, TH0: 0x00, TH1: 0x00,
SCON4: 0x00, PCON: 0x00, ACC: 0x00,
P0IE: 0x00, P1IE: 0x80, P3IE: 0xa2,
P0IF: 0x00, P1IF: 0x00, P3IF: 0x00,

the dump_wakes() output is:

5844373:Wakes
  WU[1] = EN:08 PF:84
  WU[9] = EN:44 PF:cc     # the CC bits are TD,TC,KC,KD  (T=tpad, K=kbd)
  WU[d] = EN:20 PF:20     # this is the ACK interrupt bit

i think this means there are pending and enabled tpadclk and kbddat interrupts. since these are the same priority as ACK, but they may take precedence since they're lower bits?
answer: no. writing 0 to ff39, which leaves:

6908402:Wakes
  WU[1] = EN:08 PF:84
  WU[d] = EN:20 PF:20

leaves the missing ACK symptom unchanged.

writing 0x20 to PF[0xd] to clear the pending ACK flag ("w ff4d 20") did clear the bit, but the code went into an infinite spin -- i've lost the scrollback, it was so fast. writing 0 to EN[0xd] ("w ff3d 0") got things under control, and writing it back to 0x20 ("w ff3d 20") made things work again.

so for some reason we weren't servicing the ACK interrupt. the P3IE value looks okay. i'm at a loss.

comment:6 Changed 21 months ago by pgf

more data: through all that, i was running on external power, with no battery installed. so stop mode should never have been an issue.

comment:7 Changed 21 months ago by pgf

smithbone points out similarity to EC commit: 1c0ef4e6

comment:8 Changed 21 months ago by greenfeld

An EC transmit FIFO not empty loop was seen while using a XO-4 with 13.1.0 os29, Q7B15pg, idle suspend/resume disabled, battery inserted (running on battery power).

comment:9 Changed 20 months ago by greenfeld

Seen on a SKU295 system doing DEVL runin with os30/Q7B15/EC 0.3.11. The EC pattern was slightly different from comment 2's though:

424007016:suspend_hint --> 0
event mask was 0xffff, is now 0xffff
thw time 10000
thw enabled
424020863:thw 10000
event mask was 0xffff, is now 0xffff
424021930:suspend_hint --> 1
suspend sync: 00000b9a
424027619:soc_sleep: fell
424027828:suspend_hint --> 0
event mask was 0xffff, is now 0xffff
424030866:thw fire
424039617:SDI wait timeout
424042620:SDI switchwait noack
424045623:SDI switchwait noack
424048626:SDI switchwait noack
424051657:SDI switchwait noack

Changed 20 months ago by wad

Chia-Hsiu reports that Quanta QA has duplicated this problem without using suspend/resume. This is the serial console from that failure.

Changed 20 months ago by wad

EC log from a number of failing runs using menu's switch button to trigger the bug

Changed 20 months ago by wad

Console log from a number of failing runs (coincident with 12525_ec.log)

comment:10 Changed 20 months ago by wad

All runs except the last were: boot laptop into OFW, run menu, select the "switches" button using the touchscreen/touchpad. When prompted to close the laptop to test the switch, I instead press ESC, then press ESC again. Then I press the "stop (X)" button in menu, then type "boot".

The last run, I typed "watch-battery" (which worked) and let it update a few times before typing "boot".

Changed 20 months ago by rsmith

Comm fail with EC S7 debugging enabled.

comment:11 Changed 20 months ago by wmb@…

I have reproduced the problem. Data point: It will also repro if you let the switch test succeed, i.e. close the lid to activate the switches. So aborting from the test with ESC is not a necessary condition.

comment:12 Changed 20 months ago by pgf

shortest reproducer:

  ok " mouse" open-dev close-dev s3-selftest
  ok .bat many

< now use the touchpad >

it seems the EC can get confused (and refuse to run interrupt handlers) if it gets simultaneous interrupts under some conditions. at least, that's the current theory, and avoiding get touchpad/keyboard interrupts along with ACK interrupts seems to prevent the current bug.

workedaround in EC commit 9def2826f.

comment:13 Changed 20 months ago by pgf

the reproducer technique above is incorrect. it should be:

 ok " mouse" open-dev s3-selftest
  ok .bat many

< now use the touchpad >

comment:14 Changed 20 months ago by Quozl

  • Action Needed changed from diagnose to no action
  • Resolution set to fixed
  • Status changed from new to closed

Fixed in EC 0.4.02 which is in Q7B22. Tested using simplified reproducer. Please re-open if alternate reproducers reproduce the symptom.

Note: See TracTickets for help on using tickets.