Ticket #1835 (closed defect: fixed)

Opened 7 years ago

Last modified 4 years ago

Unable to resume via power button nor wireless.

Reported by: jcardona Owned by: wad
Priority: blocker Milestone: 8.2.0 (was Update.2)
Component: kernel Version:
Keywords: power Cc: jg, rsmith, dsaxena, gregorio
Action Needed: never set Verified: no
Deployments affected: Blocked By:
Blocking:

Description (last modified by jg)

While working on #1755 I saw a failure mode where the XO would not resume by any means (neither wireless nor the power button). The only workaround was to reboot.

Attachments

suspend_resume.xls (18.5 kB) - added by gary 7 years ago.

Change History

  Changed 7 years ago by jg

  • description modified
  • component changed from distro to kernel
  • priority changed from normal to blocker
  • owner changed from jg to dilinger
  • milestone changed from Untriaged to Trial-2
  • keywords power added

Without a test case, this one will bite the dust eventually. However, while we're working on suspend/resume, this may remind us that such a problem is occurring.

  Changed 7 years ago by cjb

I attached a serial log of this failure mode to #1752.

  Changed 7 years ago by jg

  • owner changed from dilinger to jcardona

Javier, can you still reproduce this with current builds? If so, which one?

  Changed 7 years ago by jcardona

  • status changed from new to closed
  • resolution set to fixed

Could not reproduce on:

OFW: Q2C18
wireless fw: 5.110.16p1
build: 530

  Changed 7 years ago by cjb

  • status changed from closed to reopened
  • resolution deleted
  • milestone changed from Trial-2 to Trial-3

Hmph, I just hit this while doing suspend/resume tests. Power button/game keys/wireless aren't resuming me, and nothing past "olpc_do_sleep!" on serial.

  Changed 7 years ago by jcardona

Same here, after 375 suspend/resume iterations. With the debugger I could see that the wireless firmware is in a valid state, periodically resending the wake-up signal to the host.

follow-up: ↓ 8   Changed 7 years ago by cjb

  • owner changed from jcardona to cjb
  • status changed from reopened to new

Assigning to cjb to give an exhibiting laptop to rsmith.

in reply to: ↑ 7   Changed 7 years ago by cjb

Replying to cjb:

Assigning to cjb to give an exhibiting laptop to rsmith.

Looks like we have one now. (The machine on the USB analyzer.)

follow-up: ↓ 10   Changed 7 years ago by jcardona

Mitch Bradley has prepared a debug Open Firmware image that includes support for suspend/resume and uses the debug wireless firmware build usb8388-wake-up-host-after-programmable-delay.bin described here.

With that open firmware and this recipe:

sci-wakeup                                                                             
wifi <SSID>
select /wlan                                                                           
patch exit link-up? close                                                              
unicast-wakeup                                                                         
unselect                                                                               
0                                                                                      
select /wlan autostart sleep unselect 5 ms s 1+ dup . 200 ms many                      

...it is easy to reproduce this problem (on some machines?).

The test should do infinite suspend/resume cycles, but does not. After a few iterations it fails to resume:

ok select /wlan autostart sleep unselect 5 ms s 1+ dup . 200 ms many                      
+r1 +r2 +r3 +r4 +r5 +r6 +r7 +r8 +r9 +ra 

Furthermore, the wireless firmware shows that even in the successful resumes, the host seems to ignore some SCI events. Below is a debug log from the wireless firmware: the first time the wake-up GPIO is asserted, 'wakey' is printed. If the host does not confirm that it's awake within 2 seconds, the wireless firmware re-asserts the GPIO line (log message 'wakeup retry').

0xC00085F2 : wakey!
0xC00085F2 : wakey!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00085F2 : wakey!
0xC00085F2 : wakey!
0xC00084A8 : wakeup retry!
0xC00085F2 : wakey!
0xC00085F2 : wakey!
0xC00084A8 : wakeup retry!
0xC00085F2 : wakey!
0xC00085F2 : wakey!
0xC00084A8 : wakeup retry!
0xC00085F2 : wakey!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
0xC00084A8 : wakeup retry!
(forever)
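
For anyone comparing captures like the one above, here is a minimal Python sketch that counts how many 'wakeup retry' messages follow each 'wakey'. It is not part of the firmware or of Mitch's test code. The function names and the regular expression are made up for illustration, and the line format is assumed to match the log above exactly. The 2-second figure comes from the description of the retry behaviour.

```python
import re

# Assumed line format, taken from the log above:
#   "0xC00085F2 : wakey!"  or  "0xC00084A8 : wakeup retry!"
EVENT_RE = re.compile(r"^0x[0-9A-Fa-f]+ : (wakey|wakeup retry)!$")

# The firmware re-asserts the GPIO if the host has not confirmed within 2 seconds.
RETRY_INTERVAL_S = 2


def retries_per_wakeup(lines):
    """Return one retry count per 'wakey' message, in log order."""
    counts = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match is None:
            continue  # ignore anything that is not a firmware debug message
        if match.group(1) == "wakey":
            counts.append(0)  # a new wake-up attempt starts here
        elif counts:
            counts[-1] += 1  # a retry belongs to the most recent 'wakey'
    return counts


def longest_stall_seconds(lines):
    """Rough estimate of the longest period the host ignored the wake-up line."""
    return max(retries_per_wakeup(lines), default=0) * RETRY_INTERVAL_S
```

Run on the log above, the second 'wakey' is followed by 20 retries, which at one retry every 2 seconds would be on the order of 40 seconds of ignored wake-ups. A final entry whose count keeps growing is the hang itself.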

I have seen this on a B3 and a B4. After the failure, the B3 had the mic light on; the B4 did not.

in reply to: ↑ 9   Changed 7 years ago by jcardona

Replying to jcardona:

Mitch Bradley has prepared a debug Open Firmware image that includes support for suspend/resume and uses the debug wireless firmware build usb8388-wake-up-host-after-programmable-delay.bin described here.

That ofw image is available here: http://dev.laptop.org/~wmb/q2c26b.rom

  Changed 7 years ago by wad

The latest firmware addressing this problem is at: http://dev.laptop.org/~wmb/q2c26d.rom

The usage is: wifi <wifi SSID> wackup

When it stops printing, the machine has crashed. Relevant factors to report are: whether the LED is lit, the serial number of the board, and the number of passes.

  Changed 7 years ago by wad

Make that:

wifi <wifi SSID>

wackup

  Changed 7 years ago by jcardona

Some test runs with http://dev.laptop.org/~wmb/q2c26d.rom.

S/N Iterations LED
SHF72000058 (B3-2) 11 on
SHF72000058 (B3-2) 1 on
SHF72000058 (B3-2) 1 on
SHF72000058 (B3-2) 4 on
SHF72000058 (B3-2) 3 on
SHF725004A1 (B4-14) 166 on
SHF725004A1 (B4-14) 287 on

  Changed 7 years ago by jcardona

S/N Iterations LED
SHF725004A1 (B4-14) 514 on

  Changed 7 years ago by wad

  • owner changed from cjb to wad
  • status changed from new to assigned

Today we spent our time characterizing the problem across builds. Here are my results (+ indicates that the run was manually terminated; there was no crash):

<table class="wiki"> <tr><td> <strong>S/N</strong> </td><td> <strong>Iterations</strong> </td><td> <strong>LED</strong> </td></tr> <tr><td> SHF72000072 (B3) </td><td> 1/3/9/1 </td><td> on/on/off/on </td></tr> <tr><td> SHF7200002C (B3) </td><td> 5/7/12/2/6</td><td> on/on/on/on </td></tr> <tr><td> SHF7250015F (B4) </td><td> 12/16/68/90/26/138/48+/160+/300+</td><td> on and off and one power-off on crash </td></tr> </table>

In a different class were these machines, which just kept on resuming:

<table class="wiki"> <tr><td> <strong>S/N</strong> </td><td> <strong>Iterations</strong> </td><td> <strong>LED</strong> </td></tr> <tr><td> SHF733000D1 (C1-4) </td><td> 768+ </td><td> no crash </td></tr> <tr><td> SHF733000CD (C1-4) </td><td> 768+ </td><td> no crash </td></tr> <tr><td> 2 laptops (C1-1) </td><td> 256+ </td><td> no crash </td></tr> <tr><td> 2 laptops (C1-2) </td><td> 4096+ </td><td> no crash </td></tr> <tr><td> 2 laptops (C1-3) </td><td> 4096+ </td><td> no crash </td></tr> <tr><td> 5 laptops (C1-4) </td><td> 256+ </td><td> no crash </td></tr> <tr><td> 1 laptop (C1-5) </td><td> 4096+ </td><td> no crash </td></tr> <tr><td> 2 laptops (C1-6) </td><td> 4096+ </td><td> no crash </td></tr> </table>

And there were the oddities: <table class="wiki"> <tr><td> <strong>S/N</strong> </td><td> <strong>Iterations</strong> </td><td> <strong>LED</strong> </td></tr> <tr><td> ? (C1-5) </td><td> 1500 </td><td> power-off on crash </td></tr> <tr><td> SHF733000DE (C1-4) </td><td> 1/1/1/18/68/48+/150+/800+/800+</td><td>on and off and refusal to reboot until EC reset</td></tr> </table>

I took a look at the LPC_CLK (on a B3, SHF72000072) using a spectrum analyzer, and noticed no difference when the board was failing. I disabled spread spectrum clocking on the motherboard (same B3) and there was no difference in the occurrences of failure.

Mitch's test code is definitely triggering both this bug and bug #1752.

  Changed 7 years ago by wad

Bugger all, evidently WikiFormatting doesn't include raw HTML, which wikis do!

Here are those tables again, in a readable format. First the failures:

S/N Iterations LED
SHF72000072 (B3) 1/3/9/1 on/on/off/on
SHF7200002C (B3) 5/7/12/2/6 on/on/on/on
SHF7250015F (B4) 12/16/68/90/26/138/48+/160+/300+ on and off and one power-off on crash

In a different class were these machines, which just kept on resuming:

S/N Iterations LED
SHF733000D1 (C1-4) 768+ no crash
SHF733000CD (C1-4) 768+ no crash
2 laptops (C1-1) 4096+ no crash
2 laptops (C1-2) 4096+ no crash
2 laptops (C1-3) 4096+ no crash
5 laptops (C1-4) 256+ no crash
1 laptop (C1-5) 4096+ no crash
2 laptops (C1-6) 4096+ no crash

And there were the oddities:

S/N Iterations LED
? (C1-5) 1500 power-off on crash
SHF733000DE (C1-4) 1/1/1/18/68/48+/150+/800+/800+ on and off and refusal to reboot until EC reset
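
If anyone wants to tabulate these runs, the Iterations column can be split mechanically using the "+" convention explained earlier (a trailing + means the run was stopped by hand, not by a crash). The following is only a sketch in Python; the function names are invented for illustration and it assumes every token is a plain integer with an optional trailing "+".

```python
def parse_runs(field):
    """Split an Iterations field such as '12/16/48+' into two lists.

    Returns (crashed, terminated): iteration counts of runs that ended in a
    crash, and counts of runs that were manually stopped (marked with '+').
    """
    crashed, terminated = [], []
    for token in field.split("/"):
        token = token.strip()
        if token.endswith("+"):
            terminated.append(int(token[:-1]))
        else:
            crashed.append(int(token))
    return crashed, terminated


def summarize(field):
    """Report the number of crashes, the shortest run to failure, and the longest clean run."""
    crashed, terminated = parse_runs(field)
    return {
        "crashes": len(crashed),
        "min_iterations_to_crash": min(crashed, default=None),
        "longest_clean_run": max(terminated, default=None),
    }
```

For example, `summarize("12/16/68/90/26/138/48+/160+/300+")` for the B4 above reports 6 crashes, a shortest run to failure of 12 iterations, and a longest manually stopped run of 300.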

  Changed 7 years ago by gary

Hi John,

Please refer to suspend_resume.xls.

When I measured the main_on and wakeup_ec signals on C1 "SHF3300026", it turned out that the South Bridge sometimes ignores a wakeup_ec event and accepts the next one. I don't know why.

When the CPU hangs, Main_on and the power rails have already been turned on, which means the EC has already woken the South Bridge. Even the MIC LED is turned off by OFW before the CPU hangs, so you may want to verify the LPC and PCI buses. Do you think it is related to the power discharge time? I checked each power rail on a failed XO and they discharge very quickly, so I believe power discharge is not an issue. I will try to test wakeup via the power button.

I will let you know of any further information.

  Changed 7 years ago by wad

Correcting the information submitted above for laptops w. no serial number:

S/N Iterations LED
SHF73300101 (C-5) Power-off on crash
SHF733000E9 (C-5) ~3000 Power-off on crash
SHF7330002D (C-1) 48000+ no crash
SHF73300031 (C-1) 48000+ no crash
SHF7330004C (C-2) 33000 on
SHF73300074 (C-2) 45000+ no crash
SHF73300081 (C-3) ~3000 uninitialized memory pattern on screen, rebooted w. power button
SHF73300083 (C-3) 48000+ no crash
SHF73300113 (C-6) 22000 on
SHF7330010C (C-6) 14000 on

  Changed 7 years ago by jcardona

One of my B3's did not fail after running wackup over the weekend:

SHF720000B0 (B3-4) 106000+ no crash

  Changed 7 years ago by jg

  • cc jg added

  Changed 7 years ago by kimquirk

  • milestone changed from Untriaged to Trial-3

  Changed 7 years ago by wad

This bug will be fixed in the MP build.

This was really a problem with clock stabilization and various power management issues, and is described in detail in the B4 Suspend ECR instructions.

To the user, it shares symptoms with #1752 (USB wireless suspend/resume failure), but differs in that a kernel actually prints messages indicating the problem on #1752.

The definitive description of this problem would be a failure on suspend/resume, soon enough after a resume that Open Firmware is still running. Nothing is printed on the processor's serial console.

Occasional cases where drivers other than the wireless driver crashed the machine upon resume were also fixed by the changes to fix this problem.

  Changed 7 years ago by wad

  • status changed from assigned to closed
  • resolution set to fixed

  Changed 6 years ago by wad

  • cc rsmith added
  • status changed from closed to reopened
  • resolution deleted
  • milestone deleted

This is formal notice that verification of the fix to this problem (by masking SCIs) in recent EC code has failed.

Using build 1372 and Q2D07, both ECO'd and current build machines (specifically those with keyboards, not bare boards) failed to pass 2K suspend/resume cycles triggered by network packet arrival. The test script manually sent the EC a 0x32 command before placing the system in suspend, to work around the fact that the kernel in build 1372 doesn't issue this masking command.

follow-up: ↓ 27   Changed 6 years ago by rsmith

As of firmware Q2D08a this bug should be squashed. But it still needs hours of verification on the test bed.

Also, Marvell is adding some specific test settings to the WLAN firmware that will allow us to tune an auto WLAN wakeup such that the resulting SCI will arrive inside of the "danger" window. This way we can do an extended worst-case test.

  Changed 6 years ago by dsaxena

  • cc dsaxena added

in reply to: ↑ 25   Changed 6 years ago by dsaxena

  • next_action set to never set

Replying to rsmith:

As of firmware Q2D08a this bug should be squashed. But it still needs hours of verification on the test bed. Also, Marvell is adding some specific test settings to the WLAN firmware that will allow us to tune an auto WLAN wakeup such that the resulting SCI will arrive inside of the "danger" window. This way we can do an extended worst-case test.

Hi, I'm going through various high/blocker kernel bugs and trying to assess their state and determine if we still need to attack them for 8.2.

Can we run this through whatever formal testing it needs, or do we have enough field evidence (no reports in 5 months) to close this as fixed?

  Changed 6 years ago by gregorio

  • cc gregorio added
  • milestone set to 8.2.0 (was Update.2)

Hi Deepak,

Do you have an example test? I don't know if I can find someone to try it but if we knew how to test that may help recruit someone.

I changed the milestone to 8.2.0. Let's decide if it's closed and, if not, whether it's really a blocker for that release.

Thanks,

Greg S

  Changed 4 years ago by wad

  • status changed from reopened to closed
  • resolution set to fixed

I am proud to officially declare this ticket closed.
