Ticket #7954 (new defect)

Opened 6 years ago

Last modified 5 years ago

Hang on suspend under high network load.

Reported by: cjb Owned by: dsaxena
Priority: blocker Milestone: 9.1.0-cancelled
Component: kernel Version: not specified
Keywords: blocks-:8.2.0 blocks+:9.1.0 Cc: dilinger, dwmw2, jcardona, rsmith, mbletsas, ashish, gregorio
Action Needed: diagnose Verified: no
Deployments affected: Blocked By:
Blocking:

Description

Hi,

Joe and I had ten laptops each (him running joyride-2294, me running joyride-2295) performing activity updates today. Some of mine had trouble associating, so I think we managed to degrade the wifi spectrum at 1cc (media lab 802.11 AP). One of mine and four of his laptops crashed (completely hung, screen dimmed, power light on solid) either going into or coming out of suspend. We suspect a libertas bug, although it's hard to perform a meaningful diagnosis as none of the machines have a serial console. I can take one of them, add serial and see if there's any response on it if that would help.

This sounds like it could be extremely difficult to debug/reproduce. How should we start?

Change History

follow-up: ↓ 3   Changed 6 years ago by dsaxena

First thing to do is to definitely connect a serial port to each machine (in my opinion ALL QA-level tests should be done with boards that have serial ports that are logged), reproduce the exact same setup you were running in hopes of triggering the bug, and capture the data. All machines should boot with no_console_suspend, and have /proc/sys/kernel/printk set to 9.

My guess is that it is something similar to #7458, but triggered in a real usage scenario. :(
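
A minimal sketch of how the debug setup described above could be checked on each machine before a test run. It assumes the standard Linux paths /proc/cmdline and /proc/sys/kernel/printk, and that the first field of the printk file is the console loglevel. The function names are hypothetical and not part of any existing OLPC tooling.

```python
# Hypothetical helper: verify a machine is booted with the debug
# settings requested for suspend/resume testing.

def has_no_console_suspend(cmdline: str) -> bool:
    """True if the kernel command line contains no_console_suspend."""
    return "no_console_suspend" in cmdline.split()


def console_loglevel(printk: str) -> int:
    """Return the console loglevel, the first field of the printk file."""
    return int(printk.split()[0])


def check_debug_setup(cmdline: str, printk: str, wanted_level: int = 9) -> list:
    """Return a list of human-readable problems; empty means the setup is OK."""
    problems = []
    if not has_no_console_suspend(cmdline):
        problems.append("kernel not booted with no_console_suspend")
    level = console_loglevel(printk)
    if level < wanted_level:
        problems.append(
            "console loglevel is %d, wanted %d" % (level, wanted_level)
        )
    return problems


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            cmdline = f.read()
        with open("/proc/sys/kernel/printk") as f:
            printk = f.read()
    except OSError as exc:
        print("cannot read kernel settings: %s" % exc)
    else:
        for problem in check_debug_setup(cmdline, printk):
            print("WARNING: " + problem)
```

Running this on each laptop before starting the activity updates would flag any machine whose hang could not be captured over serial.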

follow-up: ↓ 4   Changed 6 years ago by wad

  • cc rsmith added

Time to revive the Suspend/Resume testbed (which provides serial logging for tens of machines).

in reply to: ↑ 1   Changed 6 years ago by rsmith

Replying to dsaxena:

First thing to do is to definitely connect a serial port to each machine (in my opinion ALL QA-level tests should be done with boards that have serial ports that are logged) and reproduce

You can't do just one side. You have to test both. The timing difference between no_console_suspend @ loglevel 9 and with console suspend @ normal loglevel is quite large. So all 1st and 2nd level QA tests can have full serial enabled but before it goes out the door you have to repeat everything with all debugging turned off.

That or ship it with debugging enabled. :)

in reply to: ↑ 2   Changed 6 years ago by rsmith

Replying to wad:

Time to revive the Suspend/Resume testbed (which provides serial logging for tens of machines).

We have several machines with serial ports and I've been slowly adding EC serial connectors as well. cjb, come grab me today and we can set up the tests on the machines I have.

  Changed 6 years ago by kimquirk

  • keywords blocks?:8.2 added

Would like to see how this works with new firmware. I don't believe we will be adding serial connectors to the 100+ laptop testbed, but we should do that for the suspend/resume testbed.

follow-up: ↓ 8   Changed 6 years ago by cjb

Looks like we just saw a suspend crash on a machine running the new firmware in joyride-2323. :( No serial port on it.

  Changed 6 years ago by cjb

  • cc ashish added

in reply to: ↑ 6   Changed 6 years ago by tomeu

Replying to cjb:

Looks like we just saw a suspend crash on a machine running the new firmware in joyride-2323. :( No serial port on it.

Sounds like #8143 ?

  Changed 6 years ago by cjb

Yeah, agreed.

  Changed 6 years ago by mstone

  • keywords blocks-:8.2.0 added; blocks?:8.2 removed
  • next_action changed from never set to diagnose
  • milestone changed from 8.2.0 (was Update.2) to 9.1.0

We disabled idlesuspend in order to be able to downgrade the priority of this ticket; however, it's still worth keeping in mind.

  Changed 5 years ago by dsaxena

  • cc gregorio added
  • keywords blocks+:9.1.0 added

I am going through all kernel bugs marked as 9.1 or future release and updating their status, next action, etc. in preparation for 9.1 bug scrubbing and future release planning.

We need to start running suspend/resume stress tests again on joyride to see if we still hit this and #7458.
