Ticket #9565 (closed defect: fixed)

Opened 4 years ago

Last modified 4 years ago

system freeze under os33

Reported by: pgf
Owned by:
Priority: blocker
Milestone:
Component: not assigned
Version: not specified
Keywords:
Cc: dsaxena, mtd
Action Needed: diagnose
Verified: no
Deployments affected:
Blocked By:
Blocking:

Description

i've been testing suspend/resume from linux. several times i've experienced full system lockups -- no mouse, no keyboard, no serial console, no response of any sort.

jnetlett reports seeing the same symptom -- his theory: "... a full system freeze if I stress the PCI bus."

Change History

  Changed 4 years ago by pgf

another possibility -- we're spinning with continuous SCI interrupts? twice, while acpi debug tracing has been on, i've experienced lockups.

first time it was continuous like this:

[ 1504.061847]  evevent-0249 [00] ev_fixed_event_detect : Fixed Event Block: Enable 00000100 Status 00000801
[ 1504.072072]    evgpe-0445 [00] ev_gpe_detect         : Read GPE Register at GPE0: Status=00, Enable=00
[ 1504.082025]    evgpe-0445 [00] ev_gpe_detect         : Read GPE Register at GPE8: Status=00, Enable=08
[ 1504.092005] olpc-dcon: scanline interrupt w/CPU

second time like this:

[ 1504.061847]  evevent-0249 [00] ev_fixed_event_detect : Fixed Event Block: Enable 00000100 Status 00000801
[ 1504.072072]    evgpe-0445 [00] ev_gpe_detect         : Read GPE Register at GPE0: Status=00, Enable=00
[ 1504.082025]    evgpe-0445 [00] ev_gpe_detect         : Read GPE Register at GPE8: Status=00, Enable=08
[ 1504.092005] olpc-dcon: scanline interrupt w/CPU

we wouldn't get the dcon message if acpi had handled the interrupt (and it's not for dcon either). this implies an asserted interrupt from elsewhere.
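The reasoning above can be checked against the register values in the trace. The following is a minimal sketch, assuming the usual ACPI rule that an event source can only assert SCI when a bit is set in both its status register and the matching enable register. The helper names are hypothetical and are not taken from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a source is pending only if some bit is set in both
 * the status register and the corresponding enable register. */
static bool acpi_source_pending(uint32_t status, uint32_t enable)
{
    return (status & enable) != 0;
}

/* Register values copied from the debug trace in this comment. */
static bool trace_shows_acpi_source(void)
{
    return acpi_source_pending(0x00000801, 0x00000100) /* fixed event block */
        || acpi_source_pending(0x00, 0x00)             /* GPE0 */
        || acpi_source_pending(0x00, 0x08);            /* GPE8 */
}
```

Under that assumption none of the three register pairs has an enabled bit with its status set, which is consistent with the interrupt coming from a source outside these registers.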

one obvious interrupt source that would cause SCI but not be shown in those registers is SMBALRT#. this source is enabled, because DCONIRQ uses it on B2. i'll do some probing.

  Changed 4 years ago by pgf

SMBALRT (see above) doesn't seem to be the problem.

however, i can often get the system to lock up by doing several suspend/resume cycles ("echo mem > /sys/power/state", followed by a power button push), and then touching the touchpad.

(also, unsaid above: /var/log/messages contains nothing past what the console log emitted before stopping.)

  Changed 4 years ago by pgf

and, for the record, removing the wireless card doesn't affect the problem.

  Changed 4 years ago by pgf

as in #9581, X is taking almost all the CPU at the time of the lockup. in this case, in at least one instance, there was no explicit X movement during the "awake" period. in another case, only the X11 cursor was moved.

(note, however, that the X11 cursor does move on its own during the suspend -- see #9561)

  Changed 4 years ago by pgf

if anything, this lockup (or, perhaps, a very similar lockup) may be easier to recreate without X running than with.

  Changed 4 years ago by pgf

  • cc dsaxena added

if i revert 3517a0561e15c68734bb9dc59405030f66450a23, i can no longer lock the system (either with or without X) with simple suspend/resume cycles. (the two 'count' args to memcpy_fromio() seem odd.)

i _can_ still lock the system as in #9581.

i _can_ still lock the system as in #9420. the difference in that case (in my testing -- i can't really speak for cjb) is that i reproduce #9420 with a loop:

    while :
    do
       echo mem > /sys/power/state
    done

whereas this bug i reproduce by hand. my suspicion lies in the fact that the console session is terminated and restarted across the suspend/resume. the loop which reproduces #9420 attempts to keep running across the s/r cycles. do we understand why the console session restarts?

  Changed 4 years ago by dsaxena

Replying to pgf:

if i revert 3517a0561e15c68734bb9dc59405030f66450a23, i can no longer lock the system (either with or without X) with simple suspend/resume cycles. (the two 'count' args to memcpy_fromio() seem odd.)

Yep, looks like a buglet, I'm copying 0xff fromio but writing back 0x100 toio.
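To illustrate what a length mismatch of this kind does, here is a userspace sketch that uses plain memcpy() as a stand-in for memcpy_fromio()/memcpy_toio(). The buffer, the fill values, and the function name are hypothetical; only the two lengths (0xff and 0x100) come from the comment above, and the sketch does not claim which of the two is the intended size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SAVE_LEN    0xff   /* bytes copied out before suspend */
#define RESTORE_LEN 0x100  /* bytes written back after resume */

/* Hypothetical stand-in for the device register window. */
static uint8_t fake_regs[RESTORE_LEN];

/* Save and restore the window through a buffer, then return the value
 * that ends up in the last byte of the window. */
static uint8_t roundtrip_last_byte(size_t save_len, size_t restore_len)
{
    uint8_t saved[RESTORE_LEN];

    memset(saved, 0xAA, sizeof(saved));          /* stale buffer contents */
    memset(fake_regs, 0x55, sizeof(fake_regs));  /* live register state */

    memcpy(saved, fake_regs, save_len);          /* models memcpy_fromio() */
    memset(fake_regs, 0x00, sizeof(fake_regs));  /* state lost over suspend */
    memcpy(fake_regs, saved, restore_len);       /* models memcpy_toio() */

    return fake_regs[RESTORE_LEN - 1];
}
```

With the mismatched lengths, the last byte written back was never saved, so it receives whatever the buffer held beforehand (0xAA here) rather than the live value (0x55). With equal lengths the live value survives the round trip.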

i _can_ still lock the system as in #9581.

i _can_ still lock the system as in #9420. the difference in that case (in my testing -- i can't really speak for cjb) is that i reproduce #9420 with a loop:

    while :
    do
       echo mem > /sys/power/state
    done

whereas this bug i reproduce by hand. my suspicion lies in the fact that the console session is terminated and restarted across the suspend/resume. the loop which reproduces #9420 attempts to keep running across the s/r cycles. do we understand why the console session restarts?

I've opened a separate bug for that issue, #9584

  Changed 4 years ago by pgf

it seems that reverting the above-mentioned commit was not the fix (though i've pushed the obvious fix to that commit in any case).

  Changed 4 years ago by dsaxena

  • status changed from new to closed
  • next_action changed from never set to test in build
  • resolution set to fixed

Fixed in kernel commit d05d8854653f8ddc86388bc90e63fe85a23360bb

  Changed 4 years ago by pgf

in parallel, #9581 was closed as a dup of this bug. deepak -- if you don't think all of the symptoms of #9581 have been fixed, could you reopen it, please?

  Changed 4 years ago by pgf

  • status changed from closed to reopened
  • resolution deleted

this seems to be a separate bug from #9581, so reopening.

also, the "move start-sugar icon quickly" test, which locks up the screen, seems to lock things up when you hit the left edge of the screen with the icon, and not before.

  Changed 4 years ago by Quozl

  • next_action changed from test in build to diagnose

  Changed 4 years ago by cjb

Current theory on the left-edge-of-the-screen graphics hang, before the first suspend/resume, is that it's an openchrome driver bug -- Jon N says that it doesn't happen with the via driver, and he's on top of it. Go him!

  Changed 4 years ago by mtd

  • cc mtd added

  Changed 4 years ago by Quozl

reproduced hang on os36 with "move start-sugar icon quickly" test.

  Changed 4 years ago by Quozl

  • blocking 9642 added

  Changed 4 years ago by Quozl

  • blocking 9642 removed

  Changed 4 years ago by Quozl

fixed in os40, "move start-sugar icon quickly" tests now pass.

was there anything else in this ticket that is yet to be fixed?

  Changed 4 years ago by reuben

In OS 42, I had a freeze while watching a web page animation. I can also consistently lock up the system when using Wine, as documented here:

http://dev.laptop.org/ticket/9581

  Changed 4 years ago by Quozl

  • status changed from reopened to closed
  • resolution set to fixed

retriage. tracking remaining issue in #9581

  Changed 4 years ago by anonymous

  • milestone deleted

Milestone 1.5-software deleted
