Ticket #7479 (new defect)

Opened 19 months ago

Last modified 8 weeks ago

Suspend failing due to EC warning

Reported by: cjb Owned by: dsaxena
Priority: blocker Milestone:
Component: kernel Version: Update.1
Keywords: Cc: rsmith, dilinger
Action Needed: diagnose Verified: no
Deployments affected: Blocked By:
Blocking:

Description

Seen by the tinderbox's suspend tests, after 30 iterations:

[ 7718.043376] PM: Syncing filesystems ... done.
[ 7718.044006] PM: Preparing system for mem sleep
[ 7718.044279] Freezing user space processes ... (elapsed 0.10 seconds) done.
[ 7718.147318] Freezing remaining freezable tasks ... (elapsed 0.14 seconds) don
e.
[ 7718.286397] PM: Entering mem sleep
[ 7718.286411] Suspending console(s)
[ 7718.288083] ehci_bus_suspend: USBCMD: 0x10009
[ 7718.288099] ehci_bus_suspend: USBSTS: 0x8
[ 7718.288115] ehci_bus_suspend: USBCMD: 0x10009
[ 7718.288129] ehci_bus_suspend: USBSTS: 0x8
[ 7718.288834] ehci_bus_suspend: USBCMD: 0x10008
[ 7718.288848] ehci_bus_suspend: USBSTS: 0x1008
[ 7718.313483] ehci_pci_suspend: USBCMD: 0x10008
[ 7718.313499] ehci_pci_suspend: USBSTS: 0x1008
[ 7718.347360] olpc-ec:  running cmd 0x1c
[ 7718.350390] olpc-ec:  received 0x7f
[ 7718.350390] ------------[ cut here ]------------
[ 7718.350390] WARNING: at arch/x86/kernel/olpc-pm.c:313 olpc_ec_mask_unset+0x43/0x7c()
[ 7718.350390] Modules linked in: sg usb8xxx libertas usb_storage libusual ieee80211 ieee80211_crypt pcspkr mousedev ehci_hcd ohci_hcd cs5535_gpio sdhci mmc_core usbcore
[ 7718.350390] Pid: 1528, comm: bash Not tainted 2.6.25-20080710.3.olpc.14813f826d6ccaf #1
[ 7718.350390]  [<c041492b>] warn_on_slowpath+0x40/0x4f
[ 7718.350390]  [<c05e5db2>] __down_trylock+0x3a/0x43
[ 7718.350390]  [<c05eb7e7>] __down_failed_trylock+0x7/0xc
[ 7718.350390]  [<c05ebc47>] _spin_unlock+0xf/0x22
[ 7718.350390]  [<c04154b0>] vprintk+0x27d/0x2f7
[ 7718.350390]  [<c0429c7d>] tick_notify+0x1e5/0x1ef
[ 7718.350390]  [<c04da506>] delay_tsc+0x2c/0x43
[ 7718.350390]  [<c041553e>] printk+0x14/0x18
[ 7718.350390]  [<c05ebcb2>] _spin_unlock_irqrestore+0x14/0x29
[ 7718.350390]  [<c0409c05>] olpc_ec_cmd+0x17d/0x188
[ 7718.350390]  [<c0409fbb>] olpc_ec_mask_unset+0x43/0x7c
[ 7718.350390]  [<c040a11e>] olpc_fixup_sleep+0x12a/0x141
[ 7718.350390]  [<c040a347>] olpc_pm_enter+0x10/0x49
[ 7718.350390]  [<c0430b99>] suspend_devices_and_enter+0xd8/0x118
[ 7718.350390]  [<c0430d01>] enter_state+0x10a/0x179
[ 7718.350390]  [<c0430e00>] state_store+0x90/0xa8
[ 7718.350390]  [<c0430d70>] state_store+0x0/0xa8
[ 7718.350390]  [<c04d6623>] kobj_attr_store+0x18/0x1c
7718.350390]  [<c048b329>] sysfs_write_file+0x9f/0xd3
[ 7718.350390]  [<c048b28a>] sysfs_write_file+0x0/0xd3
[ 7718.350390]  [<c0456325>] vfs_write+0x83/0xfc
[ 7718.350390]  [<c04567e1>] sys_write+0x3c/0x63
[ 7718.350390]  [<c0403782>] sysenter_past_esp+0x5f/0x85
[ 7718.350390]  =======================
[ 7718.350390] ---[ end trace edd52afcef97ca53 ]---
[ 7718.350390] olpc-ec:  running cmd 0x1b
[ 7718.350390] olpc-ec:  sending cmd arg 0x3b
[ 7718.350390] olpc-ec:  running cmd 0x32
[ 7718.350390] olpc_do_sleep!

On the console, we see:

[1215734390.02] -bash-3.2# echo mem > /sys/power/state
[1215734390.04] -bash: echo: write error: Device or resource busy

(It's possible that something else is causing the write error, but I can't see what it might be. Full dmesg uploaded.)

Attachments

suspend-fail.log (121.6 kB) - added by cjb 19 months ago.

Change History

Changed 19 months ago by cjb

  Changed 19 months ago by dsaxena

The warning is due to the following:

        /* the high bit is unused, if it is ever unset, that is a good sign
           sign of EC communication corruption! */
        WARN_ON(!(byte & 0x80));

Looking at the logs, we are receiving 0x7f on issuing the READ_SCI_MASK. After we receive this, we go ahead and run WRITE_SCI_MASK with a value of 0x3b, go to sleep, and then get woken back up. At this point, we do another READ_SCI_MASK and hit the WARN_ON() again since the high but is unset. Richard, Andres, from the WARN_ON() I assume we've seen this before...what caused it in the past?

From the log, it looks like even with the warning we're suspending/resuming fine. From a quick walk through the code path, the only -EBUSY I see is the following snippet:

        if (!mutex_trylock(&pm_mutex))
                return -EBUSY;

I'll dig more to see if there are any other -EBUSYs in the code path.

Another oddity from my POV, but might be expected behaviour since I'm still figuring out how all our bits work, is "olpm-pm: PM_PWRBTN wakeup event received". Given that this is on tinderbox, running automatic S/R cycles, I'm surprised to see power button events.

follow-up: ↓ 3   Changed 19 months ago by cjb

Given that this is on tinderbox, running automatic S/R cycles, I'm surprised to see power button events.

I can explain that part. The tinderbox is hooked up to a power measurement device that also has general purpose I/O lines, and one of them has been hooked up to the power button's trigger, meaning that I can artificially hit the power button in the tinderbox code.

I think I know more about the warning now, too -- I remember Richard complaining about EC mask corruption a week or so ago, and believing that it was his fault somehow.

What I can't explain is the EBUSY. It does look like it could be that mutex line, given the top three functions in the stack trace:

[ 7718.350390]  [<c041492b>] warn_on_slowpath+0x40/0x4f
[ 7718.350390]  [<c05e5db2>] __down_trylock+0x3a/0x43
[ 7718.350390]  [<c05eb7e7>] __down_failed_trylock+0x7/0xc

in reply to: ↑ 2   Changed 19 months ago by dsaxena

Replying to cjb:

What I can't explain is the EBUSY. It does look like it could be that mutex line, given the top three functions in the stack trace: {{{ [ 7718.350390] [<c041492b>] warn_on_slowpath+0x40/0x4f [ 7718.350390] [<c05e5db2>] down_trylock+0x3a/0x43 [ 7718.350390] [<c05eb7e7>] down_failed_trylock+0x7/0xc }}}

This stack trace is a bit odd in general:

[ 7697.409260] Pid: 1528, comm: bash Not tainted 2.6.25-20080710.3.olpc.14813f826d6ccaf #1
[ 7697.409260]  [<c041492b>] warn_on_slowpath+0x40/0x4f           # Warning message printed here
[ 7697.409260]  [<c05e5db2>] __down_trylock+0x3a/0x43
[ 7697.409260]  [<c05eb7e7>] __down_failed_trylock+0x7/0xc         
[ 7697.409260]  [<c05ebc47>] _spin_unlock+0xf/0x22                #
[ 7697.409260]  [<c04154b0>] vprintk+0x27d/0x2f7                  # Other things happening here
[ 7697.409260]  [<c04da506>] delay_tsc+0x2c/0x43                  # What exactly?
[ 7697.409260]  [<c04da506>] delay_tsc+0x2c/0x43
[ 7697.409260]  [<c041553e>] printk+0x14/0x18
[ 7697.409260]  [<c05ebcb2>] _spin_unlock_irqrestore+0x14/0x29
[ 7697.409260]  [<c0409c05>] olpc_ec_cmd+0x17d/0x188               # WARN_ON() call is here
[ 7697.409260]  [<c040a178>] olpc_ec_mask_set+0x43/0x7a
[ 7697.409260]  [<c040a36d>] olpc_pm_enter+0x36/0x49
[ 7697.409260]  [<c0430b99>] suspend_devices_and_enter+0xd8/0x118
[ 7697.409260]  [<c0430d01>] enter_state+0x10a/0x179
[ 7697.409260]  [<c0430e00>] state_store+0x90/0xa8
[ 7697.409260]  [<c0430d70>] state_store+0x0/0xa8
[ 7697.409260]  [<c04d6623>] kobj_attr_store+0x18/0x1c
[ 7697.409260]  [<c048b329>] sysfs_write_file+0x9f/0xd3
[ 7697.409260]  [<c048b28a>] sysfs_write_file+0x0/0xd3
[ 7697.409260]  [<c0456325>] vfs_write+0x83/0xfc
[ 7697.409260]  [<c04567e1>] sys_write+0x3c/0x63
[ 7697.409260]  [<c0403782>] sysenter_past_esp+0x5f/0x85

I'm going to commit a change to testing repo to enable CONFIG_DEBUG_MUTEX so we can get more information next time we see this in testing.

  Changed 19 months ago by gregorio

  • milestone deleted

Milestone Never Assigned deleted

  Changed 8 weeks ago by yade

Note: See TracTickets for help on using tickets.