Ticket #12039 (closed defect: fixed)

Opened 21 months ago

Last modified 18 months ago

Botched resume +X on XO-1.5

Reported by: dsd Owned by: dsd
Priority: normal Milestone: 13.1.0
Component: kernel Version: not specified
Keywords: Cc: greenfeld
Action Needed: no action Verified: no
Deployments affected: Blocked By:
Blocking:

Description

Looking at old IRC logs it seems like the bug where the XO-1.5 occasionally performs a full reboot on resume after printing "+X" is not an unknown issue.

But this issue is still present today, and we have a feeling that it might be more common than before (with new kernel, firmware, motherboard revision?).

Here's how it appears:

screenlog.3-4ddc646-reboots-[ 2007.229908] libertas_sdio mmc1:0001:1: Suspend without wake params -- powering down card
screenlog.3-4ddc646-reboots-[ 2007.250703] mmc1: card 0001 removed
screenlog.3-4ddc646-reboots-[ 2007.320139] PM: suspend of devices complete after 224.434 msecs
screenlog.3-4ddc646-reboots-[ 2007.326607] ehci_hcd 0000:00:10.4: wake-up capability enabled by ACPI
screenlog.3-4ddc646-reboots-[ 2007.350165] uhci_hcd 0000:00:10.2: wake-up capability enabled by ACPI
screenlog.3-4ddc646-reboots-[ 2007.370170] uhci_hcd 0000:00:10.1: wake-up capability enabled by ACPI
screenlog.3-4ddc646-reboots-[ 2007.390159] uhci_hcd 0000:00:10.0: wake-up capability enabled by ACPI
screenlog.3-4ddc646-reboots-[ 2007.410391] PM: late suspend of devices complete after 84.203 msecs
screenlog.3-4ddc646-reboots-[ 2007.416905] ACPI: Preparing to enter system sleep state S3
screenlog.3-4ddc646-reboots-[ 2007.520175] PM: Saving platform NVS memory
screenlog.3-4ddc646-reboots:+r+r+X+
screenlog.3-4ddc646-reboots-Forthmacs 
screenlog.3-4ddc646-reboots-Type 'i' to interrupt stand-init sequence
screenlog.3-4ddc646-reboots-Unknown value in TS tag
screenlog.3-4ddc646-reboots-USB2 devices:
screenlog.3-4ddc646-reboots-USB1 devices:
screenlog.3-4ddc646-reboots-OLPC D5, 1 GiB memory installed, 4 GB internal storage, S/N SHC01901E0B
screenlog.3-4ddc646-reboots-OpenFirmware  Q3C07   EC Firmware Ver:2.2.10   2012-08-02 16:36:31 UTC
screenlog.3-4ddc646-reboots-
--
screenlog.3-4ddc646-reboots-[ 2009.314240] libertas_sdio mmc1:0001:1: Suspend without wake params -- powering down card
screenlog.3-4ddc646-reboots-[ 2009.330793] mmc1: card 0001 removed
screenlog.3-4ddc646-reboots-[ 2009.400142] PM: suspend of devices complete after 224.617 msecs
screenlog.3-4ddc646-reboots-[ 2009.406586] ehci_hcd 0000:00:10.4: wake-up capability enabled by ACPI
screenlog.3-4ddc646-reboots-[ 2009.430167] uhci_hcd 0000:00:10.2: wake-up capability enabled by ACPI
screenlog.3-4ddc646-reboots-[ 2009.450174] uhci_hcd 0000:00:10.1: wake-up capability enabled by ACPI
screenlog.3-4ddc646-reboots-[ 2009.470161] uhci_hcd 0000:00:10.0: wake-up capability enabled by ACPI
screenlog.3-4ddc646-reboots-[ 2009.490395] PM: late suspend of devices complete after 84.229 msecs
screenlog.3-4ddc646-reboots-[ 2009.496925] ACPI: Preparing to enter system sleep state S3
screenlog.3-4ddc646-reboots-[ 2009.600176] PM: Saving platform NVS memory
screenlog.3-4ddc646-reboots:+r+X+
screenlog.3-4ddc646-reboots-Forthmacs 
screenlog.3-4ddc646-reboots-Type 'i' to interrupt stand-init sequence
screenlog.3-4ddc646-reboots-Unknown value in TS tag
screenlog.3-4ddc646-reboots-USB2 devices:
screenlog.3-4ddc646-reboots-USB1 devices:
screenlog.3-4ddc646-reboots-OLPC D5, 1 GiB memory installed, 4 GB internal storage, S/N SHC01901E0B
screenlog.3-4ddc646-reboots-OpenFirmware  Q3C07   EC Firmware Ver:2.2.10   2012-08-02 17:24:59 UTC

Sam has managed to reproduce this several times in the last few hours over a group of 5 laptops. I also hit it once this morning with 1 laptop. My test case (which is practically the same as what sam is doing) is:

  1. Boot 12.1.0 build 18
  2. Update to latest x86-3.3 kernel
  3. Modify runin code to disable the X part
  4. touch /runin/{soiled,aggressive,no-camera,force}
  5. reboot

Mitch, when back from bringup and when you aren't flooded with high-priority tasks, perhaps we could run a debugging session to get a better understanding of this issue and maybe a fix.

Change History

Changed 21 months ago by dsd

I can only reproduce this issue on one of my 2 test systems: SHC01601310

wad suggested on IRC that this may be due to a device not being fully shut down during suspend. Any access to main memory (which could happen from DMA-using devices) will cause memory rot if done while the RAM is in self-refresh during suspend.

I've reproduced it on the 3.3 kernel trimmed down, with all of the following excluded: framebuffer, dcon, usb, mmc, wifi, mouse, sound, ...

And it reproduces in tiny core linux (with rtcwake added).

Here is an archive that can be extracted onto a USB disk to replicate this setup: http://dev.laptop.org/~dsd/20120804/12039.zip

It boots and then starts a repeated suspend/resume cycle. Remember to attach serial as you won't get anything on screen. And unplug the USB disk after booting so that its obvious when the system has rebooted coming out of suspend.

Now I'll attempt to reproduce with old kernels and firmware versions.

Changed 21 months ago by dsd

  • owner changed from wmb@… to dsd
  • status changed from new to assigned
  • component changed from ofw - open firmware to kernel

This is something kernel related - or is at least triggered that way. Will see what we can learn on this front before bugging Mitch.

Reproduced on olpc-2.6.35 with this config: http://dev.laptop.org/~dsd/20120806/config.12039.bad

but not with this one (which we used for 11.3): http://dev.laptop.org/~dsd/20120806/config.12039.good

now to bisect the differences.

Changed 21 months ago by dsd

On 2.6.35, the +X issue readily manifests when no video=viafb line is present. It doesn't appear if "video=viafb:viafb_mode=1200x900,viafb_bpp=24,viafb_active_dev=LCD,viafb_lcd_panel_id=23" is set.

Bisecting that: "video=viafb:viafb_active_dev=LCD,viafb_lcd_panel_id=23,viafb_bpp=24" still causes the crash

"video=viafb:viafb_mode=1200x900" seems to avoid the crash but always causes a hang during resume after second 371 as logged in dmesg (even on multiple laptops). Trying to get a handle on this hang...

Changed 21 months ago by dsd

Difficult to catch because of the console sem being held by this code. The hang is:

<1>[  506.371493] BUG: unable to handle kernel paging request at ec0cc000
<1>[  506.379183] IP: [<b055f4ce>] viafb_cursor+0x3c9/0x433
<4>[  506.380007] *pde = 384a3067 *pte = 00000000 
<0>[  506.380007] Oops: 0002 [#1] PREEMPT 
<0>[  506.380007] last sysfs file: /sys/power/state
call hw_breakpoint_exceptions_notify
call kprobe_exceptions_notify
call trace_die_handler
<4>[  506.380007] Modules linked in:
<4>[  506.380007] 
<4>[  506.380007] Pid: 407, comm: async/10 Not tainted 2.6.35.13+ #150 XO/XO
<4>[  506.380007] EIP: 0060:[<b055f4ce>] EFLAGS: 00010286 CPU: 0
<4>[  506.380007] EIP is at viafb_cursor+0x3c9/0x433
<4>[  506.380007] EAX: ec0cc000 EBX: e9bbfbc8 ECX: 00000800 EDX: e92aaa54
<4>[  506.380007] ESI: ea536000 EDI: ec0cc000 EBP: e9bbfb90 ESP: e9bbfb40
<4>[  506.380007]  DS: 0068 ES: 007b FS: 0000 GS: 0000 SS: 0068
<0>[  506.380007] Process async/10 (pid: 407, ti=e9bbf000 task=e84ca5e0 task.ti=e9bbf000)
<0>[  506.380007] Stack:
<4>[  506.380007]  b053b8af 00000046 b0932390 e9bafb00 e84a0600 e9bafae0 00000000 e9bb0000
<4>[  506.380007] <0> 00000000 e84a0400 00000282 e92aa270 00000046 00000020 e92aa000 0000000c
<4>[  506.380007] <0> ea534000 e84a0000 ea408600 e851feae e9bbfc1c b0559c74 ffff00ff 00000000
<0>[  506.380007] Call Trace:
<4>[  506.380007]  [<b053b8af>] ? __debug_check_no_obj_freed+0xe6/0x139
<4>[  506.380007]  [<b0559c74>] ? bit_cursor+0x42b/0x441
<4>[  506.380007]  [<b070bbe0>] ? sub_preempt_count+0x9/0x92
<4>[  506.380007]  [<b04037b4>] ? do_IRQ+0x7b/0x8e
<4>[  506.380007]  [<b0559849>] ? bit_cursor+0x0/0x441
<4>[  506.380007]  [<b0557383>] ? fbcon_cursor+0x208/0x22c
<4>[  506.380007]  [<b0559849>] ? bit_cursor+0x0/0x441
<4>[  506.380007]  [<b05a8191>] ? set_cursor+0x58/0x64
<4>[  506.380007]  [<b05a83c2>] ? redraw_screen+0x167/0x1d2
<4>[  506.380007]  [<b0558f23>] ? fbcon_event_notify+0xad/0x4a1
<4>[  506.380007]  [<b070bd7f>] ? notifier_call_chain+0x79/0x9f
<4>[  506.380007]  [<b0438ec2>] ? sched_clock_local.constprop.2+0xe/0x138
<4>[  506.380007]  [<b04377e1>] ? hrtimer_forward+0x103/0x12a
<4>[  506.380007]  [<b043b50f>] ? ktime_get+0x64/0xf5
<4>[  506.380007]  [<b040e815>] ? lapic_next_event+0x1b/0x1f
<4>[  506.380007]  [<b043ebbf>] ? clockevents_program_event+0xba/0xd0
<4>[  506.380007]  [<b04060d9>] ? native_sched_clock+0x3d/0x63
<4>[  506.380007]  [<b05343a2>] ? rb_insert_color+0x9c/0xc0
<4>[  506.380007]  [<b04060d9>] ? native_sched_clock+0x3d/0x63
<4>[  506.380007]  [<b04060d9>] ? native_sched_clock+0x3d/0x63
<4>[  506.380007]  [<b0535276>] ? put_dec+0x2a/0x74
<4>[  506.380007]  [<b0535f89>] ? number.isra.3+0x114/0x1e4
<4>[  506.380007]  [<b04377e1>] ? hrtimer_forward+0x103/0x12a
<4>[  506.380007]  [<b04060d9>] ? native_sched_clock+0x3d/0x63
<4>[  506.380007]  [<b0536b61>] ? vsnprintf+0x20f/0x266
<4>[  506.380007]  [<b05369c6>] ? vsnprintf+0x74/0x266
<4>[  506.380007]  [<b070bbe0>] ? sub_preempt_count+0x9/0x92
<4>[  506.380007]  [<b0709743>] ? _raw_spin_unlock_irqrestore+0x1c/0x31
<4>[  506.380007]  [<b041b262>] ? get_parent_ip+0xb/0x31
<4>[  506.380007]  [<b070bc4c>] ? sub_preempt_count+0x75/0x92
<4>[  506.380007]  [<b0421794>] ? vprintk+0x318/0x34e
<4>[  506.380007]  [<b0438752>] ? __blocking_notifier_call_chain+0x37/0x4f
<4>[  506.380007]  [<b043877b>] ? blocking_notifier_call_chain+0x11/0x13
<4>[  506.380007]  [<b054d592>] ? fb_notifier_call_chain+0x16/0x18
<4>[  506.380007]  [<b054eda2>] ? fb_set_suspend+0x51/0x5f
<4>[  506.380007]  [<b05676d2>] ? viafb_resume+0xc10/0xc36
<4>[  506.380007]  [<b0566191>] ? via_resume+0x39/0x50
<4>[  506.380007]  [<b0547f74>] ? pci_pm_resume+0x8d/0xa0
<4>[  506.380007]  [<b05bda4d>] ? pm_op+0x48/0x7c
<4>[  506.380007]  [<b05bdcf2>] ? device_resume+0x4a/0xc9
<4>[  506.380007]  [<b05bdd8c>] ? async_resume+0x1b/0x3f
<4>[  506.380007]  [<b0439bf1>] ? async_thread+0xe5/0x1c5
<4>[  506.380007]  [<b0709755>] ? _raw_spin_unlock_irqrestore+0x2e/0x31
<4>[  506.380007]  [<b041d9ed>] ? default_wake_function+0x0/0x12
<4>[  506.380007]  [<b0439b0c>] ? async_thread+0x0/0x1c5
<4>[  506.380007]  [<b0434603>] ? kthread+0x64/0x69
<4>[  506.380007]  [<b043459f>] ? kthread+0x0/0x69
<4>[  506.380007]  [<b0402c36>] ? kernel_thread_helper+0x6/0x10
<0>[  506.380007] Code: 39 c2 7c e3 8b 75 dc b9 00 08 00 00 a1 b4 23 96 b0 8b 56 20 8b 80 50 02 00 00 8b 75 f0 03 82 f0 00 00 00 81 c6 00 20 00 00 89 c7 <f3> a5 8b 45 f0 e8 69 16 f3 ff eb 33 8b 7d f0 0f b6 34 0f 41 c7 
<0>[  506.380007] EIP: [<b055f4ce>] viafb_cursor+0x3c9/0x433 SS:ESP 0068:e9bbfb40
<0>[  506.380007] CR2: 00000000ec0cc000
<4>[  506.380007] ---[ end trace 8bc2b11483f8d0ef ]---

Changed 21 months ago by dsd

That hang was a memory leak, hitting death after a certain number of suspends. Fixed in 2.6.37 by a commit titled "viafb: fix hardware acceleration for suspend & resume"

Tested 2.6.35 without viafb compiled in; it also crashes.

So the conclusion from the 2.6.35 tests seems to indicate that the +X crash case is present in the system "by default", but gets fixed by viafb when enabled *and* configured especially for the XO screen.

Vanilla 2.6.37 with correctly configured viafb doesn't seem to crash. Vanilla 3.3 does (and now auto-configured viafb based on detecting the XO-1.5). Will start a bisection here.

Changed 21 months ago by dsd

bad: [b715ce947f51c6637e78b262501f0c4ff9d848cc] tg3: Fix link down notify failure when EEE disabled

Changed 21 months ago by dsd

good: [e16b396ce314b2bcdfe6c173fe075bf8e3432368] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial

Changed 21 months ago by dsd

bad: [f4b10bc60a310916bab5413f821b99ef845cac17] Merge branch 'kvm-updates/2.6.40' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Changed 20 months ago by dsd

The bisection was made difficult because of viafb breaking on XO-1.5 during various points in this history. Anyway, with some manual intervention I'm now down to just a handful of potentially responsible commits:

good: c5a4e6d725654045e548b1a1f33059a44a06910f bad: f33f6f0ccc5ff7415da3feb18b1ff966ed2d80d1

Changed 20 months ago by dsd

good: 75ec72f8c53b9f981e68704432a2e425a01f79b7 (survived overnight test)

Changed 20 months ago by dsd

good: c5593d26aea3aba2a7703a0b5d74b4ea8b726889

bad: 2946294f9aa734efc5873ea2f34131d0a8c0f89a

At this point, extra commits must be added to every test to fix viafb on XO-1.5.

git cherry-pick b4ce6a285b65be4fb858728b3bbe9011242b769f
git cherry-pick 32fab7bcc79ee0b97277627f456c94202858d851
git cherry-pick c5a4e6d725654045e548b1a1f33059a44a06910f

Changed 20 months ago by dsd

Found it, caused by b692a63af8b63a7a7e84702a713d0072e336b326

viafb: add VIA slapping capability

Changed 19 months ago by dsd

  • next_action changed from never set to add to build
  • milestone changed from Not Triaged to 13.1.0

http://marc.info/?l=linux-fbdev&m=134530521717042&w=2

Patch accepted upstream

Author: Daniel Drake <dsd@laptop.org>
Date:   Tue Sep 4 09:35:40 2012 -0600

    viafb: don't touch clock state on OLPC XO-1.5

pushed to x86-3.3 as 3ef9f18dfe5

Changed 19 months ago by pbrobinson

  • next_action changed from add to build to test in build

should be in os4

Changed 18 months ago by greenfeld

  • status changed from assigned to closed
  • next_action changed from test in build to no action
  • resolution set to fixed

I ran a modified version of runin (with the settings in this ticket) on a XO-1.5 this week and could not reproduce the issue with 13.1.0 os5.

Note: See TracTickets for help on using tickets.