Ticket #11357 (closed defect: fixed)

Opened 2 years ago

Last modified 2 years ago

Boot freezing on third clock dot

Reported by: tonyforster Owned by: saadia
Priority: blocker Milestone: 1.75-software
Component: kernel Version: Development build as of this date
Keywords: Cc: dsd, greenfeld
Action Needed: no action Verified: no
Deployments affected: Blocked By:
Blocking:

Description

XO-1.75 OS882

Since changing to OS882 I have had a number of hung boots on the third dot of the clock. I suspect that this might be related to #11211 because it got a lot worse at the same time.

I have not been able to catch it with a noisy boot (tick)

Attachments

DSCN0273.JPG.jpe (1.8 MB) - added by tonyforster 2 years ago.
hung noisy boot
failedinit.log (19.2 kB) - added by greenfeld 2 years ago.
Log of failed initialization; modprobe scsi_wait_scan a possible culprit
bug11357-cap1.txt (23.3 kB) - added by saadia 2 years ago.
bug11357-cap2.txt (60.1 kB) - added by saadia 2 years ago.
Another repetition of the bug with keyboard debug flag on
0001-olpc-ec-1-75-avoid-power_supply_get_by_name-in-IRQ-c.patch (2.4 kB) - added by dsd 2 years ago.
screenlog.0 (24.6 kB) - added by godiard 2 years ago.
thirddothang.log (57.8 kB) - added by greenfeld 2 years ago.
Log with information requested by comment:17
screenlog.20111107.1918 (300.1 kB) - added by godiard 2 years ago.
psinithang-lockdebug.txt (52.0 kB) - added by greenfeld 2 years ago.
Log of this hang using a kernel with lock debugging enabled to get more info
FAIL.LOG (1.1 MB) - added by garysu 2 years ago.
psmouse.c: resync failed, issuing reconnect request
fa.log (192.8 kB) - added by garysu 2 years ago.
log with enable olpc-kbd debugging
log.zip (42.1 kB) - added by garysu 2 years ago.
all logs in /var/log, fs.log = run dmesg, for lost touchpad function

Change History

Changed 2 years ago by tonyforster

hung noisy boot

Changed 2 years ago by greenfeld

  • next_action changed from never set to diagnose
  • cc dsd added
  • component changed from not assigned to kernel
  • priority changed from normal to high
  • version changed from not specified to Development build as of this date
  • milestone changed from Not Triaged to 1.75-software

I have seen this as well a lot in OLPC-SF.

Mostly it seems to happen if the XO has not been used for a few hours, although I showed dsd a loop where I got it to happen a few times, with the following boot failing to initialize the touchpad. Removing the psmouse module and reinserting it solves the subsequent mouse issue.

Reproduction (?):

  1. Power off a B1 XO-1.75 w/ HS keyboard/AVC, no need to remove the battery although I've also tried that.
  2. Let it sit for a few hours.
  3. Turn the XO on and try to let it boot.

Need to reproduce with a serial adapter attached.

Changed 2 years ago by greenfeld

Log of failed initialization; modprobe scsi_wait_scan a possible culprit

Changed 2 years ago by greenfeld

  • cc greenfeld added

Changed 2 years ago by saadia

  • owner set to saadia

I have seen this happen only on my "crunchy" B1, i.e. SKU 199. Paul may have seen the same phenomenon on his crunchy B1. When additional info is added, please specify which B1 you have. I am able to narrow the window for when this problem began to the time between 881 and October 18th.

Changed 2 years ago by tonyforster

My B1 has the clicky HS keyboard

Changed 2 years ago by dsd

Next time this happens please leave the system for 10 minutes or so, giving the kernels lockup detection a chance to kick in. If nothing gets spewed over serial in that time, please run:

echo  't' > /proc/sysrq-trigger

Changed 2 years ago by saadia

Changed 2 years ago by saadia

I reproduced this, waited 10 minutes and than ran the above command. Output is attached in bug11357-cap1.txt

Changed 2 years ago by dsd

Thanks Saadia. As you note, the bug is that we have a hang here: psmouse_connect->psmouse_initialize->psmouse_set_rate->ps2_command

scsi-wait-scan was simply the first thing to choke on the fact that our mouse init is screwing up. I just committed a patch that will allow us to get further info on this. See instructions in #11363.

Changed 2 years ago by saadia

Another repetition of the bug with keyboard debug flag on

Changed 2 years ago by dsd

please enable in your kernel config:

CONFIG_DEBUG_RT_MUTEXES
CONFIG_DEBUG_SPINLOCK
CONFIG_DEBUG_MUTEXES
CONFIG_DEBUG_LOCK_ALLOC
CONFIG_PROVE_LOCKING

Then when it next hangs, if it doesn't spew anything new automatically, run:

echo d > /proc/sysrq-trigger

Changed 2 years ago by dsd

Saadia tested and found that lockdep is unhappy indeed (even on successful boots):

[  714.756861] =================================
[  714.756866] [ INFO: inconsistent lock state ]
[  714.766984] 3.0.0-00173-gb67f6bf-dirty #671
[  714.771134] ---------------------------------
[  714.771134] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[  714.781424] swapper/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
[  714.781424]  (&(&k->k_lock)->rlock){?.+...}, at: [<c036d950>] klist_next+0x18/0xb4
[  714.786350] {HARDIRQ-ON-W} state was registered at:
[  714.793913]   [<c006ef6c>] __lock_acquire+0x60c/0x1724
[  714.803862]   [<c00704f0>] lock_acquire+0x60/0x74
[  714.803862]   [<c0377eb8>] _raw_spin_lock+0x40/0x50
[  714.808531]   [<c036d950>] klist_next+0x18/0xb4
[  714.813392]   [<c01c2da0>] bus_for_each_dev+0x58/0x84
[  714.817897]   [<c01c3530>] bus_add_driver+0xbc/0x250
[  714.822919]   [<c01c43f8>] driver_register+0xa8/0x138
[  714.832878]   [<c002b3dc>] do_one_initcall+0x90/0x164
[  714.837911]   [<c00089a8>] kernel_init+0x74/0x110
[  714.842588]   [<c0031cb8>] kernel_thread_exit+0x0/0x8
[  714.847613] irq event stamp: 113584
[  714.847613] hardirqs last  enabled at (113581): [<c0031d18>] default_idle+0x24/0x2c
[  714.851067] hardirqs last disabled at (113582): [<c00308b4>] __irq_svc+0x34/0xac
[  714.858684] softirqs last  enabled at (113584): [<c004b1f0>] irq_enter+0x44/0x70
[  714.866033] softirqs last disabled at (113583): [<c004b1e4>] irq_enter+0x38/0x70
[  714.880739] 
[  714.880739] other info that might help us debug this:
[  714.887218]  Possible unsafe locking scenario:
[  714.887218] 
[  714.887225]        CPU0
[  714.893092]        ----
[  714.895513]   lock(&(&k->k_lock)->rlock);
[  714.897935]   <Interrupt>
[  714.901923]     lock(&(&k->k_lock)->rlock);
[  714.904516] 
[  714.908682]  *** DEADLOCK ***
[  714.908682] 
[  714.914554] no locks held by swapper/0.
[  714.918361] 
[  714.918361] stack backtrace:
[  714.918365] [<c003655c>] (unwind_backtrace+0x0/0x11c) from [<c0372edc>] (print_usage_bug.part.28+0x208/0x264)
[  714.922709] [<c0372edc>] (print_usage_bug.part.28+0x208/0x264) from [<c006e770>] (mark_lock+0x418/0x608)
[  714.932564] [<c006e770>] (mark_lock+0x418/0x608) from [<c006eee8>] (__lock_acquire+0x588/0x1724)
[  714.941986] [<c006eee8>] (__lock_acquire+0x588/0x1724) from [<c00704f0>] (lock_acquire+0x60/0x74)
[  714.950714] [<c00704f0>] (lock_acquire+0x60/0x74) from [<c0377eb8>] (_raw_spin_lock+0x40/0x50)
[  714.959530] [<c0377eb8>] (_raw_spin_lock+0x40/0x50) from [<c036d950>] (klist_next+0x18/0xb4)
[  714.968087] [<c036d950>] (klist_next+0x18/0xb4) from [<c01c485c>] (class_dev_iter_next+0x10/0x40)
[  714.976478] [<c01c485c>] (class_dev_iter_next+0x10/0x40) from [<c01c4e4c>] (class_find_device+0x8c/0xb4)
[  714.994731] [<c01c4e4c>] (class_find_device+0x8c/0xb4) from [<c02426e0>] (power_supply_get_by_name+0x1c/0x34)
[  715.004593] [<c02426e0>] (power_supply_get_by_name+0x1c/0x34) from [<c01cfd6c>] (olpc_ec_1_75_irq+0x254/0x388)
[  715.004593] [<c01cfd6c>] (olpc_ec_1_75_irq+0x254/0x388) from [<c008a548>] (handle_irq_event_percpu+0x30/0x174)
[  715.014541] [<c008a548>] (handle_irq_event_percpu+0x30/0x174) from [<c008a6c8>] (handle_irq_event+0x3c/0x5c)
[  715.024482] [<c008a6c8>] (handle_irq_event+0x3c/0x5c) from [<c008c2b0>] (handle_level_irq+0xb8/0xe8)
[  715.043331] [<c008c2b0>] (handle_level_irq+0xb8/0xe8) from [<c008a100>] (generic_handle_irq+0x20/0x30)
[  715.043331] [<c008a100>] (generic_handle_irq+0x20/0x30) from [<c002b060>] (asm_do_IRQ+0x60/0x84)
[  715.052578] [<c002b060>] (asm_do_IRQ+0x60/0x84) from [<c00308e0>] (__irq_svc+0x60/0xac)
[  715.061311] Exception stack(0xc04c3f80 to 0xc04c3fc8)
[  715.069256] 3f80: 00000001 00000004 c04c3fb0 c0031cf4 c04c2000 c04c869c c04f30c4 c04c8694
[  715.074277] 3fa0: 00004059 560f5815 00000000 00000000 00000000 c04c3fc8 c0070ccc c0031e64
[  715.082400] 3fc0: 20000013 ffffffff
[  715.090518] [<c00308e0>] (__irq_svc+0x60/0xac) from [<c0031e64>] (cpu_idle+0x50/0xac)
[  715.101768] [<c0031e64>] (cpu_idle+0x50/0xac) from [<c00088e0>] (start_kernel+0x29c/0x2f0)
[  715.101768] [<c00088e0>] (start_kernel+0x29c/0x2f0) from [<0000803c>] (0x803c)

power_supply_get_by_name() uses klists which take a lock with spin_lock() (i.e. takes a lock without disabling IRQs - so in any IRQ handler you may be operating with that lock already hold). Therefore it seems unsafe to use anything klist-related from IRQ context. Also clarified in http://kerneltrap.org/mailarchive/linux-kernel/2010/4/20/4560708

Changed 2 years ago by saadia

After applying the patch, I'm not experiencing the deadlocks on every boot. This patch seems like it fixes the freeze on third dot problem.

Changed 2 years ago by dsd

  • next_action changed from diagnose to add to build
  • milestone changed from 1.75-software to 11.3.0

fixed in arm-3.0-wip 5eff2cf

Changed 2 years ago by dsd

  • next_action changed from add to build to test in build

test in 11.3.0 candidate build 883

Changed 2 years ago by pgf

  • milestone changed from 11.3.0 to 1.75-software

Changed 2 years ago by tonyforster

still faulty os883

Changed 2 years ago by godiard

I reproduced the error with os883. After a lot of tries, pressing many keys and moving the finger over the touchpad frozen at the third dot.

Changed 2 years ago by Quozl

  • next_action changed from test in build to diagnose

Changed 2 years ago by dsd

Next time this happens please connect serial console and leave the system for 10 minutes or so, giving the kernel's lockup detection a chance to kick in. If nothing gets spewed over serial in that time, please run:

pstree -a -p
echo  't' > /proc/sysrq-trigger

Changed 2 years ago by godiard

Attached is a log with the information requested.

The way to trigger the error was start the xo, and move the finger over the touchpad quickly. Doers not happen every time, I have tried almost 10 times until could reproduce it.

Changed 2 years ago by godiard

Changed 2 years ago by greenfeld

Log with information requested by comment:17

Changed 2 years ago by godiard

Today, with kernel 20111107.1918.olpc.4fbb7db the xo frozen in the third dot without help from my part, new log attached.

Changed 2 years ago by godiard

Changed 2 years ago by martin.langhoff

  • priority changed from high to blocker

This still hits, and quite frequently. Will impact on runin and general testing.

Changed 2 years ago by martin.langhoff

Daniel, any further debugging ideas? Should we build with psmouse=y?

Changed 2 years ago by dsd

The above log shows that mouse initialization has hung (as before).

Someone needs to repeat the same diagnosis as before, examining the logs at each stage: enable olpc-kbd debugging (as in #11363), enable lockdep

The root cause of this is possibly shared with the other bugs we have related to communication problems with keyboard/mouse, so attacking those other issues may also be a fruitful approach.

Changed 2 years ago by Quozl

Also reproduces on 11.3.1 os14.

Changed 2 years ago by martin.langhoff

A kernel rpm that should allow testing of this http://dev.laptop.org/~martin/xo1.75kernels/kernel-3.0.0_xo1.75-20111201.1013.olpc.8ab3406.armv7l.rpm

It preserves psmouse as a module. With development kernels (which usually have psmouse built-in) it seems to trigger less often.

Changed 2 years ago by greenfeld

Log of this hang using a kernel with lock debugging enabled to get more info

Changed 2 years ago by pgf

based on sysrq output showing the hung process, i've committed a possible fix in 089fe213ceb780a0b4e0ecc88fadd8c6f43cd935

the patch is correct, but we won't know if it's the real deal for some time. will probably appear in os17.

Changed 2 years ago by martin.langhoff

  • next_action changed from diagnose to test in build

In OS17 and later.

Changed 2 years ago by garysu

psmouse.c: resync failed, issuing reconnect request

Changed 2 years ago by garysu

sometime the touchpad will lost function under sugar.(attachment:fail.log, os18/q4c07) boot freezing seem fixed till now(reboot over 3hrs)

Changed 2 years ago by garysu

log with enable olpc-kbd debugging

Changed 2 years ago by garysu

sorry, serial port's log does not include the olpc-kdb debugging message. will get again with dmesg.

Changed 2 years ago by garysu

all logs in /var/log, fs.log = run dmesg, for lost touchpad function

Changed 2 years ago by greenfeld

It looks from garysu's log like he was testing this with os18. I haven't seen this freeze with os19, but I still am losing control over an AVC touchpad at boot on a SKU 202 machine 50% of the time.

Changed 2 years ago by avatar

According to various fail log I've seen, olpc-keyboard usually fails to get reply when mmp-camera module chimes in:

[    8.835955] olpc_keyboard: olpc_kbd_pad_write which=100 val=3
[    8.838233] olpc_keyboard: olpc_input_rx 1fa
[    8.838335] olpc_keyboard: olpc_kbd_pad_write which=100 val=e6
[    8.877282] mmp-camera mmp-camera.0: Release, 0 frames, 0 singles, 0 delivered
[    9.036964] input: FSPPS/2 Sentelic FingerSensingPad as /devices/platform/olpc-kbd.0/serio1/input/input9
[    9.066884] olpc_keyboard: olpc_kbd_pad_write which=100 val=f4
[    9.207564] olpc_keyboard: olpc_input_rx 1fa
[    9.934554] libertas_sdio mmc1:0001:1: (unregistered net_device): xx:xx:xx:xx:xx:xx, fw 9.70.3p36, cap 0x000003a3
[    9.957475] libertas_sdio mmc1:0001:1: wlan0: Marvell WLAN 802.11 adapter

I've tried to do overnight rebooting test(total 1320 reboots till now) without mmp-camera and libertas_sdio enabled and it appears to me that there's no PS/2-command-issued-without-reply anymore. Perhaps this will help on the FSP losing control problem observed by greenfeld?

Changed 2 years ago by Quozl

  • status changed from new to closed
  • next_action changed from test in build to no action
  • resolution set to fixed

we believe this symptom (boot freezing on third clock dot) is fixed in the current build, os20.

Note: See TracTickets for help on using tickets.