Opened 3 years ago

Closed 3 years ago

#10969 closed defect (fixed)

libertas data transmit lockup on resume

Reported by: greenfeld Owned by: dsd
Priority: normal Milestone: 11.2.0-final
Component: kernel Version: Development build as of this date
Keywords: Cc: godiard, erikos
Blocked By: Blocking:
Deployments affected: Action Needed: no action
Verified: no

Description

I had Metacity hang on an XO-1.5 with 11.2.0 os22, possibly due to Alt-Tab, using the View selection keys, or a Frame key action. But I don't quite know what caused it.

The window manager hung well enough that the Linux kernel took notice. Logs are attached with some SysRq data at the end.

Attachments (11)

messages (1.0 MB) - added by greenfeld 3 years ago.
Messages file with SysRq data
shell.log (167.5 KB) - added by greenfeld 3 years ago.
Shell.log of the same system taken after the hang occurred
hang2.tbz2 (164.0 KB) - added by greenfeld 3 years ago.
Bundle of logfiles from another hang occurance with a process list and dmesg buffer data
messages4 (483.7 KB) - added by greenfeld 3 years ago.
hang with 20110609.1612 debug kernel
os870custom.log (429.3 KB) - added by greenfeld 3 years ago.
Reproduction with June 25 custom kernel
29June.log (308.9 KB) - added by greenfeld 3 years ago.
Hang with June 29 test kernel
var-log-messages-with-20110630.1459 (79.9 KB) - added by godiard 3 years ago.
log.txt (121.9 KB) - added by godiard 3 years ago.
trace.log (473.0 KB) - added by erikos 3 years ago.
log with kernel: 2.6.35.13_xo1.5-20110630.1459.olpc.775fba3
messages-20110630.2147 (215.6 KB) - added by dsd 3 years ago.
crash log from gonzalo from latest test kernel
messages-20110704.2216 (132.1 KB) - added by godiard 3 years ago.

Change History (33)

Changed 3 years ago by greenfeld

Messages file with SysRq data

Changed 3 years ago by greenfeld

Shell.log of the same system taken after the hang occurred

comment:1 Changed 3 years ago by greenfeld

Dsd suggests this is related to #10972. In both cases we seem to be waiting on the global network lock, which would make this possibly a variation of yet another libertas hang.

comment:2 Changed 3 years ago by erikos

  • Owner changed from erikos to dsd

comment:3 Changed 3 years ago by dsd

  • Component changed from sugar to kernel
  • Summary changed from Metacity hang possibly due to Alt-Tab or Frame key action to rtnl deadlock related to libertas reset?

It also only seems to occur after the libertas card reset, according to the 2 logs that have been collected so far. This may be fixed with olpc-2.6.35 0c00f0e2d691

comment:4 Changed 3 years ago by dsd

One thing to verify next time you see this is if there are old/stale libertas threads around: ps ax | egrep -i "libertas|lbs"

Changed 3 years ago by greenfeld

Bundle of logfiles from another hang occurance with a process list and dmesg buffer data

comment:5 Changed 3 years ago by dsd

Debugged on IRC: NetworkManager is holding the rtnl, and the trace of that process (not included in the attachment due to truncation) is:

[60074.960014] ea737d9c 0000000b ea8703e0 00000000 ea7340b0 b043583d ea8fc678 ea8fc678
[60074.960014] Call Trace:
[60074.960014] [<ed13acdf>] __lbs_cmd+0xe8/0x17e [libertas]
[60074.960014] [<b054323c>] ? __debug_check_no_obj_freed+0x4f/0x137
[60074.960014] [<b043583d>] ? autoremove_wake_function+0x0/0x34
[60074.960014] [<ed147dc3>] lbs_get_wireless_stats+0x1c0/0x31d [libertas]
[60074.960014] [<ed138eb1>] ? lbs_cmd_copyback+0x0/0x26 [libertas]
[60074.960014] [<b0706fdb>] get_wireless_stats+0x42/0x44
[60074.960014] [<b0706fef>] iw_handler_get_iwstats+0x12/0x41
[60074.960014] [<b0707259>] ioctl_standard_call+0x207/0x296
[60074.960014] [<b071933e>] ? sub_preempt_count+0x81/0x8e
[60074.960014] [<b0715bc7>] ? mutex_lock_nested+0x223/0x23d
[60074.960014] [<b0669b9a>] ? __dev_get_by_name+0x6e/0x7f
[60074.960014] [<b0707395>] wext_handle_ioctl+0xad/0x18c
[60074.960014] [<b0706fdd>] ? iw_handler_get_iwstats+0x0/0x41
[60074.960014] [<b066c9c5>] dev_ioctl+0x600/0x614
[60074.960014] [<b0496ee6>] ? fd_install+0x26/0x4a
[60074.960014] [<b065cd21>] sock_ioctl+0x1fc/0x208
[60074.960014] [<b04a4513>] vfs_ioctl+0x33/0x91
[60074.960014] [<b065cb25>] ? sock_ioctl+0x0/0x208
[60074.960014] [<b04a4ad7>] do_vfs_ioctl+0x4a1/0x4cc
[60074.960014] [<b0499684>] ? fget_light+0x3f/0xd6
[60074.960014] [<b041b42c>] ? get_parent_ip+0xb/0x31
[60074.960014] [<b071933e>] ? sub_preempt_count+0x81/0x8e
[60074.960014] [<b04a4b48>] sys_ioctl+0x46/0x66
[60074.960014] [<b04026d0>] sysenter_do_call+0x12/0x31

i.e. smells very much like what I fixed yesterday, bad command timeout handling...

comment:6 Changed 3 years ago by dsd

New kernel for testing:

http://dev.laptop.org/~dsd/20110609/kernel-2.6.35.13_xo1.5-20110609.1612.olpc.32947a8.i586.rpm

This one logs the entry/exit points of lbs_get_wireless_stats. It won't solve the crash but will help us see what is going on when it next gets reproduced.

Changed 3 years ago by greenfeld

hang with 20110609.1612 debug kernel

comment:7 Changed 3 years ago by dsd

  • Summary changed from rtnl deadlock related to libertas reset? to rtnl deadlock due to libertas get_wireless_stats hang

That shows that its not related to the reset - as we were suspecting more and more - as no reset happened before that hang.

It also shows the hang was "fresh", not some artifact of some previous suspend/resume. The system resumed, started to run a command, and hung there and then.

comment:8 Changed 3 years ago by dsd

Please leave at least 1 machine running with CMD and THREAD lbs debugging enabled on any previous kernel.

Then please focus testing on this one, with no extra debugging enabled:
http://dev.laptop.org/~dsd/20110610/kernel-2.6.35.13_xo1.5-20110610.1637.olpc.d831727.i586.rpm

It will log (out of the box) info about queued, submitted and completed commands, and the timeout handler. If you can log a hang with these messages enabled, it should tell us a lot. If running on multiple systems, you could modify one of them with your rsyslog trick which seemed to help last time we were dealing with extra logging vs timing-sensitive issues.

comment:9 Changed 3 years ago by dsd

  • Cc godiard added

I tried to reproduce this by bombarding the card with commands during suspend/resume. I couldn't reproduce this but I did encounter other crashes. I've now fixed those, and they may help this case too, but I'm not really sure (and still can't reproduce it).

new kernel for testing:
http://xs-dev.laptop.org/~dsd/repos/f14-xo1.5/kernel-2.6.35.13_xo1.5-20110618.1108.olpc.39710f5.i586.rpm
(will also be in next build)

Remember the kernel install trick:
http://wiki.laptop.org/go/Kernel#Building_an_OLPC_kernel

Unlike the previous kernel, this one has no additional log messages (which are suspected to further mask the issue). So hopefully if the bug is still present it is as easy to reproduce as before.

comment:10 Changed 3 years ago by dsd

Thanks for testing - Sam reproduced this at http://dev.laptop.org/attachment/ticket/10748/870hang.tgz - so we're still not fixed

new diagnosis approach coming later

comment:11 Changed 3 years ago by dsd

new kernel for testing ASAP, please:

http://dev.laptop.org/~dsd/20110625/kernel-2.6.35.13_xo1.5-20110625.1447.olpc.e7e0a24.i586.rpm

this one should detect the hang, at which point it will print 3 lines of log messages (starting with a "MISSED TIMEOUT?" line), then hang the system as before.

please reproduce and post new logs. as these new logs will only appear when the hang happens, it shouldnt be any harder than before to reproduce the issue.

Changed 3 years ago by greenfeld

Reproduction with June 25 custom kernel

Changed 3 years ago by greenfeld

Hang with June 29 test kernel

comment:13 Changed 3 years ago by dsd

please retest with http://dev.laptop.org/~dsd/20110630/kernel-2.6.35.13_xo1.5-20110630.1459.olpc.775fba3.i586.rpm

includes the same method to catch the bug when it occurs, but it also includes a potential fix that hopefully means the bug does not reproduce

Changed 3 years ago by godiard

Changed 3 years ago by godiard

comment:14 Changed 3 years ago by dsd

thanks.

now please test this kernel and post logs when reproduced:

http://dev.laptop.org/~dsd/20110630/2/kernel-2.6.35.13_xo1.5-20110630.2147.olpc.f073a82.i586.rpm

no fixes have been made, but when the error occurs it will provide new info for analysis

Changed 3 years ago by erikos

log with kernel: 2.6.35.13_xo1.5-20110630.1459.olpc.775fba3

comment:15 Changed 3 years ago by dsd

  • Cc erikos added

Thanks Simon but I need a log with the newer kernel linked to 2 comments above

Changed 3 years ago by dsd

crash log from gonzalo from latest test kernel

comment:16 Changed 3 years ago by dsd

Changed 3 years ago by godiard

comment:17 Changed 3 years ago by dsd

  • Summary changed from rtnl deadlock due to libertas get_wireless_stats hang to libertas data transmit lockup on resume

Thanks! Theory confirmed. It revolves around the fact that the hardware can only accept one host-to-card transaction at once, and the driver is good (too good?) at obeying this.

This turns out to be similar to #10748. In #10748 we saw that sometimes, during resume, libertas randomly stops responding to commands. To fix that issue, we had to perform some fixes around the command detection timeout logic and add the ability to reset the SDIO card.

In this ticket, what we are seeing is that the card randomly stops responding to data transmissions. There is no such timeout mechanism to detect this hang for data TX. The driver then (correctly) blocks all other transmission to the card, waiting for the data transmission to complete, causing the commands executed here to hang.

Again, this is a firmware bug.

libertas-level workaround: Detect data transmission timeouts, and reset the card. This will result in a 5 second freezeup which is hard to avoid - I'd be uncomfortable with a lower timeout value.

comment:18 Changed 3 years ago by dsd

New kernel for testing: http://dev.laptop.org/~dsd/20110705/kernel-2.6.35.13_xo1.5-20110705.1813.olpc.f4add80.i586.rpm

I have removed the debug hacks and messages added in previous kernels, and added in a TX lockup handler in the least risky manner that I could.

You should now no longer see the issue. Or, if you see it, you will see a 5 second hang followed by a complete wifi reset.

When the workaround comes into play, the following will be printed in the kernel logs: SDIO TX lockup detected

When testing this, in addition to observing that this crash no longer happens, it would be good to also verify that this workaround is not resetting the card when it shouldn't. If you could run every few hours:

dmesg | grep "SDIO TX"

If anything comes back, please post the logs here, and I'll try to confirm whether or not it was right to reset the card in that situation.

Please test this kernel as much as possible, if there aren't any issues introduced by it it will be included in Thursday's build.

comment:19 Changed 3 years ago by dsd

  • Action Needed changed from diagnose to add to build

added the TX lockup detector in kernel-2.6.35.13_xo1.5-20110707.0809.olpc.da7074b

comment:20 Changed 3 years ago by dsd

#11050 is opened to track the real resolution to this issue, which would be a firmware/hardware fix.

comment:21 Changed 3 years ago by dsd

  • Action Needed changed from add to build to test in build

Test in 11.2.0 candidate 872.

comment:22 Changed 3 years ago by greenfeld

  • Action Needed changed from test in build to no action
  • Resolution set to fixed
  • Status changed from new to closed

Tested so far with 11.2.0 872, 873 & 874. I saw a few TX resets appear in the dmesg log with the network recovering afterward, so this appears to be working.

Note: See TracTickets for help on using tickets.