Opened 5 years ago

Closed 4 years ago

#9836 closed defect (fixed)

BUG() in if_sdio_handle_cmd()

Reported by: dsaxena Owned by: dsaxena
Priority: normal Milestone: 1.5-software-later
Component: kernel Version: 1.5-B3
Keywords: Cc: rchokshi
Blocked By: Blocking:
Deployments affected: Action Needed: test in build
Verified: no

Description

Seen while running a suspend/resume loop without NetworkManager and with a host pinging the XO-1.5 machine.

[  626.372744] Kernel BUG at d0d256d0 [verbose debug info unavailable]
[  626.372881] invalid opcode: 0000 [#1] PREEMPT
[  626.372881] last sysfs file: /sys/power/state
[  626.372881] Modules linked in: fuse uinput mousedev libertas_sdio psmouse serio_raw libertas lib80211 [last unloaded: scsi_wait_scan]
[  626.372881]
[  626.372881] Pid: 1440, comm: ksdioirqd/mmc1 Tainted: G        W  (2.6.31.6 #10) XO
[  626.372881] EIP: 0060:[<d0d256d0>] EFLAGS: 00010002 CPU: 0
[  626.372881] EIP is at if_sdio_interrupt+0x3f7/0x7c7 [libertas_sdio]
[  626.372881] EAX: 00000755 EBX: ca500000 ECX: c9b2c340 EDX: 00000001
[  626.372881] ESI: c9b2c340 EDI: 00000113 EBP: ca1ebf78 ESP: ca1ebf50
[  626.372881]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[  626.372881] Process ksdioirqd/mmc1 (pid: 1440, ti=ca1eb000 task=c9901a50 task.ti=ca1eb000)
[  626.372881] Stack:
[  626.372881]  00000001 00000286 c9b2c340 00000113 00000000 00000000 00000000 00000000
[  626.372881] <0> c99300c0 7fffffff ca1ebfac b067419f c99bd0e0 00000000 c99bd274 00000001
[  626.372881] <0> c99300c0 00000001 00000001 02000000 cae70e18 c99bd0e0 b06740bb ca1ebfe0
[  626.372881] Call Trace:
[  626.372881]  [<b067419f>] ? sdio_irq_thread+0xe4/0x1d1
[  626.372881]  [<b06740bb>] ? sdio_irq_thread+0x0/0x1d1
[  626.372881]  [<b04325c1>] ? kthread+0x6d/0x72
[  626.372881]  [<b0432554>] ? kthread+0x0/0x72
[  626.372881]  [<b0403103>] ? kernel_thread_helper+0x7/0x10
[  626.372881] Code: 78 1d 00 00 e8 c5 58 a3 df 31 d2 8b 4d e0 89 45 dc 8b 45 e0 80 b8 48 0b 00 00 00 0f 94 c2 8d 82 54 07 00 00 83 7c 81 0c 00 74 04 <0f> 0b eb fe 8b 75 e0 8b 4d e4 89 4c 86 0c 69 c2 08 09 00 00 8b
[  626.372881] EIP: [<d0d256d0>] if_sdio_interrupt+0x3f7/0x7c7 [libertas_sdio] SS:ESP 0068:ca1ebf50
[  626.372881] ---[ end trace ba4980c0365b189d ]---
[  626.372881] note: ksdioirqd/mmc1[1440] exited with preempt_count 1
[  626.557480] libertas leave: if_sdio_host_to_card(), ret 0
[  626.573277] BUG: spinlock lockup on CPU#0, lbs_main/1792, c9b2e0b8
[  626.573277] Pid: 1792, comm: lbs_main Tainted: G      D W  2.6.31.6 #10
[  626.573277] Call Trace:
[  626.573277]  [<b07591bf>] ? printk+0x14/0x1d
[  626.573277]  [<b0580634>] _raw_spin_lock+0xc0/0xe0
[  626.573277]  [<b075af70>] _spin_lock_irq+0x1d/0x20
[  626.573277]  [<cb9ac9e2>] lbs_thread+0x548/0x6b8 [libertas]
[  626.573277]  [<b041ca06>] ? default_wake_function+0x0/0x12
[  626.573277]  [<cb9ac49a>] ? lbs_thread+0x0/0x6b8 [libertas]
[  626.573277]  [<b04325c1>] kthread+0x6d/0x72
[  626.573277]  [<b0432554>] ? kthread+0x0/0x72
[  626.573277]  [<b0403103>] kernel_thread_helper+0x7/0x10

This same run included a few tx timeouts prior to this, but I'm not 100% sure whether they are related to this issue or are a separate issue:

[  511.005055] WARNING: at net/sched/sch_generic.c:246 dev_watchdog+0x102/0x17f()
[  511.012715] Hardware name: XO
[  511.015857] NETDEV WATCHDOG: eth0 (libertas_sdio): transmit queue 0 timed out
[  511.023434] Modules linked in: fuse uinput mousedev libertas_sdio psmouse serio_raw libertas lib80211 [last unloaded: scsi_wait_scan]
[  511.036390] Pid: 0, comm: swapper Not tainted 2.6.31.6 #10
[  511.042187] Call Trace:
[  511.044780]  [<b042066e>] warn_slowpath_common+0x6a/0x81
[  511.050428]  [<b06de896>] ? dev_watchdog+0x102/0x17f
[  511.055676]  [<b04206c3>] warn_slowpath_fmt+0x29/0x2c
[  511.061052]  [<b06de896>] dev_watchdog+0x102/0x17f
[  511.066124]  [<b075ae47>] ? _spin_unlock_irqrestore+0x1c/0x31
[  511.072215]  [<b0580b96>] ? debug_object_deactivate+0x9a/0xc2
[  511.078305]  [<b04287ac>] ? cascade+0x50/0x63
[  511.082936]  [<b06de794>] ? dev_watchdog+0x0/0x17f
[  511.088001]  [<b042890b>] run_timer_softirq+0x14c/0x1a2
[  511.093595]  [<b0425433>] __do_softirq+0xaf/0x16e
[  511.098611]  [<b0425384>] ? __do_softirq+0x0/0x16e
[  511.103706]  <IRQ>  [<b0425173>] ? irq_exit+0x30/0x67
[  511.109100]  [<b0403b41>] ? do_IRQ+0x78/0x8b
[  511.113635]  [<b0402f69>] ? common_interrupt+0x29/0x30
[  511.119083]  [<b05d800a>] ? acpi_idle_enter_bm+0x23f/0x26a
[  511.124913]  [<b066e1c9>] ? cpuidle_idle_call+0x5f/0x96
[  511.130454]  [<b0401bc3>] ? cpu_idle+0x3b/0x68
[  511.135193]  [<b07508cb>] ? rest_init+0x6b/0x6d
[  511.140037]  [<b08fb6ba>] ? start_kernel+0x29a/0x29f
[  511.145317]  [<b08fb06b>] ? i386_start_kernel+0x6b/0x72
[  511.150852] ---[ end trace ba4980c0365b189c ]---
[  511.155735] libertas: tx watch dog timeout

Attachments (4)

overnight-wifi.log (43.5 KB) - added by cjb 5 years ago.
pingflood-crash-during-suspend (361.1 KB) - added by cjb 5 years ago.
pingflood-crash-on-resume (277.8 KB) - added by cjb 5 years ago.
pingflood-crash-on-resume2 (118.2 KB) - added by cjb 5 years ago.

Download all attachments as: .zip

Change History (33)

comment:1 Changed 5 years ago by Quozl

  • Action Needed changed from never set to diagnose
  • Component changed from not assigned to kernel
  • Milestone changed from Not Triaged to 1.5-software-update
  • Owner set to dsaxena

System dead or continuing?

comment:2 Changed 5 years ago by Quozl

  • Blocking 9858 added

comment:3 Changed 5 years ago by dsaxena

System dead.

comment:4 Changed 5 years ago by Quozl

  • Blocking 9858 removed

Remove as blocking software manufacturing. Per cjb, ed, dsd, reuben, quozl.

comment:5 Changed 5 years ago by cjb

  • Cc rchokshi added

Going to suspend while pingflooding was enough to reproduce this on the second suspend -- attaching the full log. The (B3, ECOs) system temporarily came back from resume, and then crashed hard with the backlight still dimmed a few seconds later.

Questions:

  • we first see a timeout waiting for hardware interrupt -- did we fail to quiesce some state on the way down?
  • mmc1:0001: error -110 reading SDIO_CCCR_INTx suggests an SD-level failure/hang at that point
  • Which line of code is if_sdio_interrupt()+0x3f7? The kernel is the os63 build, 2.6.31_xo1.5-20091215.1440.1.olpc.9815d58

We should ask for Marvell's help with this one, too.

Changed 5 years ago by cjb

comment:6 Changed 5 years ago by cjb

(The end of the log is an intentional power cycle, not a reboot.)

comment:7 Changed 5 years ago by cjb

Well, I can answer the last of those three questions:

% gdb drivers/net/wireless/libertas/if_sdio.o                                 
(gdb) list *(if_sdio_interrupt)+0x3f7
0x1711 is in if_sdio_handle_cmd (drivers/net/wireless/libertas/if_sdio.c:220).
215		}
216	
217		spin_lock_irqsave(&priv->driver_lock, flags);
218	
219		i = (priv->resp_idx == 0) ? 1 : 0;
220		BUG_ON(priv->resp_len[i]);
221		priv->resp_len[i] = size;
222		memcpy(priv->resp_buf[i], buffer, size);
223		lbs_notify_command_response(priv, i);
224	

comment:8 Changed 5 years ago by dsaxena

  • Summary changed from Invalid opcode in Libertas if_sdio_interrupt() routine to BUG() in if_sdio_handle_cmd()

OK, so we're hitting the BUG_ON(priv->resp_len[i]); BUG() simply generates an invalid opcode because we have CONFIG_DEBUG_BUGVERBOSE disabled.

Doing a cscope search shows that priv->resp_len should be cleared after a command is processed in lbs_thread(). The driver can only handle one command response at a time (i.e., there is no response queue equivalent to the outbound command queue), so this looks like some sort of race: we're getting responses at a high rate (which makes sense given your pingflood) and have not completed processing one response before the next one comes in.
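The intended handoff can be sketched with a small model. This is illustrative only, not the kernel source: the struct and function names merely mirror the driver's priv->resp_idx / priv->resp_len[2] bookkeeping. With strictly one outstanding command, the "other" slot the interrupt handler picks is always free, so the BUG_ON condition never fires:

```c
#include <assert.h>

/* Illustrative model of the driver's response bookkeeping (names echo
 * priv->resp_idx and priv->resp_len[2], but this is not kernel code). */
struct model {
    int resp_idx;       /* slot last handed to the worker thread */
    int resp_len[2];    /* nonzero = slot holds an unprocessed response */
};

/* Models the check in if_sdio_handle_cmd(): pick the "other" slot and
 * demand that it is free.  Returns -1 where the real driver would hit
 * BUG_ON(priv->resp_len[i]). */
static int irq_store(struct model *m, int len)
{
    int i = (m->resp_idx == 0) ? 1 : 0;
    if (m->resp_len[i])
        return -1;              /* BUG_ON would fire here */
    m->resp_len[i] = len;
    m->resp_idx = i;            /* worker is notified about slot i */
    return 0;
}

/* Models lbs_thread() consuming the response: resp_len is cleared. */
static void thread_consume(struct model *m)
{
    m->resp_len[m->resp_idx] = 0;
}
```

As long as every irq_store() is followed by a thread_consume() before the next response arrives, the slots alternate cleanly; the crash below corresponds to that ordering being violated.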

comment:9 Changed 5 years ago by cjb

Makes sense. Could you report to the libertas-dev list, please?

comment:10 follow-up: Changed 5 years ago by dsaxena

Yes, but would also like to get some more data.

1) Can you reproduce this regularly with the ping flood?

2) Can you reproduce with /sys/module/libertas/parameters/libertas_debug set to 0x404000 and upload that log?

comment:11 in reply to: ↑ 10 Changed 5 years ago by cjb

Replying to dsaxena:

Yes, but would also like to get some more data.

1) Can you reproduce this regularly with the ping flood?

Yes, always within one or two suspends.

2) Can you reproduce with /sys/module/libertas/parameters/libertas_debug set to 0x404000 and upload that log?

Okay. pingflood-crash-during-suspend shows a crash on the way into suspend. pingflood-crash-on-resume is a run where I delayed starting the pingflood until after suspend had happened. pingflood-crash-on-resume2 is another run with the same parameters. It's all easy to reproduce.

Changed 5 years ago by cjb

Changed 5 years ago by cjb

Changed 5 years ago by cjb

comment:12 follow-ups: Changed 5 years ago by Quozl

Symptom reproduced three consecutive times on the same unit here, with os63, only through turning the system on and waiting for the symptom to occur. It did not occur previously.

(The difference between today and yesterday is the presence of an ad-hoc network formed between a WRT54 and an XO-1).

On the third time a serial console was attached, and the stack traces matched #9836.

There was no response to momentary power button after the second BUG.

comment:13 Changed 5 years ago by dsaxena

pingflood-crash-during-suspend exhibits a different behaviour, but it all appears to be related to a race in command processing.

In this case, we are seeing {{{BUG: unable to handle kernel paging request at 00100100}}} in lbs_stop_card(), and per <linux/poison.h>:

/*
 * These are non-NULL pointers that will result in page faults
 * under normal circumstances, used to verify that nobody uses
 * non-initialized list entries.
 */
#define LIST_POISON1  ((void *) 0x00100100)

The other two logs, from pingflood on suspend, didn't hit a BUG() (though they may if continued) but just start to time out and give us list-corruption messages.

Will investigate and also email Marvell folks.

comment:14 in reply to: ↑ 12 Changed 5 years ago by Quozl

Replying to Quozl:

Symptom reproduced three consecutive times on the same unit here, with os63, only through turning the system on and waiting for the symptom to occur. It did not occur previously.

Just a reminder that there was no high level of wireless activity in this situation; it just happened by itself on resume.

comment:15 Changed 5 years ago by cjb

  • Priority changed from high to blocker

With a normal (non-pingflood) ping running, got this after only 30 resumes, so this will definitely block -update.

comment:16 in reply to: ↑ 12 Changed 5 years ago by Quozl

Replying to Quozl:

(The difference between today and yesterday is the presence of an ad-hoc network formed between a WRT54 and an XO-1).

Upon further testing, the symptom still occurred with the ad-hoc network absent (the AP and the XO-1 fully powered off). That excludes the network as a contributing cause.

The symptom (the BUG at if_sdio_interrupt+0x3f7/0x7ca [libertas_sdio]) is solidly reproducible here, five out of five tries on two particular laptops, using only "boot, then wait" with os64.

These two laptops are XO-1.5 B2 with cardboard tag SD PWR ECO and SER PORT JUMPER. The symptom occurs with or without the serial cable to the logging host being connected.

The other two XO-1.5 B2 without serial port are not reproducing this symptom.

comment:17 Changed 5 years ago by Quozl

  • Milestone changed from 1.5-software-later to 1.5-software-update

comment:18 Changed 5 years ago by dsd

Having trouble reproducing this on my XO-1.5 B3 on os103 because I'm quickly running into other issues which don't seem to have been reported previously.

I'm booting an XO, connecting to my wifi network, and running a ping from another computer. I have also experimented with "ping -f" and with running a "while true; do echo mem > /sys/power/state; sleep 0.4; done" loop on the XO.

Each time, within a few minutes my test is halted by #9855, #9959, #9960, #9962 or #9963. #9961 is also slowing things down. Am I really the first to see these issues?

I am also seeing a lot of

[  135.032106] libertas: EVENT: unknown event id 1

Do we know what this means?

comment:19 Changed 5 years ago by dsd

OK, I can reproduce this after all and I have an explanation.

The libertas driver keeps 2 buffers for responses and alternates between them for incoming responses. That way the driver can be processing 1 response while another one arrives.

When a command response arrives, it is put into the "other" buffer and then the main worker thread is notified that there is a new response. The main worker thread processes the response, then clears resp_len to mark the buffer as free.

The driver makes no guarantee (and does not attempt to verify) that the previous response has been fully processed before the next one arrives. Dan Williams says that this will not happen at http://lists.infradead.org/pipermail/libertas-dev/2009-December/002921.html:

The driver only handles one command at a time because the firmware spec also says that the firmware only handles one command at a time.

And if the firmware behaves in this way, it seems fine. Commands are only sent by the worker thread, and responses are processed first, so we're never going to send 2 commands at once.

What happens in the failure case is:

  1. command response interrupt arrives, gets saved into slot 1
  2. interrupt handler notifies main thread that new response is available in slot 1
  3. before the thread has had time to do anything, another command response interrupt arrives, response gets saved into slot 0
  4. interrupt handler notifies main thread that new response is available in slot 0
  5. main thread wakes up, processes slot 0, clears resp_len[0] to mark it as completed
  6. command response interrupt arrives; it should be saved in slot 1, but because the command response received in step 1 was never processed (and hence never cleared), we hit this bug

So, according to Dan Williams's explanation, this is a firmware bug. I'll now attempt to dump the contents of those 2 spurious command responses, as well as confirm which outgoing commands might have caused them.
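The six steps above can be replayed in a minimal model (names are illustrative, not the driver's exact code): two responses arrive before the worker runs, the worker only processes the slot it was last told about, and the third response trips the BUG_ON condition on the stale slot.

```c
#include <assert.h>

/* Minimal model of the two alternating response slots; this is a
 * sketch of the mechanism, not the libertas source. */
struct slots {
    int resp_idx;       /* slot last announced to the worker */
    int resp_len[2];    /* nonzero = unprocessed response */
};

/* Models the interrupt path: store into the "other" slot, returning -1
 * where the real driver hits BUG_ON(priv->resp_len[i]). */
static int response_irq(struct slots *s, int len)
{
    int other = (s->resp_idx == 0) ? 1 : 0;
    if (s->resp_len[other])
        return -1;                  /* step 6: the BUG */
    s->resp_len[other] = len;
    s->resp_idx = other;
    return 0;
}
```

Replaying steps 1-6 with this model: the first response lands in slot 1, the second in slot 0, the worker wakes and clears only slot 0, and the third response then finds slot 1 still occupied.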

comment:20 Changed 5 years ago by dsd

sequence of events:

  • cmd 6 (scan) submitted
  • cmd 6 response received
  • system starts to suspend
  • cmd 43 (host sleep cfg) submitted
  • cmd 43 response received
  • cmd 45 (host sleep activate) submitted
  • cmd 45 response received
  • CPU goes to sleep
  • CPU woken up by incoming packet
  • event 1 arrives (unknown)
  • event 18 arrives (MACREG_INT_CODE_HOST_AWAKE)
    • NOTE: this event does not arrive on every resume. in fact, it can be seen as the cause of the crash - when it arrives, the crash happens soon after
  • in response to event 18, lbs_send_confirmwake() directly submits cmd 44 (wakeup confirm) (without going through the normal queuing/threading infrastructure) and does not wait for response
  • cmd 6 (scan) submitted as normal
  • cmd 44 response received (wakeup confirm)
    • full response contents: 44 80 08 00 ae 00 00 00
    • not processed due to what happens immediately after...:
  • cmd 6 response received
  • cmd 43 (host sleep cfg) submitted
  • cmd 43 response received but this triggers the BUG() due to attempting to overwrite the unprocessed wakeup confirm response from above

so, 2 issues at hand:

  1. firmware only generates event 18 sometimes - shouldn't it be consistent?
  2. handling of event 18 submits cmd 44 without expecting or waiting for a response
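Issue 2 can be illustrated with a tiny accounting sketch (hypothetical names, not driver code): the normal queued path waits for each response before submitting the next command, so at most one response is ever outstanding; the direct lbs_send_confirmwake() path fires cmd 44 without waiting, so two responses can be in flight at once and race the worker thread.

```c
#include <assert.h>

/* Counts responses the firmware owes us; purely illustrative. */
struct cmd_state {
    int outstanding;    /* responses currently in flight */
    int max_seen;       /* high-water mark */
};

static void submit(struct cmd_state *st)
{
    st->outstanding++;
    if (st->outstanding > st->max_seen)
        st->max_seen = st->outstanding;
}

static void response(struct cmd_state *st)
{
    st->outstanding--;
}
```

Under the queued discipline the high-water mark stays at 1; on the resume path described above it reaches 2, which is exactly the situation the two alternating buffers cannot absorb once the worker falls behind.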

comment:21 Changed 5 years ago by dsd

Followed up here: http://lists.infradead.org/pipermail/libertas-dev/2010-January/002939.html

Sent summary to Marvell.

Will push an ugly workaround to olpc-2.6.31.

comment:22 Changed 5 years ago by Quozl

Tested in os105, symptom no longer reproducible using my previously successful method.

comment:23 follow-up: Changed 5 years ago by Quozl

  • Priority changed from blocker to high

Given this ugly workaround continues to function, I'm lowering the priority of this ticket. What is the status; are we still waiting for the final fix?

comment:24 in reply to: ↑ 23 ; follow-up: Changed 5 years ago by dsaxena

Replying to Quozl:

Given this ugly workaround continues to function, I'm lowering the priority of this ticket. What is the status; are we still waiting for the final fix?

Looks like Marvell never responded. I'll ping them again, but it looks like the clean fix would be to issue command 0x44 using the normal queuing mechanism.

comment:25 in reply to: ↑ 24 Changed 5 years ago by cjb

Replying to dsaxena:

Looks like Marvell never responded. I'll ping them again, but it looks like the clean fix would be to issue command 0x44 using the normal queuing mechanism.

Isn't that what the patch they sent you does? olpc-2.6-The-patch-fixes-olpc-bug-9836-BUG-in-if_sdio_ha.patch, in http://lists.laptop.org/private/techteam/2010-January/004240.html

comment:26 Changed 5 years ago by dsaxena

Yep, sure is. Looks like it got lost in my backlog of emails. :/

comment:27 Changed 5 years ago by Quozl

  • Action Needed changed from diagnose to code
  • Milestone changed from 1.5-software-update to 1.5-software-later
  • Priority changed from high to normal

Review.

The hack is working well. Pushing out to -later since we can't afford the risk of this right now.

I've found the patch from the mailing list, dated 7th Jan, hash 9f3a795, by Amitkumar Karwar, forwarded by Bing Zhao on 13th Jan.

I've not been able to find this patch in kernel git. The first hunk of the patch isn't yet applied to lbs_queue_cmd in cmd.c.

Let me know if you need a copy of the patch.

comment:28 Changed 4 years ago by pgf

it would be really useful to include test scripts used to reproduce bugs and verify fixes in trac along with the bug.

i've built without our "ugly workaround" and with the marvell patch, and things seem to be working. i can recreate (i think) the original symptom(s) with the following:

from another machine, do a ping flood of the laptop while it runs this:

[root@xo-a7-2a-32 olpc]# /etc/init.d/NetworkManager stop
Stopping NetworkManager daemon: [  OK  ]
[root@xo-a7-2a-32 olpc]# stop powerd
powerd (stop) running, process 2284
powerd (stop) pre-stop, (main) process 2284
powerd (stop) stopping, process 2284
powerd (stop) killed, process 2284
[  150.053173] dcon_freeze_store: 0
powerd (stop) post-stop
powerd (stop) waiting
[root@xo-a7-2a-32 olpc]# cat netsetup
ifconfig eth0 down
dhclient -r eth0
ifconfig eth0 up
iwconfig eth0 essid "ml1cc"
iwconfig eth0 mode Managed
dhclient eth0
[root@xo-a7-2a-32 olpc]# while :; do echo awake;sleep 2;./netsetup; sleep 3; echo asleep; ethtool -s eth0 wol u; rtcwake -s 4 -m mem; done
awake
asleep
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Wed Jun 23 17:52:19 2010

+rawake
asleep
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Wed Jun 23 17:52:35 2010

+rawake
asleep
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Wed Jun 23 17:52:48 2010

þ+rawake
asleep
rtcwake: assuming RTC uses UTC ...

i'm seeing what's possibly a new symptom, though, and that is that after running this for a while on XO-1.5, the wireless capabilities of eth0 disappear altogether. eth0 doesn't go away -- just its wireless extensions.

comment:29 Changed 4 years ago by pgf

  • Action Needed changed from code to test in build
  • Resolution set to fixed
  • Status changed from new to closed

the "possibly new symptom" referred to above has gone away, with the removal of a debug change i'd made previously.

so i believe the marvell patch fixes this bug. it's been pushed, and should be in build 127 or later. (current kernel version is 8db05fd)

Note: See TracTickets for help on using tickets.