Ticket #12657 (closed defect: fixed)

Opened 12 months ago

Last modified 10 months ago

Fix btmrvl/mwifiex suspend interactions

Reported by: dsd Owned by: pgf
Priority: normal Milestone: 13.2.0
Component: kernel Version: not specified
Keywords: Cc: cjb, dsd, jvonau
Action Needed: test in build Verified: no
Deployments affected: Australia Blocked By:
Blocking:

Description

Fix the btmrvl and mwifiex drivers so that they can both be loaded and WOL works for the wifi card.

Right now these two drivers for the same piece of hardware do not coexist peacefully during suspend/resume, hence the btmrvl module is blacklisted by the build system.

Symplicity 479906.

Change History

Changed 11 months ago by pgf

  • next_action changed from never set to add to build

these kernel commits allow coexistence of mwifiex and btmrvl:

5e0c463 btmrvl: don't force unpowered suspends
c232e7b mwifiex: don't force unpowered suspends
1280dd5 mmc sdio: handle case where no card driver needs a powered suspend

when this kernel is added to the build, we should remove or rename the blacklist file that currently prevents btmrvl from loading.

Changed 11 months ago by dsd

  • next_action changed from add to build to test in build

Test in 13.2.0 build 6.

Changed 11 months ago by pgf

if either driver wishes the card to remain powered across suspend, then the hang does not occur.

if we enable the no_console_suspend kernel commandline parameter, then the hang does not occur.

when the hang occurs, it happens inside of:

  • bus_remove_device(), called from
  • device_del(), called from
  • sdio_remove_func(), called from
  • mmc_sdio_remove(), called from
  • mmc_suspend_host() after mmc_sdio_suspend() has returned -ENOSYS

the OLPC copy of mmc_suspend_host() has been modified to return -ENOSYS if no card function has requested power be maintained.

i'd provide a "real" stack backtrace, except for the issue of the problem not reproducing if no_console_suspend is enabled.

it's clear that there's mismanagement of devices happening somewhere.

Changed 11 months ago by cjb

It looks like you might be hitting this problem, since it exactly describes your symptom:

               if (err == -ENOSYS || !host->bus_ops->resume) {
                        /*
                         * We simply "remove" the card in this case.
                         * It will be redetected on resume.  (Calling
                         * bus_ops->remove() with a claimed host can
                         * deadlock.)
                         */
                        if (host->bus_ops->remove)
                                host->bus_ops->remove(host);
                        mmc_claim_host(host);
                        mmc_detach_bus(host);
                        mmc_power_off(host);
                        mmc_release_host(host);

.. yet the comment suggests that your use ought to be safe from deadlock, so perhaps the core code itself is buggy.

Changed 11 months ago by pgf

more tracing has led to this more complete call sequence: when the hang occurs, it happens inside of:

  • netif_carrier_off(), called from:
  • mwifiex_remove_card
  • mwifiex_sdio_remove
  • mwifiex_sdio_remove() (via dev->bus->remove(dev)
  • device_release_driver()
  • device_release_driver()
  • bus_remove_device()
  • device_del()
  • sdio_remove_func()
  • mmc_sdio_remove()
  • mmc_suspend_host() after mmc_sdio_suspend() has returned -ENOSYS

by being very judicious in my use of printk (actually, by replacing printk with one-character writes directly to the UART registers), i've been able to recreate the hang with no_console_suspend enabled. with that enabled, i sometimes get a timeout from mwifiex after the hang:

[   38.699170] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[   38.719863] usb usb1: usb auto-resume
[   38.719894] pxau2o-ehci d4208000.usb: resume root hub
[   38.729427] dcon_source_switch to DCON
[   38.766522] hub 1-0:1.0: hub_resume
[   38.766522] hub 1-0:1.0: port 1: status 0507 change 0000
[   38.775280] olpc-dcon: The DCON has control
[   38.779572] usb 1-1: usb auto-resume
[   38.816537] pxau2o-ehci d4208000.usb: GetStatus port:1 status 8001205 4  ACK POWER sig=se0 LPM PE CONNECT
[   38.836527] usb 1-1: finish resume
[   38.840382] hub 1-1:1.0: hub_resume
[   38.844868] pxau2o-ehci d4208000.usb: reused qh ec3d5b40 schedule
[   38.844889] usb 1-1: link qh256-0001/ec3d5b40 start 1 [1/0 us]
[   38.856760] hub 1-1:1.0: hub_suspend
[   38.856760] usb 1-1: unlink qh256-0001/ec3d5b40 start 1 [1/0 us]
[   38.860316] usb 1-1: usb suspend, wakeup 0
[   38.886521] hub 1-0:1.0: hub_suspend
[   38.886538] usb usb1: bus suspend, wakeup 0
[   38.890088] pxau2o-ehci d4208000.usb: suspend root hub
[   38.899357] mmp3_usb_phy_deinit_internal: Deinit usb phy!!!
[   38.916513] olpc_ec_1_75_suspend: suspend sync 00000001
FiIFiI[   38.936512] mmc2: dropping power over suspend
FGiI[   38.941392] mmc1: dropping power over suspend
FMxymB123bGHq!01234678&[[   39.055483] MARK:628 bus_remove_device() 'dev' is 0xec3d4008
[   39.062661] MARK:563 device_release_driver() 'dev' is 0xec3d4008
zxcmn*yo+O[   49.086517] mwifiex_sdio mmc0:0001:1: mwifiex_cmd_timeout_func: Timeout cmd id (1369151250.668437) = 0xe5, act = 0x1
[   49.086541] ------------[ cut here ]------------
[   49.086541] WARNING: at kernel/mutex.c:198 __mutex_lock_slowpath+0x84/0x28c()
[   49.086561] [<c0015770>] (unwind_backtrace+0x0/0x128) from [<c044f468>] (dump_stack+0x20/0x24)
[   49.086631] [<c044f468>] (dump_stack+0x20/0x24) from [<c00240d0>] (warn_slowpath_common+0x5c/0x74)
[   49.086649] [<c00240d0>] (warn_slowpath_common+0x5c/0x74) from [<c0024114>] (warn_slowpath_null+0x2c/0x34)
[   49.086662] [<c0024114>] (warn_slowpath_null+0x2c/0x34) from [<c04539b0>] (__mutex_lock_slowpath+0x84/0x28c)
[   49.086672] [<c04539b0>] (__mutex_lock_slowpath+0x84/0x28c) from [<c0453bd4>] (mutex_lock+0x1c/0x34)
[   49.086682] [<c0453bd4>] (mutex_lock+0x1c/0x34) from [<c0326aa0>] (clk_prepare+0x24/0x44)
[   49.086697] [<c0326aa0>] (clk_prepare+0x24/0x44) from [<c02389ec>] (serial_pxa_console_write+0x34/0x154)
[   49.086726] [<c02389ec>] (serial_pxa_console_write+0x34/0x154) from [<c0025d84>] (call_console_drivers.constprop.12+0x12c/0x170)
[   49.086726] [<c0025d84>] (call_console_drivers.constprop.12+0x12c/0x170) from [<c0026210>] (console_unlock+0x2fc/0x3d0)
[   49.086740] [<c0026210>] (console_unlock+0x2fc/0x3d0) from [<c00267d0>] (vprintk_emit+0x39c/0x3cc)
[   49.086762] [<c00267d0>] (vprintk_emit+0x39c/0x3cc) from [<c044f750>] (printk_emit+0x30/0x38)
[   49.086762] [<c044f750>] (printk_emit+0x30/0x38) from [<c0263358>] (__dev_printk+0x174/0x1a0)
[   49.086787] [<c0263358>] (__dev_printk+0x174/0x1a0) from [<c02634b8>] (dev_err+0x44/0x50)
[   49.086798] [<c02634b8>] (dev_err+0x44/0x50) from [<bf072874>] (mwifiex_cmd_timeout_func+0xd8/0x230 [mwifiex])
[   49.086920] [<bf072874>] (mwifiex_cmd_timeout_func+0xd8/0x230 [mwifiex]) from [<c0034b5c>] (run_timer_softirq+0x290/0x40c)
[   49.086920] [<c0034b5c>] (run_timer_softirq+0x290/0x40c) from [<c002c204>] (__do_softirq+0x118/0x24c)
[   49.086933] [<c002c204>] (__do_softirq+0x118/0x24c) from [<c002c74c>] (irq_exit+0x54/0x6c)
[   49.086943] [<c002c74c>] (irq_exit+0x54/0x6c) from [<c000f920>] (handle_IRQ+0x74/0x94)
[   49.086959] [<c000f920>] (handle_IRQ+0x74/0x94) from [<c0008548>] (asm_do_IRQ+0x18/0x1c)
[   49.086970] [<c0008548>] (asm_do_IRQ+0x18/0x1c) from [<c000e4cc>] (__irq_svc+0x4c/0x94)
[   49.086980] Exception stack(0xec92bf30 to 0xec92bf78)
[   49.086985] bf20:                                     00000000 f0004000 ec92bf68 ec92bfcc
[   49.086993] bf40: ec3c0000 00000001 ec92a000 ec3c7800 7fffffff 00000008 00000001 ec92bfbc
[   49.087002] bf60: ec92bf78 ec92bf78 c0301dbc c0301dc4 60000113 ffffffff
[   49.087010] [<c000e4cc>] (__irq_svc+0x4c/0x94) from [<c0301dc4>] (sdio_irq_thread+0x258/0x2f0)
[   49.087021] [<c0301dc4>] (sdio_irq_thread+0x258/0x2f0) from [<c0045f14>] (kthread+0x9c/0xa8)
[   49.087048] [<c0045f14>] (kthread+0x9c/0xa8) from [<c000f9fc>] (kernel_thread_exit+0x0/0x8)
[   49.087053] ---[ end trace 35d43f93aa9731a4 ]---
[   49.086561] Modules linked in: fuse xt_tcpudp iptable_filter ip_tables x_tables mousedev btmrvl_sdio btmrvl uinput joydev mwifiex_sdio bluetooth mwifiex fb_sys_fops sysimgblt sysfillrect syscopyarea psmouse mmp_camera videobuf2_core videobuf2_dma_sg videobuf2_vmalloc videobuf2_memops zforce ov7670 [last unloaded: udlfb]
[   49.347987] mwifiex_sdio mmc0:0001:1: num_data_h2c_failure = 0
[   49.381887] mwifiex_sdio mmc0:0001:1: num_cmd_h2c_failure = 0
[   49.387597] mwifiex_sdio mmc0:0001:1: num_cmd_timeout = 1
[   49.387606] mwifiex_sdio mmc0:0001:1: num_tx_timeout = 0
[   49.398231] mwifiex_sdio mmc0:0001:1: last_cmd_index = 0
[   49.398238] last_cmd_id: 00000000: e5 00 06 00 06                                   .....
[   49.411618] last_cmd_act: 00000000: 01 00 03 00 03                                   .....
[   49.419827] mwifiex_sdio mmc0:0001:1: last_cmd_resp_index = 4
[   49.419836] last_cmd_resp_id: 00000000: 06 80 06 80 06                                   .....
[   49.434079] mwifiex_sdio mmc0:0001:1: last_event_index = 0
[   49.439539] last_event: 00000000: 0a 00 0a 00 0b                                   .....
[   49.447575] mwifiex_sdio mmc0:0001:1: data_sent=0 cmd_sent=1
[   49.447575] mwifiex_sdio mmc0:0001:1: ps_mode=0 ps_state=0

so i think the problem is that we're still asking the 8787 module to honor commands, after we've told it to go into host sleep mode. but i think the card was supposed to be resumed already in order to prevent that.

Changed 11 months ago by Quozl

  • next_action changed from test in build to code

reproduces on 32007o4.zd after idle suspend' display off, power LED on, no response to keys. does not occur if btmrvl is moved out of modules tree.

Changed 11 months ago by pgf

  • cc dsd added
<dsd> pgf: just a guess (without looking at the code), for that mwifiex bug try dropping the line that sets is_suspended to true
<pgf> okay.  i'm officially hanging it up.  dsd just fixed the bug (which i've been diagnosing on and off for several days) by asking me to delete a line, without even looking at the code.
<dsd> hah, sorry
<pgf> oh, believe me -- it's okay!  :-)
...
<dsd> that flag also got in my way before, which is why i figured that was worth a try
<pgf> do you think it's needed for any of the other host interfaces?
<dsd> just the USB one
<pgf> okay, so it shouldn't be eliminated -- just not maintained in sdio.
<dsd> it would be worth getting an understanding of whats going on though, if you dont have that already. it suggests that we're trying to send a command after that flag is set
<pgf> yeah -- that's where i was when i stopped for the day.
<pgf> "that shouldn't be happening"
<dsd> ok
<pgf> i'll try and understand it tom'w.

i believe the problem is that while the setting of host sleep mode is done synchronously, unsetting it is done async. when mmc realizes that no card function needs a powered suspend, it runs all of the card function resume routines (which is when mwifiex_sdio_resume() sends the async command to cancel host sleep), and then commences removing the card. during the remove sequence, the driver tries to talk to the card, and there's a race condition against the cancellation of host sleep.

dsd's proposed change, above, fixes the problem. so does changing the cancellation of host sleep to be a synchronous command.

daniel -- opinions? we could apply either change, or both, i guess.

Changed 11 months ago by dsd

I don't think there is a race condition with async cancellation of host sleep and other commands - commands are still submitted in order, even if an async one comes first. Even if there is a race here, we have not found the concrete details that would explain a hang like this.

The is_suspended change seems to have its effect in that it disables half of the mwifiex_sdio_resume handler. Its also not really clear why that helps things.

I am also finding that adding printks here and there (or enabling no_console_suspend) makes the bug go away. So without good diagnosis I am not sure if the above 2 changes have any significance other than they affect the timing in small ways.

On IRC you said that the hang happens in netif_carrier_off (called from mwifiex_remove_card) and I have also reproduced the hang there a few times. I've also seen it move around slightly though. Anyway, digging down through the layers I get to the first call to schedule_delayed_work() in linkwatch_schedule_work() - that never returns. I haven't gone further, since I suspect more and more that the problem isn't actually in this code at all.

Maybe the kernel is switching to another task at this time, which hangs the system. Indeed, disabling kernel preemption does make the problem go away as well. But no doubt that also changes timings, so there is a chance this is also an irrelevant detail.

Changed 11 months ago by dsd

Kernel preemption on again. no_console_suspend enabled. I replaced the netif_carrier_off() call with a call to msleep(1000) as a nice hint to the scheduler to run another task. It hangs. And a short time later prints some messages similar to what Paul posted above:

[   49.072163] ------------[ cut here ]------------
[   49.072169] WARNING: at net/sched/sch_generic.c:256 dev_watchdog
[   49.083939] NETDEV WATCHDOG: eth1 (mwifiex_sdio): transmit queue 0 timed out
[   49.093977] ---[ end trace 1b56ae15412359c1 ]---
[   49.093977] mwifiex_sdio mmc0:0001:1: 4294942204 : Tx timeout

So, whatever task gets scheduled at this point never returns, but does not hard hang the system. And it seems like there is some mwifiex badness happening around this time which may well be the real cause of the issue.

Changed 11 months ago by dsd

At this time a flood of interrupts come in but mwifiex_sdio_interrupt() ignores them because surprise_removed is set. This seems bad and making them not be ignored solves the problem - I think it might be a real fix. For more certainty, we should investigate what these interrupts signify when they occur at this stage.

Changed 11 months ago by dsd

2 interrupts arrive at this time:

  1. PS_AWAKE event
  2. HS CFG command response

This seems to meet expectations.

I also confirmed that an interrupt storm is generated if the mwifiex interrupt handler gets called but does not read the "MP regs" that are read in mwifiex_interrupt_status(). Reading that register is needed to clear the interrupt. With the bogus surprise_removed check in place, that register read was not happening.

While we seem to have fixed a real problem here, I am still left without a clear explanation of why the system hung. The interrupt storm would have slowed things down until the handler got disabled later, but shouldn't be directly responsible for a crash.

Changed 11 months ago by dsd

With some sleeps and printks I found a semi-reliable way to reproduce the hang on demand, with no_console_suspend available.

By hooking up show_state() to the power button I found the issue. The btmrvl bluetooth driver is waiting for its main thread to stop. There is a race here meaning that the thread will not always be stopped when requested. And evidently this race becomes a lot more likely to happen when the mwifiex interrupt storm happens at the same time.

Pushed as b572bafeb49

Also pushed the mwifiex fix to stop ignoring interrupts as 523009baf7204

However testing further I can still reproduce a hang :( even with no_console_suspend enabled though, so I should be able to use the same debugging approach...

Changed 11 months ago by dsd

  • next_action changed from code to add to build

The mwifiex change to make it stop ignoring interrupts didn't quite work. I found a case where mwifiex processing an interrupt while simultaneously shuts down results in an infinite loop, causing the hang mentioned at the end of my final comment.

So now lets try an approach where mwifiex acks interrupts while shutting down, but does not act on them. Pushed 36da52ff3a, seems stable.

Changed 11 months ago by jvonau

  • deployment_affected set to Australia

Changed 11 months ago by jvonau

  • cc jvonau added

Changed 11 months ago by dsd

  • next_action changed from add to build to test in build

Test in 13.2.0 build 8.

Changed 10 months ago by dsd

  • status changed from new to closed
  • resolution set to fixed

After fixing #12701 as well we have not seen further crashes in this area.

Note: See TracTickets for help on using tickets.