Ticket #12573 (closed defect: fixed)

Opened 19 months ago

Last modified 16 months ago

mwifiex/8787 resume failure with deluge of "mwifiex_sdio_host_to_card: no wr_port available" error messages

Reported by: shep Owned by: shep
Priority: high Milestone: 13.2.0
Component: kernel Version: Development source as of this date
Keywords: Cc: dsd
Action Needed: test in build Verified: no
Deployments affected: Blocked By:
Blocking:

Description

While testing wake on wireles lan with mwifiex/8787, I have seen another sort of mwifiex/8787 resume failure (four times so far).

This is testing with kernel d71dede9f07dac90eec8e0389afe7e58295611d7.

(Curiously, during my previous testing with kernel a92e022840e7e044fc9a45003f0772314a308191 I never saw this sort of failure. I am not sure if this bug is a regression or was it just random chance that I didn't see this bug before.)

This failure starts with these error messages on resume:

[  450.302766] mwifiex_sdio mmc0:0001:1: mwifiex_write_data_sync: not allowed while suspended
[  450.302766] mwifiex_sdio mmc0:0001:1: host_to_card, write iomem (1) failed: -1
[  450.302790] mwifiex_sdio mmc0:0001:1: mwifiex_write_data_sync: not allowed while suspended
[  450.302802] mwifiex_sdio mmc0:0001:1: host_to_card, write iomem (2) failed: -1
[  450.302821] mwifiex_sdio mmc0:0001:1: mwifiex_write_data_sync: not allowed while suspended
[  450.302832] mwifiex_sdio mmc0:0001:1: host_to_card, write iomem (3) failed: -1
[  450.302850] mwifiex_sdio mmc0:0001:1: mwifiex_write_data_async failed: 0xFFFFFFFF

and then shortly after that, a deluge of error messages like these continue spewing at a rate of more than 50 per second:

[  450.560288] mwifiex_sdio mmc0:0001:1: mwifiex_sdio_host_to_card: no wr_port available
[  450.566401] mwifiex_sdio mmc0:0001:1: mwifiex_sdio_host_to_card: no wr_port available
[  450.572449] mwifiex_sdio mmc0:0001:1: mwifiex_sdio_host_to_card: no wr_port available
[  450.578535] mwifiex_sdio mmc0:0001:1: mwifiex_sdio_host_to_card: no wr_port available
[  450.589839] mwifiex_sdio mmc0:0001:1: mwifiex_sdio_host_to_card: no wr_port available
[  450.595897] mwifiex_sdio mmc0:0001:1: mwifiex_sdio_host_to_card: no wr_port available
[  450.601961] mwifiex_sdio mmc0:0001:1: mwifiex_sdio_host_to_card: no wr_port available

(I have seen this bug seen four times so far on an XO-4 C1 1.2 GHz with Q7B15 firwmare but EC upgraded to EC 0.3.12 .)

Attachments

no-wr_port.log (21.8 kB) - added by pgf 19 months ago.

Change History

Changed 19 months ago by dsd

  • cc dsd added
  • priority changed from normal to high
  • milestone changed from Not Triaged to 13.1.0

Changed 19 months ago by pgf

adding a "me too" here. just got the same spew on os34 with kernel 7bbd178, on a 1Ghz B1. didn't catch the start of the spew.

Changed 19 months ago by Quozl

this reminds me of the violation of fundamental "contract" for the wrbitmap register in relation to host_intstatus that was found in open firmware during wireless firmware load. see mitch's mail of 30th jan. does the linux driver rely on the host_intstatus register?

Changed 19 months ago by pgf

i augmented the failure message to show 'ret', and enabled the dev_dbg message in mwifiex_get_wr_port_data() to get a little more info, in case it helps. log attached as no-wr_port.log

Changed 19 months ago by pgf

Changed 19 months ago by Quozl

reviewed, no suggestions.

drivers/net/wireless/mwifiex/sdio.c function mwifiex_process_int_status does indeed filter the read of WR_BITMAP_{U,L} to only occur when DN_LD_HOST_INT_STATUS is set.

the following if statement in the code looks like it tries to handle what was observed in open firmware.

Changed 19 months ago by dsd

The first thing I would look at here is the first message.

[  450.302766] mwifiex_sdio mmc0:0001:1: mwifiex_write_data_sync: not allowed while suspended

What is mwifiex trying to write while suspended, and why?

Maybe the upper layers of mwifiex are trying to do some I/O before the SDIO layer thinks it is ready. In which case the upper mwifiex layers should be fixed to pay attention to the adapter->is_suspended flag and not submit their I/O requests until later.

Changed 19 months ago by dsd

An easy way to reproduce this on every ping-invoked wakeup is to add the following msleep in mwifiex_sdio_resume()

	msleep(1000);
	adapter->is_suspended = false;

	/* Disable Host Sleep */
	mwifiex_cancel_hs(mwifiex_get_priv(adapter, MWIFIEX_BSS_ROLE_STA),
			  MWIFIEX_ASYNC_CMD);

Changed 19 months ago by dsd

reported to linux-wireless http://marc.info/?l=linux-wireless&m=136338503417562&w=2 and symplicity #478729

Changed 18 months ago by shep

In my testing the past few days I've seen this bug more often without the deluge of "no wr_port available" messages, but the three "not allowed while suspended" during the resume are the same (or very similar) and the interface does not work after that happens.

Changed 18 months ago by shep

I spent some time yesterday trying to understand this bug, reading the mwifiex code in our olpc-kernel/arm-3.5 branch and also reading mwifiex in kernel.org 3.9-rc4. I did find one significant change upstream in 601216e12c65e131453941962f2a8f6e49a2a379 which changes significantly the scenario which (in my not yet perfect understanding of this bug) seemed relevant to the sort of race condition I'm looking for to explain this bug.

I tried "git cherry-pick -x 601216e12c65e131453941962f2a8f6e49a2a379" and am now testing.

So far so good: over 350 wake-on-lan wakeups without any problems, which is more than I think I've seen before.

If this keeps up I will push that later this evening.

Changed 18 months ago by shep

Cherry picking 601216e12c65e131453941962f2a8f6e49a2a379 seems to have helped reduce much the occurrence of this bug, but it still turned up once (after 600+ trouble-free resumes):

[ 8696.845388] mwifiex_sdio mmc0:0001:1: mwifiex_write_data_sync: not allowed while suspended
[ 8696.845388] mwifiex_sdio mmc0:0001:1: host_to_card, write iomem (1) failed: -1
[ 8696.845416] mwifiex_sdio mmc0:0001:1: mwifiex_write_data_sync: not allowed while suspended
[ 8696.845428] mwifiex_sdio mmc0:0001:1: host_to_card, write iomem (2) failed: -1
[ 8696.845447] mwifiex_sdio mmc0:0001:1: mwifiex_write_data_sync: not allowed while suspended
[ 8696.845457] mwifiex_sdio mmc0:0001:1: host_to_card, write iomem (3) failed: -1
[ 8696.845474] mwifiex_sdio mmc0:0001:1: mwifiex_write_data_async failed: 0xFFFFFFFF

there were no further error messages, and "rtcwake -m mem -s 60" continued to do suspends and resumes, but there was no wake-on-lan happening after that and the interface no longer works (cannot ping anything from the XO) and the wireless LED is now off.

Changed 18 months ago by shep

I have pushed my cherry-pick of upstream 601216e12c65e131453941962f2a8f6e49a2a379

http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.5&id=5eabcc89bf5fc517197dbf845795dd6eef56975a

Even though I still saw one instance of this bug after adding this cherry-pick, I'm seeing it something like 100 times less often. (I've seen it only once in over 1000 suspend/resumes.)

And I've seen no other problems (in particular, no instances of #12569).

Changed 18 months ago by shep

On Wednesday I tried reverting d71dede9f07dac90eec8e0389afe7e58295611d7 and restarted my testing loop. I've not seen any problems since, other than the initial packet that does the wakeup no longer generates a ping reply (#12554).

So perhaps my original observation that I had not seen this bug before d71dede9f07dac90eec8e0389afe7e58295611d7 (the fix for #12554) is in fact the most relevant clue.

Perhaps we should fix this by pushing a revert of d71dede9f07dac90eec8e0389afe7e58295611d7 and reopening #12554. (Suffering from #12554 is probably preferable to suffering from this bug.)

(Of course, my real hope is that this clue will allow a good fix for both to be developed.)

Changed 18 months ago by dsd

When I looked at #12554 it appeared to me that that issue would affect more than just a lost ping reply, it would affect other packets too. Like you I would prefer not to revert it.

I don't think this will be hard to debug with some time dedicated to it, with the msleep(1000) reproducer that I noted above.

Changed 18 months ago by dsd

The call trace that leads to the "not allowed while suspended" message is:

[<c02b4434>] (mwifiex_write_data_sync+0x0/0xac) from [<c02b451c>] (mwifiex_write_data_to_card+0x3c/0x98)
[<c02b44e0>] (mwifiex_write_data_to_card+0x0/0x98) from [<c02b48cc>] (mwifiex_host_to_card_mp_aggr+0x354/0x46c)
[<c02b4578>] (mwifiex_host_to_card_mp_aggr+0x0/0x46c) from [<c02b4b74>] (mwifiex_sdio_host_to_card+0x190/0x21c)
[<c02b49e4>] (mwifiex_sdio_host_to_card+0x0/0x21c) from [<c02a1f24>] (mwifiex_process_tx+0x90/0x1a0)
[<c02a1e94>] (mwifiex_process_tx+0x0/0x1a0) from [<c02a2ac4>] (mwifiex_send_single_packet+0x134/0x2b0)
[<c02a2990>] (mwifiex_send_single_packet+0x0/0x2b0) from [<c02a3970>] (mwifiex_wmm_process_tx+0x76c/0x800)
[<c02a3204>] (mwifiex_wmm_process_tx+0x0/0x800) from [<c029d7ac>] (mwifiex_main_process+0x340/0x4c8)
[<c029d46c>] (mwifiex_main_process+0x0/0x4c8) from [<c02b4230>] (mwifiex_sdio_interrupt+0x15c/0x178)

Changed 18 months ago by dsd

  • next_action changed from never set to add to build
  • milestone changed from 13.1.0 to 13.2.0

So mwifiex itself is transmitting before mwifiex has finished resuming.

I have pushed a possible fix to arm-3.5, which tweaks netdev state to avoid this situation until things have fully resumed.

That might not be the right fix (hopefully Marvell will come up with one) but it seems like a logical way to serialize the events for now.

Changed 18 months ago by dsd

  • next_action changed from add to build to test in build

Test in 13.2.0 build 3.

Changed 16 months ago by dsd

  • status changed from new to closed
  • resolution set to fixed

Has not been seen since the fix was made. And Marvell have also produced an upstream fix for this.

Note: See TracTickets for help on using tickets.