Opened 4 years ago

Last modified 3 years ago

#10232 new defect

WiFi dies on suspended XO-1, os300

Reported by: hal.murray Owned by: dsaxena
Priority: low Milestone: 11.3.0
Component: wireless Version: not specified
Keywords: Cc: dsd, sascha_silbe, sridhar
Blocked By: Blocking:
Deployments affected: Action Needed: diagnose
Verified: no

Description

http://dev.laptop.org/ticket/10207 has the recipe to make suspend work on XO-1.
(delete the "a" from "ethtool -s eth0 wol ua" in /usr/sbin/powerd)

So I've been testing it.

It works most of the time, but occasionally wakeup from WiFi stops working.

I know about wakeup on arp not working. I can work around that by hardwiring that slot into the arp cache on my PC.

I saw it stop working a couple times a few days ago. I think the WiFi LED was still on, but I'm not sure. I'm pretty sure it recovered when I poked the touchpad. This would match arp not working, but I had the cache setup... Mumble. Both times happened while I was ssh-ed in.

I've got a stuck case right now. That was after working fine
for several days of light traffic. (I'm running a fake ntpd server so I can monitor the clock. So it's the XO is seeing the UDP equivalent of a ping every 5 minutes from two sources.)

The WiFi LED is off. It blinks when I wake it up by poking the touchpad, but the LED stays off and ping doesn't work.

ifconfig says eth0 and msh0 are up. I've grabbed a copy of /var/log/messages but I don't see anything interesting in there.

Attachments (11)

file.txt (1.6 KB) - added by Quozl 4 years ago.
test
0001-try-to-reassociate-after-resume-on-XO-1.patch (1.6 KB) - added by sascha_silbe 4 years ago.
WIP patch to re-associate after resume
c2-dmesg.txt (120.6 KB) - added by hal.murray 4 years ago.
dmesg from system that wasn't responding
c2-messages.txt (886.0 KB) - added by hal.murray 4 years ago.
/var/log/messages from system that wasn't responding
c2-messages.2.txt (640.3 KB) - added by hal.murray 4 years ago.
/var/log/messages with EC debug info for another deaf WiFi.
c2-dmesg.2.txt (119.7 KB) - added by hal.murray 4 years ago.
dmesg after hang with LED off
c2-messages.3.txt (963.3 KB) - added by hal.murray 4 years ago.
/var/log/messages after hang with LED off
c2-wpa_supplicant.log (105.4 KB) - added by hal.murray 4 years ago.
/var/log/wpa-supplicant.log after hang with LED off
c2-messages.4.txt (547.1 KB) - added by hal.murray 4 years ago.
/var/log/messages from LED-off hang with new powerd
c2-wpa_supplicant.2.log (216.6 KB) - added by hal.murray 4 years ago.
/var/log/wpa_supplicant.log from LED-off hang with new powerd
c2-powerd.trace (1.3 MB) - added by hal.murray 4 years ago.
/var/log/powerd.trace from LED-off hang with new powerd

Change History (40)

comment:1 Changed 4 years ago by dsd

  • Cc dsd added

It would be interesting to know if #9960 is reproducible on XO-1.

What does iwconfig say? Perhaps association with the access point has been lost?

comment:2 Changed 4 years ago by hal.murray

What does iwconfig say? Perhaps association with the access point has been lost?

I don't see anything interesting in ifwconfig. It looks the same as another XO that is working. In particular, the Access Point slot is the same on both systems.

Is there anything special I should check? I've grabbed a copy and can upload it after I reboot.

It would be interesting to know if #9960 is reproducible on XO-1

I will try #9960 after I reboot. I expect it won't suspend as that logic now checks network traffic.

comment:3 Changed 4 years ago by dsd

You should disable that logic while testing. If we can reproduce that bug on XO-1 it would become pretty clear that we can't expect wake-on-WOL to work reliably, as is the case on XO-1.5.

comment:4 Changed 4 years ago by hal.murray

I patched /usr/sbin/powerd to ignore network activity.

ping -f killed it, but it took a while. (?15 minutes)

The WiFi LED was on. It recovered when I poked the touchpad.

It's possible that this will turn out to be a duplicate of #9960, but the WiFi LED was out on the event that started this bug and it didn't recover when I poked the touchpad.

comment:5 Changed 4 years ago by hal.murray

I just hit another example of this case:

No wakeup from a WiFi packet
The WiFi LED off
Wakeup from the touchpad worked.
On wakeup, the WiFi LED blinked a few times
After wakeup the LED was off and ping didn't work.

This is with powerd patched to not notice network activity.

I was using ping -i 10. It was suspending on every other packet. It died after 52 packets.

comment:6 Changed 4 years ago by hal.murray

Since then, I've seen two more examples:

No response to pings (with hard-wired arp)
WiFI LED is out
wakeup from touchapad works but doens't fixup WiFi.

I'm using ping -i 10 as as a test case. It only takes a few minutes.

One sample was 30 packets. Another was 81 packets.

comment:7 Changed 4 years ago by hal.murray

More testing on os301:

There are two different cases.

I've seen one more example of the bug that started this entry.

WiFi LED off. No response to pings.

Poking touchpad wakes it up.

WiFi LED blinks, but doesn't turn on.

Pings still don't work.

Another case is where the wakeup from the WiFi gets lost.
I've seen 4 more examples of this. It may be a duplicate of #9960.

The LED is still on.

Poking the touchpad wakes it up.

After that, ping works.

I'm using ping -i 10 as my test case. This is with arp hard-wired and
with powerd patched to ignore network traffic as a reason to avoid
suspend.

comment:8 Changed 4 years ago by hal.murray

Comments from Sascha

Excerpts from Hal Murray's message of Wed Jul 14 20:27:37 +0000 2010:

A simple edit makes suspend/wakeup from WiFi work on XO-1 running os300. So
I've been testing it. It works most of the time, but occasionally stops
waking up from WiFi. Wakeup from touchpad works fine.

There's a very annoying and easy to trigger race condition regarding
suspend. If a wake-up event happens while we are still suspending (either
between the last check in powerd and triggering the suspend using
rtcwake, and/or already on the kernel side) it will be ignored. For
input devices this isn't as bad because any further event (key press /
release, touchpad movement) will wake up the XO. But AIUI the libertas
chip will wait for the host (= CPU / kernel) to confirm the wake-up
event without ever retrying.

(delete the "a" from "ethtool -s eth0 wol ua" in /usr/sbin/powerd)

FWIW I'm using "pum". This allows IPv6 to work transparently (IPv6
uses multicast for neighbour discovery).
You could add "b" (broadcast) as a replacement for "a" (ARP), but
at least on some of the networks I'm connected to regularly this
is impractical (both Windows SMB and proprietary Cisco router
management broadcast traffic).

The WiFi LED is off. It blinks when I wake it up by poking the touchpad,
but the LED stays off and ping doesn't work.

This suggests you lost association to the access point while in suspend.
Unfortunately this seems to be a firmware bug (specific to the chip in
the XO-1, you won't have this problem on an XO-1.5) that we cannot
easily work around in the kernel. If the chip looses association while
in suspend, it neither wakes up the host nor tells it about having lost
association when the host eventually wakes up. As there's no way for the
host to query the current association status (at least I didn't find one
in the sparse documentation), the kernel has no way to detect this.
The blinking you see upon resume is a WiFi scan. I haven't looked into
who triggers this scan - presumably NetworkManager. But since the kernel
and thus user space never notices we lost association, this isn't really
of interest.
So far my tests suggested that WPA rekeying wakes up the host, so the
reason for disassociation is probably something else. Even without
suspend the XO-1 will disassociate about once or twice a day on
average which makes this plausible. But it's merely a loose time-based
correlation; I haven't done any specific test to be sure about it.

Feel free to copy this to Trac - I just didn't have time to do so yet
(and wanted to do some other tests first, but didn't get around to
that either).

Sascha

comment:9 Changed 4 years ago by sascha_silbe

  • Cc sascha_silbe added

comment:10 Changed 4 years ago by cjb

  • Milestone changed from Not Triaged to 10.1.2
  • Priority changed from normal to blocker

Proposing for 10.1.2 blocker triage.

comment:11 follow-up: Changed 4 years ago by cjb

I wonder if there's any way to detect that we had association before and don't have it now, after resume. If we ask the driver for details of the association (like signal strength etc) does it just lie to us?

comment:12 in reply to: ↑ 11 Changed 4 years ago by sascha_silbe

Replying to cjb:

I wonder if there's any way to detect that we had association before and don't have it now, after resume. If we ask the driver for details of the association (like signal strength etc) does it just lie to us?

The kernel driver will "lie" because it doesn't know better: the firmware doesn't inform the host about the state change and there doesn't seem to be a way to explicitly query the current state.
The only idea I had so far is telling the firmware to associate again right after resume. AFAICT all required information is cached in the private driver structures. I'll attach a WIP patch to do that - I don't remember if it even compiled (it's been a few weeks since I last worked on it), but at least it should give some idea of how I intended it to work.

comment:13 Changed 4 years ago by sascha_silbe

Attaching files (still?) doesn't work (ticket filed now), so here's a link instead: http://sascha.silbe.org/patches/0001-try-to-reassociate-after-resume-on-XO-1.patch

Changed 4 years ago by Quozl

test

Changed 4 years ago by sascha_silbe

WIP patch to re-associate after resume

comment:14 Changed 4 years ago by dsd

Reproduced this bug on 8.2.1 on an XO-1 -- this doesn't seem to be a regressoin. Upon resume after a short sleep, wifi did not work. According to the admin page of my AP, the XO was no longer associated. Reconnecting to the network restored connectivity.

comment:15 Changed 4 years ago by Quozl

We've decided to turn off idle-suspend on XO-1 as a workaround to this problem, if we need to ship before it is fixed.

comment:16 Changed 4 years ago by Quozl

  • Milestone changed from 10.1.2 to 10.1.3

comment:17 Changed 4 years ago by Quozl

  • Milestone changed from 10.1.3 to 10.1.2
  • Priority changed from blocker to high

Changed 4 years ago by hal.murray

dmesg from system that wasn't responding

Changed 4 years ago by hal.murray

/var/log/messages from system that wasn't responding

comment:18 Changed 4 years ago by hal.murray

I've got the back of a system off. Is there a handy place to get a
scope/meter on the wakeup signal from the WiFi module to
the EC? Is that signal edge triggered or level sensitive?

richard@… said:

Easiest place is on the WLAN module Pin 49. Its connected to EC pin 36.

Its an edge trigger signal and the WLAN should pulse it over and over if

it doesn't see communication from the host.

Thanks.

I soldered a handle on to pin 49. I've got a scope on it.

That line is normally high. I see a 30 ms pulse going low when it is suspended and a packet arrives.

I've seen a few cases of it not waking up when the left WiFi LED is on and the right one blinks. When that happens, there is nothing on the scope. I just got good data from
one of those events.

The last response was at 13:18
I woke it up at 13:46 by poking the touchpad.

I've attached 2 files: dmesg output and /var/log/messages
/sys/module/olpc/parameters/ecdebug was on

It had been up for 15 hours, suspending about once per minute, so there is lots of junk in /var/log/messages. (It was hung for several of those hours.)

Looks like the dmesg buffer is so full of new stuff that the interesting old stuff is gone.
Oh, well. Next time...

Changed 4 years ago by hal.murray

/var/log/messages with EC debug info for another deaf WiFi.

comment:19 Changed 4 years ago by hal.murray

Another deaf WiFi event.

In case it helps, I've attached a copy of /var/log/messages
that includes EC debugging info from a hung event.

Again, the Left WiFi LED was on, the right WiFi LED was blinking, and there was no activity on pin 49 of the WiFi module.

It had been idle/hung/deaf for roughly a half hour before I noticed it.
Aug 27 01:49:02 xo-0d-57-33 kernel: [ 1277.559247] olpc-ec: sending cmd arg 0xff
Aug 27 02:12:27 xo-0d-57-33 kernel: [ 1278.637465] PM: Syncing filesystems ... done.

[The timestamps in the log file are the old times until the kernel updates the time on the wakeup path.]

comment:20 Changed 4 years ago by pgf

okay. i think this confirms it's a wlan firmware bug. you're not seeing activity on the wakeup line, and the EC doesn't have the wakeup masked (the 0x1b command sets the wakeup mask with the arg 0xff).

hal -- what traffic are you sending to the wlan (from another system) when this occurs? i assume you've taken care of locking down the arp cache somehow? and the traffic is a ping of some sort? it's possible you're seeing a wakeup race -- i.e., the wlan could send up a packet while the system is going to sleep, and if no other packet is received, or the initial packet isn't taken by the kernel, the wlan module might never do a wakeup.

if you're doing pings, perhaps it's possible to tell more using which packets were received and which were dropped.

we should also start thinking about workarounds. e.g., if we wake once a minute, on a timer, will that recover well enough? i'm picturing going right back to sleep unless the network traffic filters notice activity right after that wakeup.

(obviously fixing the firmware would be the right thing.)

comment:21 Changed 4 years ago by hal.murray

hal -- what traffic are you sending to the wlan (from another system) when
this occurs? i assume you've taken care of locking down the arp cache
somehow? and the traffic is a ping of some sort? it's possible you're
seeing a wakeup race -- i.e., the wlan could send up a packet while the
system is going to sleep, and if no other packet is received, or the
initial packet isn't taken by the kernel, the wlan module might never do a
wakeup.

I'm running a fake ntp server and poking it with real ntpd on 2 PCs.
One of them pokes it every 64 seconds. The other varies from 64 to 256 seconds.

On both of those PCs, the arp slot for the XO was setup manually.

I've got boatloads of ntp logging enabled on both PCs, but ntpd only
logs things when a packet arrives not when it sends one. We can probably
figure out when the first packet was lost by stepping forward from the
last one received.

we should also start thinking about workarounds. e.g., if we wake once a
minute, on a timer, will that recover well enough? i'm picturing going
right back to sleep unless the network traffic filters notice activity
right after that wakeup.

I'd expect that to work in this case. It recovers when I poke the touchpad.

But this is only half of the problem. There is also the case where the LED
is off. That doesn't recover when I poke the touchpad. It does recover if
I poke the icon for my AP so we can probably debug that path but it will take
some more work.


comment:22 Changed 4 years ago by Quozl

  • Milestone changed from 10.1.2 to 10.1.3

We have temporarily disabled idle suspend for os852, for this ticket and others, therefore we can push this ticket out to 10.1.3.

Changed 4 years ago by hal.murray

dmesg after hang with LED off

Changed 4 years ago by hal.murray

/var/log/messages after hang with LED off

Changed 4 years ago by hal.murray

/var/log/wpa-supplicant.log after hang with LED off

comment:23 Changed 4 years ago by hal.murray

I got some data from the LED-off type of hang. 3 files attached.

The WiFi LED was off. When I poked a key, it woke up.
The only AP showing was mine. Usually I have several more.
When I poked the AP, it reconnected.
After a while a few more APs showed up.

Here is a quirk that may be important or may be a wild goose chase.

Normally, my setup hangs in a few hours. (It may hang with the LED on
or off: two separate cases.)

I found a note on the web about getting more info by adding "-dddt" to the Exec line in

/usr/share/dbus-1/system-services/fi.epitest.hostap.WPASupplicant.service

After adding that, it ran for a day or two without problems.
It hung shortly (few hours) after I backed out that change.

Changed 4 years ago by hal.murray

/var/log/messages from LED-off hang with new powerd

Changed 4 years ago by hal.murray

/var/log/wpa_supplicant.log from LED-off hang with new powerd

Changed 4 years ago by hal.murray

/var/log/powerd.trace from LED-off hang with new powerd

comment:24 Changed 4 years ago by hal.murray

My previous comment about logging in wpa_supplicant helping was probably bogus.

I've added log files for a LED-out hang, including powerd.trace from the recent powerd.

comment:25 Changed 4 years ago by martin.langhoff

  • Action Needed changed from never set to diagnose

comment:26 Changed 4 years ago by erikos

  • Milestone changed from 10.1.3 to Future Release

Moving out, decision in triage meeting.

comment:27 Changed 4 years ago by martin.langhoff

  • Milestone changed from Future Release to 11.2.0-M4

Needs to be revisited with F14 kernel.

comment:28 Changed 3 years ago by dsd

  • Component changed from not assigned to wireless
  • Owner set to dsaxena
  • Priority changed from high to low

comment:29 Changed 3 years ago by sridhar

  • Cc sridhar added
Note: See TracTickets for help on using tickets.