Ticket #841 (closed defect: fixed)

Opened 17 months ago

Last modified 17 months ago

reproducible loss of wireless association

Reported by: Quozl Owned by: marcelo
Priority: blocker Milestone: BTest-3
Component: wireless Version:
Keywords: Cc: rchokshi@…
Action Needed: Verified: yes
Blocked By: Blocking:

Description

On build 239, B-Test-1 units, wireless association is lost while playing lots of small loud sounds and repeated calling of iwconfig.

When running the af test script below, initially the tones appear with normal regularity. After some time but under a minute, the rate of tones slows down to once every few seconds, cause unknown. At or near this time, the wireless LED begins to flash. Signal level may go to 0 or to -256 just before association is lost. The wireless LED is extinguished. Association may or may not be regained.

/usr/local/bin/signal-level:

#!/bin/sh
/sbin/iwconfig eth0|\
awk '/Signal level/ {print $4}'|\
cut -f2 -d=

/usr/local/bin/af:

#!/bin/sh
while true; do
    aplay /tmp/sine`signal-level`.wav
done

/usr/local/bin/af-table-create:

#!/bin/sh
for x in `seq -90 -30`; do
    FREQ=`echo "(90$x)*100+500"|bc`
    sox -t nul /dev/null -r 44100 -w /tmp/sine$x.wav \
        synth 00:00:00.05 sine create $FREQ
done

(this last script is normally run on another system, with the resulting files copied to the test unit).

Change History

  Changed 17 months ago by jg

  • cc marcelo added
  • owner changed from marcelo to dcbw
  • priority changed from low to high
  • milestone changed from Untriaged to BTest-3

Cool. Nice bug. I think this is likely serious and high priority, as I suspect you are tickling the firmware in some way that would come back to bite us and cause failure under NM, just after a much longer time.

I'm assigning this to dcbw for investigation; I suspect he'll find it to be firmware.

  Changed 17 months ago by Quozl

Does not occur in ad-hoc mode. So I wonder now if it relates to my access point in particular. Needs testing by others.

  Changed 17 months ago by jg

There have been quite a few problems in the association code in the past, so it seems likely it is on our end.

What access point are you using, specifically?

  Changed 17 months ago by Quozl

PowerNOC WOAP54G 802.11g Outdoor Access Point running OpenWRT White Russian RC5 from March 2006.

Unrelated ... did a two-laptop ad-hoc walkabout just then over a 2km path back to the base unit, the tweeting never faltered. So it is probably more than likely related to association. I may try to set up a B-Test-1 unit as an access point and see how it goes.

  Changed 17 months ago by dcbw

Is this with a manual association or with NetworkManager? If its with NM, try doing manual association with iwconfig and adding an IP address with ifconfig, and see if that's any different. Note that NM will periodically scan for wireless networks, which you can duplicate for testing purposes using 'iwlist eth0 scan'.

  Changed 17 months ago by marcelo

  • owner changed from dcbw to marcelo
  • priority changed from high to blocker
  • status changed from new to assigned

Note: I've seen behaviour where certain firmware commands which are generated from running repeated "iwconfig" causes the device to be non-functional (I forgot the details at the moment).

I'll take a closer look at this (ie try to reproduce it) this week, if Dan does not beat me to it.

  Changed 17 months ago by marcelo

Note2: this only happened when signal strength was very low.

  Changed 17 months ago by marcelo

  • owner changed from marcelo to dcbw
  • status changed from assigned to new

  Changed 17 months ago by marcelo

  • owner changed from dcbw to marcelo

  Changed 17 months ago by khassounah

We are seeing this in Libya. It is happening in 3 out of 4 machines repeatidly. You connect, see other machines in mesh view, browse and then you loose your connection.

In some cases we were iwconfiging, but in many cases, we just verified the connection works and left the machine, and it disconnects after 5 minutes. On two of those machines, it refuses to reconnect, except if we turn off the machine for few minutes.

The room is not warm (I would say 20-25 celsius) and the machines are powered. Antennas are up and the access point is sitting right next to them, so the signal is very strong. Build 239 on B1 machines.

One of those laptops and my thinkpad laptop are very good at keeping the connection to the access point, so it is probably not an access point issue.

For the record, the access point is a dlink DWL-2100AP

  Changed 17 months ago by marcelo

  • status changed from new to assigned

khassounah,

Do you see the problem without using mesh networking (just trying to isolate the problem).

  Changed 17 months ago by marcelo

Quozl,

I can't reproduce the problem with an A-test board running build193. Nice shell scripts btw.

Can't recall which version of the firmware this is running, but it has an md5sum of:

-bash-3.1# md5sum /lib/firmware/usb8388.bin 421964206b0699014a11256ac48dd0a8 /lib/firmware/usb8388.bin

Downloading build239 now for testing...

follow-up: ↓ 14   Changed 17 months ago by marcelo

Please execute the following commands to capture debugging output:

# rmmod usb8xxx ; modprobe usb8xxx libertas_debug=1

After reproducing the problem, gzip /var/log/messages and attach it to this entry.

Thanks

in reply to: ↑ 13   Changed 17 months ago by Quozl

Reproduced as requested.

The log could not be attached, trac reported an internal error, exceeded attachment size. The log has been placed on:

In the log, the logger command was used to mark the time of the rmmod and modprobe, and to mark the end of the test.

At least two long pauses happened during the test, during which the wireless light was extinguished. The last pause was within a minute or so of the end of the test. During this test, two other B-Test-1 units were associated with the access point.

(I should really have added a logger trigger inside my af loop).

follow-up: ↓ 18   Changed 17 months ago by marcelo

Thanks for the log.

The problem is not related to the low signal strength issue I was saw in the past.

What happpens is that association is lost suddenly. During this window, its possible to verify that the wireless device can't communicate with the AP since the periodic scanning performed by NM gives no results. Analyzing the log file, we see that the first instance of the problem happens during the window between lines 27155 and 28263 of your log file:

[marcelo@dmt libertas]$ cat process-signal.sh
grep -n Signal quozl-2-var-log-messages-841.txt | while read i ; do
SIGNAL=`echo $i | cut -f 2 -d "="`
DBM=`/home/marcelo/wireless/to-dbm $SIGNAL`
LINE=`echo $i | cut -f 1 -d ":"`
echo $SIGNAL=$DBM \($LINE\)
done

0xcc=-52 (24826)
0xcd=-51 (26334)
0xcd=-51 (27155)
0x0=-0 (27246)
0xd0=-48 (28263)
0xd0=-48 (29062)
0xce=-50 (29715)
0x0=-0 (29806)
0xce=-50 (30823)
0xce=-50 (31459)
0xce=-50 (32287)
0x0=-0 (32391)
0xcc=-52 (33390)

Closer look at the scan response during such window:

[marcelo@dmt libertas]$ egrep -n "\[ 1289.|\[ 1290" quozl-2-var-log-messages-841.txt  | grep SCAN_RESP
26813:Jan 31 21:18:33 localhost kernel: [ 1289.155856] SCAN_RESP: BSSDescriptSize 74
26814:Jan 31 21:18:33 localhost kernel: [ 1289.155884] SCAN_RESP: returned 1 AP before parsing
26815:Jan 31 21:18:33 localhost kernel: [ 1289.155917] SCAN_RESP: tlvBufSize = 12
26821:Jan 31 21:18:33 localhost kernel: [ 1289.156666] Ssid:               quozl.linux.org.au<6>SCAN_RESP: BSSID = 00:13:10:44:5f:ad
26822:Jan 31 21:18:33 localhost kernel: [ 1289.156743] SCAN_RESP: Scanned  1 APs, 1 valid, 1 total
26862:Jan 31 21:18:33 localhost kernel: [ 1289.274835] SCAN_RESP: BSSDescriptSize 0
26863:Jan 31 21:18:33 localhost kernel: [ 1289.274866] SCAN_RESP: returned 0 AP before parsing
26864:Jan 31 21:18:33 localhost kernel: [ 1289.274896] SCAN_RESP: tlvBufSize = 0
26865:Jan 31 21:18:33 localhost kernel: [ 1289.274956] SCAN_RESP: Scanned  0 APs, 0 valid, 1 total
26905:Jan 31 21:18:34 localhost kernel: [ 1289.503205] SCAN_RESP: BSSDescriptSize 0
26906:Jan 31 21:18:34 localhost kernel: [ 1289.503237] SCAN_RESP: returned 0 AP before parsing
26907:Jan 31 21:18:34 localhost kernel: [ 1289.503267] SCAN_RESP: tlvBufSize = 0
26908:Jan 31 21:18:34 localhost kernel: [ 1289.503327] SCAN_RESP: Scanned  0 APs, 0 valid, 1 total
27205:Jan 31 21:18:34 localhost kernel: [ 1289.677205] SCAN_RESP: BSSDescriptSize 0
27206:Jan 31 21:18:34 localhost kernel: [ 1289.677236] SCAN_RESP: returned 0 AP before parsing
27207:Jan 31 21:18:34 localhost kernel: [ 1289.677266] SCAN_RESP: tlvBufSize = 0
27208:Jan 31 21:18:34 localhost kernel: [ 1289.677324] SCAN_RESP: Scanned  0 APs, 0 valid, 0 total
27326:Jan 31 21:18:34 localhost kernel: [ 1290.123073] SCAN_RESP: BSSDescriptSize 0
27327:Jan 31 21:18:34 localhost kernel: [ 1290.123103] SCAN_RESP: returned 0 AP before parsing
27328:Jan 31 21:18:34 localhost kernel: [ 1290.123134] SCAN_RESP: tlvBufSize = 0
27329:Jan 31 21:18:34 localhost kernel: [ 1290.123192] SCAN_RESP: Scanned  0 APs, 0 valid, 0 total
27425:Jan 31 21:18:34 localhost kernel: [ 1290.242689] SCAN_RESP: BSSDescriptSize 0
27426:Jan 31 21:18:34 localhost kernel: [ 1290.242720] SCAN_RESP: returned 0 AP before parsing
27427:Jan 31 21:18:34 localhost kernel: [ 1290.242750] SCAN_RESP: tlvBufSize = 0
27428:Jan 31 21:18:34 localhost kernel: [ 1290.242809] SCAN_RESP: Scanned  0 APs, 0 valid, 0 total
27468:Jan 31 21:18:35 localhost kernel: [ 1290.689316] SCAN_RESP: BSSDescriptSize 74
27469:Jan 31 21:18:35 localhost kernel: [ 1290.689345] SCAN_RESP: returned 1 AP before parsing
27470:Jan 31 21:18:35 localhost kernel: [ 1290.689378] SCAN_RESP: tlvBufSize = 12
27476:Jan 31 21:18:35 localhost kernel: [ 1290.690130] Ssid:               quozl.linux.org.au<6>SCAN_RESP: BSSID = 00:13:10:44:5f:ad
27477:Jan 31 21:18:35 localhost kernel: [ 1290.690208] SCAN_RESP: Scanned  1 APs, 1 valid, 1 total
27517:Jan 31 21:18:35 localhost kernel: [ 1290.808295] SCAN_RESP: BSSDescriptSize 0
27518:Jan 31 21:18:35 localhost kernel: [ 1290.808326] SCAN_RESP: returned 0 AP before parsing
27519:Jan 31 21:18:35 localhost kernel: [ 1290.808356] SCAN_RESP: tlvBufSize = 0
27520:Jan 31 21:18:35 localhost kernel: [ 1290.808416] SCAN_RESP: Scanned  0 APs, 0 valid, 1 total

Quozl, one suspicion is that NM's periodic scanning triggers the lack of AP visibility issue.

Can you please try to reproduce the problem without NM running?

In the meantime, I'll look into finding a similar access point with OpenWRT to reproduce the problem.

  Changed 17 months ago by marcelo

So the PowerNOC WOAP54G is basically a repacked Linksys WRT54G:

http://www.mikemcarthur.net/article.php?story=20051006023224178

  Changed 17 months ago by marcelo

  • cc rchokshi@… added; marcelo removed

in reply to: ↑ 15   Changed 17 months ago by Quozl

Replying to marcelo:

Can you please try to reproduce the problem without NM running?

Problem does not occur if I turn off NetworkManager.

The method I used to do this was:

/etc/init.d/NetworkManager stop
killall dhclient 2> /dev/null || true
ifconfig eth0 down
sleep 4
iwconfig eth0 mode managed channel 10 essid quozl.linux.org.au
dhclient eth0

follow-up: ↓ 20   Changed 17 months ago by dcbw

This should also be reproducible by doing a manual connection and running 'iwlist eth0 scan' every so often.

Scanning should _not_ cause the firmware or driver to loose the association to the AP ever. We should probably make sure that userspace-triggered scanning is async, so that each scan request actually triggers 3 or 4 firmware scan commands. They are already broken up like that in the driver, but the driver blocks on completion of all 3 before returning. The trick is to make sure you don't stay off the currently associated AP's channel for too long; so after scanning each channel or two (using ~ 200 - 250ms dwell time), jump back to the association channel for a bit, then scan two more channels, etc.

We should put some debugging code into the driver to see exactly how long a scan really takes, and how long the card is on a different channel than the current association channel. Another option is to have the card tell the AP to queue packets and such when it scans, which is similar to the notification cards send when they enter brief sleep for power saving.

in reply to: ↑ 19 ; follow-up: ↓ 21   Changed 17 months ago by rchokshi

Replying to dcbw:

This should also be reproducible by doing a manual connection and running 'iwlist eth0 scan' every so often. Scanning should _not_ cause the firmware or driver to loose the association to the AP ever. We should probably make sure that userspace-triggered scanning is async, so that each scan request actually triggers 3 or 4 firmware scan commands. They are already broken up like that in the driver, but the driver blocks on completion of all 3 before returning. The trick is to make sure you don't stay off the currently associated AP's channel for too long; so after scanning each channel or two (using ~ 200 - 250ms dwell time), jump back to the association channel for a bit, then scan two more channels, etc.

Ronak >> this is correct explanation. We usually fix the dwelling time to be 100ms maximum, assuming that the beacon interval on the APs is 100TUs. But if you want to scan a lot of channels at one time, you can sqeeeze the max channel dwelling time to be approx. 50ms and do an active scan (if you are currently doing a passive scan).

We should put some debugging code into the driver to see exactly how long a scan really takes, and how long the card is on a different channel than the current association channel. Another option is to have the card tell the AP to queue packets and such when it scans, which is similar to the notification cards send when they enter brief sleep for power saving.

Ronak >> This is also a trick that we do sometimes. Just that if this is really required, we will have to provide an additional hook to you guys to program the firmware to send a NULL Data packet to the AP. I will check into this.

in reply to: ↑ 20   Changed 17 months ago by rsmith

Replying to rchokshi:

Replying to dcbw:

This should also be reproducible by doing a manual connection and running 'iwlist eth0 scan' every so often.

I can reproduce this at will with the new WRT54GL we have here at Cambridge.

Start up the laptop normall and associate with the AP. Then switch to a console. Run the following:

while true
do
iwconfig eth0 | grep Signal
done

Then on a different console run 'iwlist eth0 scan'

If NetworkManager is running this will break. iwconfig takes a long time to ruturn and reports numbers like -256dBm. Pings also stop.

If NM is not running then this does not happen.

Marcelo: I've got a WRT54GL with your name on it waiting for you here at Cambridge for when you arrive.

follow-up: ↓ 24   Changed 17 months ago by marcelo

Quozl,

Can you please test build262?

The problem should be fixed.

  Changed 17 months ago by jg

Quozl,

Have you tested this?

I can also test at home: I have WRT54GL's. I'll try to get to this this week.

in reply to: ↑ 22   Changed 17 months ago by Quozl

Replying to marcelo:

Can you please test build262? The problem should be fixed.

Been testing with build 262, indeed I can't seem to cause the problem any longer. I can still cause it on build 239 on the other units.

  Changed 17 months ago by marcelo

  • status changed from assigned to closed
  • resolution set to fixed
Note: See TracTickets for help on using tickets.