Ticket #8667 (closed defect: fixed)

Opened 6 years ago

Last modified 6 months ago

wpa_supplicant generates many redundant scan requests before associating

Reported by: jcardona Owned by: dsaxena
Priority: high Milestone: 8.2.1
Component: wireless Version: not specified
Keywords: 8.2.1:+ Cc: mbletsas, jcardona, carrano, ashishs, joe, cjl, cjb
Action Needed: unknown Verified: no
Deployments affected: Blocked By: #9053
Blocking:

Description

Below is the dmesg log (libertas_debug=0x180) for one instance of this problem. This is the log for one individual association request via wpa_supplicant. The "scan table:" line is printed at the end of a full scan (all channels). In each output we see that the requested AP (SSID=cozybit) is found and reported. Yet it takes 8 full scans before an association request is formed.

cat dmesg0.log | grep -e cozybit -e 'scan table' -e 'looking' -e flag
[85036.478574] libertas scan: scan table:
[85036.478574] libertas scan: 00: BSSID 00:13:10:11:11:11, RSSI 58,
SSID 'cozybit'
[85037.979206] libertas scan: got SSID IE: 'cozybit', len 7
[85039.914010] libertas scan: scan table:
[85039.914051] libertas scan: 00: BSSID 00:13:10:11:11:11, RSSI 55,
SSID 'cozybit'
[85041.418576] libertas scan: got SSID IE: 'cozybit', len 7
[85043.367222] libertas scan: scan table:
[85043.367256] libertas scan: 00: BSSID 00:13:10:11:11:11, RSSI 59,
SSID 'cozybit'
[85044.860520] libertas scan: got SSID IE: 'cozybit', len 7
[85046.771237] libertas scan: scan table:
[85046.771278] libertas scan: 00: BSSID 00:13:10:11:11:11, RSSI 58,
SSID 'cozybit'
[85048.266732] libertas scan: got SSID IE: 'cozybit', len 7
[85050.226451] libertas scan: scan table:
[85050.226485] libertas scan: 00: BSSID 00:13:10:11:11:11, RSSI 57,
SSID 'cozybit'
[85051.729253] libertas scan: got SSID IE: 'cozybit', len 7
[85053.673176] libertas scan: scan table:
[85053.673211] libertas scan: 00: BSSID 00:13:10:11:11:11, RSSI 57,
SSID 'cozybit'
[85055.166316] libertas scan: got SSID IE: 'cozybit', len 7
[85057.118583] libertas scan: scan table:
[85057.118583] libertas scan: 00: BSSID 00:13:10:11:11:11, RSSI 58,
SSID 'cozybit'
[85058.616951] libertas scan: got SSID IE: 'cozybit', len 7
[85060.553657] libertas scan: scan table:
[85060.553690] libertas scan: 00: BSSID 00:13:10:11:11:11, RSSI 56,
SSID 'cozybit'
[85061.058737]     flags:     0x00000c36
[85061.058746]     SSID:      'cozybit'
[85061.165825] libertas assoc: ASSOC: associated to 'cozybit', 00:13:10:11:11:11

Attachments

0001-libertas-Do-not-schedule-a-scan-when-one-is-in-prog.patch (2.5 kB) - added by jcardona 6 years ago.
scan.ok.tgz (120.0 kB) - added by carrano 6 years ago.
dmesg output for successful scanning operation
scan.fail.tgz (30.0 kB) - added by carrano 6 years ago.
dmesg output for failedl scanning operation
scan-nomesh.tgz (50.0 kB) - added by carrano 6 years ago.
scans (fail and success examples) with mesh disabled
assoc-fail-nomesh (62.5 kB) - added by carrano 6 years ago.
a failed association with mesh disabled
fail.Harumaki (13.4 kB) - added by carrano 6 years ago.
failed association with test firmware and test kernel
scan.22p255.fail.1 (0.9 kB) - added by carrano 6 years ago.
failed scanning eith test firmware and test kernel
scan.22p255.ok.1 (3.7 kB) - added by carrano 6 years ago.
successful scanning with test firmware and test kernel
scan.22p255.nomesh.ok.1 (3.0 kB) - added by carrano 6 years ago.
successful scanning with test firmware and test kernel (mesh disabled)

Change History

  Changed 6 years ago by jcardona

The attached patch has cut down the (worst case) number scans required to associate from 8 to 2. Test logs below.

./7825.sh 
Setting debug level to 0x182
Trying AP0 (WPA2)...COMPLETED scans:1
Setting debug level to 0x182
Trying AP1 (WEP40)...COMPLETED scans:1
Setting debug level to 0x182
Trying AP0 (WPA2)...COMPLETED scans:0
Setting debug level to 0x182
Trying AP1 (WEP40)...COMPLETED scans:0
Setting debug level to 0x182
Trying AP0 (WPA2)...COMPLETED scans:1
Setting debug level to 0x182
Trying AP1 (WEP40)...COMPLETED scans:0
Setting debug level to 0x182
Trying AP0 (WPA2)...COMPLETED scans:0
Setting debug level to 0x182
Trying AP1 (WEP40)...COMPLETED scans:2
Setting debug level to 0x182
Trying AP0 (WPA2)...COMPLETED scans:0
Setting debug level to 0x182
Trying AP1 (WEP40)...COMPLETED scans:0
Setting debug level to 0x182
Trying AP0 (WPA2)...COMPLETED scans:0
Setting debug level to 0x182
Trying AP1 (WEP40)...COMPLETED scans:1
Setting debug level to 0x182
Trying AP0 (WPA2)...COMPLETED scans:0
Setting debug level to 0x182
Trying AP1 (WEP40)...COMPLETED scans:2
Setting debug level to 0x182
Trying AP0 (WPA2)...COMPLETED scans:0
Setting debug level to 0x182
Trying AP1 (WEP40)...COMPLETED scans:1
Setting debug level to 0x182
Trying AP0 (WPA2)...COMPLETED scans:0
Setting debug level to 0x182
Trying AP1 (WEP40)...COMPLETED scans:2
Setting debug level to 0x182
Trying AP0 (WPA2)...COMPLETED scans:0
Setting debug level to 0x182
Trying AP1 (WEP40)...COMPLETED scans:0
Setting debug level to 0x182
Trying AP0 (WPA2)...COMPLETED scans:0
Setting debug level to 0x182
Trying AP1 (WEP40)...COMPLETED scans:1
PASS

# cat /proc/kmsg | grep -e et_scan -e igno
<7>[76121.840982] libertas leave: lbs_get_scan():1480, ret 0
<7>[76121.842305] libertas leave: lbs_set_scan():1405
<7>[76124.382665] libertas leave: lbs_get_scan():1480, ret 0
<7>[76128.392754] libertas leave: lbs_get_scan():1480, ret 0
<7>[76132.889357] libertas leave: lbs_get_scan():1480, ret 0
<7>[76137.206734] libertas leave: lbs_get_scan():1480, ret 0
<7>[76152.453230] libertas leave: lbs_set_scan():1405
<7>[76154.919081] libertas leave: lbs_get_scan():1480, ret -11
<7>[76154.919081] libertas scan: scan in progress: ignore scan req
<7>[76154.919081] libertas leave: lbs_set_scan():1405
<7>[76185.372016] libertas leave: lbs_get_scan():1480, ret 0
<7>[76190.458637] libertas leave: lbs_set_scan():1405
<7>[76192.969382] libertas leave: lbs_get_scan():1480, ret 0
<7>[76197.221853] libertas leave: lbs_get_scan():1480, ret 0
<7>[76201.548795] libertas leave: lbs_get_scan():1480, ret 0
<7>[76205.991586] libertas leave: lbs_get_scan():1480, ret 0
<7>[76210.335700] libertas leave: lbs_get_scan():1480, ret 0
<7>[76210.335700] libertas leave: lbs_set_scan():1405
<7>[76212.839388] libertas leave: lbs_get_scan():1480, ret 0
<7>[76216.962027] libertas leave: lbs_get_scan():1480, ret 0
<7>[76221.291161] libertas leave: lbs_get_scan():1480, ret 0
<7>[76236.496878] libertas leave: lbs_set_scan():1405
<7>[76238.957768] libertas leave: lbs_get_scan():1480, ret -11
<7>[76238.957768] libertas scan: scan in progress: ignore scan req
<7>[76238.957768] libertas leave: lbs_set_scan():1405
<7>[76269.285934] libertas leave: lbs_get_scan():1480, ret 0
<7>[76274.354219] libertas leave: lbs_set_scan():1405
<7>[76276.872398] libertas leave: lbs_get_scan():1480, ret 0
<7>[76281.093796] libertas leave: lbs_get_scan():1480, ret 0
<7>[76286.434995] libertas leave: lbs_get_scan():1480, ret 0
<7>[76301.643573] libertas leave: lbs_set_scan():1405
<7>[76304.068899] libertas leave: lbs_get_scan():1480, ret 0
<7>[76308.684546] libertas leave: lbs_get_scan():1480, ret 0
<7>[76312.983518] libertas leave: lbs_get_scan():1480, ret 0
<7>[76328.227136] libertas leave: lbs_set_scan():1405
<7>[76330.672265] libertas leave: lbs_get_scan():1480, ret -11
<7>[76330.672265] libertas scan: scan in progress: ignore scan req
<7>[76330.672265] libertas leave: lbs_set_scan():1405
<7>[76361.076771] libertas leave: lbs_get_scan():1480, ret 0
<7>[76366.186987] libertas leave: lbs_set_scan():1405
<7>[76368.686105] libertas leave: lbs_get_scan():1480, ret 0
<7>[76372.984438] libertas leave: lbs_get_scan():1480, ret 0
<7>[76377.294969] libertas leave: lbs_get_scan():1480, ret 0
<7>[76381.772545] libertas leave: lbs_get_scan():1480, ret 0
<7>[76386.071862] libertas leave: lbs_get_scan():1480, ret 0
<7>[76386.072918] libertas leave: lbs_set_scan():1405
<7>[76388.535205] libertas leave: lbs_get_scan():1480, ret 0

  Changed 6 years ago by carrano

Deepak,

Could you commit the provided patch to the proper branch so we can test it? Thank you!

follow-up: ↓ 4   Changed 6 years ago by cjb

  • cc mbletsas added
  • owner changed from mbletsas to dsaxena
  • next_action changed from never set to package

Ricardo, Deepak wasn't copied on the bug.

(The testing branch is closed for release at the moment, so this might want to wait a few days until an 8.2.0 image is out the door.)

in reply to: ↑ 3   Changed 6 years ago by carrano

Replying to cjb:

Ricardo, Deepak wasn't copied on the bug.

My mistake (lost to concurrency when updating and then did not add the cc again)

(The testing branch is closed for release at the moment, so this might want to wait a few days until an 8.2.0 image is out the door.)

Isn't there any other option? Deepak, could we make a private build or something like that? I used to compile my own test kernels, but since 2.6.25 the rpmbuild process needs changing.

follow-up: ↓ 7   Changed 6 years ago by mbletsas

  • cc jcardona, carrano added

Deepak,

Please make us a kernel rpm with this patch asap.

Thanks,

M.

  Changed 6 years ago by mbletsas

  • priority changed from normal to high
  • milestone changed from Not Triaged to 8.2.1

in reply to: ↑ 5   Changed 6 years ago by dsaxena

Replying to mbletsas:

Deepak, Please make us a kernel rpm with this patch asap. Thanks, M.

Just saw this. Will get an RPM uploaded soon.

  Changed 6 years ago by dsaxena

http://dev.laptop.org/~dsaxena/kernel-2.6.25-20080926.1.olpc.5f842a369c6d018.i586.rpm

This is for testing purposes only and not an official PRM For release.

  Changed 6 years ago by carrano

The bug behavior (scanning and association)

I installed the latest test kernel above and concentrated on the scanning operations. Here are some pieces of the puzzle that were exhaustively tested and confirmed:

  • Whenever the association to my AP fails (GUI), an scanning in the command line, issued immediately after the assoc fail *will also fail* with "interface doesn't support scanning: network is down" (logs will be attached)
  • Successive scanning commands after a failed one will eventually work, with unknown reason
  • If a scanning is successful, successive attempts will *always* be successful UNLESS an attempt to associate to the AP is done, in which case the scanning will start failing again.

Some logs for fail or success are attached in two tar balls to this ticket.

This scanning issues partial state machine derived from the above (sorry for the not very formal representation):

  • If (previous scan was ok) and (action is scan) => scan succeeds
  • If (previous scan was ok) and (action is try to associate) => scan fails
  • If (previous scan failed) and (action is scan) => scan result is unknown but 5 times more likely to fail)

The user experience In terms of user experience, the details for a failure are:

1 - The user clicks on the AP icon, the GUI will either ask for a key or not (because it has it already)

2 - The AP circle starts blinking and the keyhole led blinks fast for some seconds, remains off for a second, on for a second, and finally goes off definitely

3 - Immediately after 2, the AP stops to blink, the association fails, the message "connecting" disappears and "connect" option shows up again.

Changed 6 years ago by carrano

dmesg output for successful scanning operation

Changed 6 years ago by carrano

dmesg output for failedl scanning operation

  Changed 6 years ago by carrano

The dmesg outputs for fail and for success attached were taken with:

dmesg -c; sleep 1; iwlist eth0 scan; sleep 20; dmesg | tee <output-file>

So, they register 20 seconds after the scanning command in issued.

  Changed 6 years ago by carrano

  • cc ashishs added

Analyzing the failing scans from this log one thing that caught my attention is that among the last messages you'll almost always find at the end of the output:

ADDRCONF(NETDEV_CHANGE): msh0: link becomes ready

This happened in 6 of the 7 fails reported bellow:

for i in scan.fail.*; do echo $i; grep -v olpc- $i | tail -n 6; done

scan.fail.1
[ 1228.616204] libertas scan: got mesh IE
[ 1228.616230] libertas scan: SCAN_RESP: BSSID 02:2c:bc:96:dd:89
[ 1228.616283] libertas scan: scanning 4 of 9 channels
[ 1229.065121] libertas scan: SCAN_RESP: bssdescriptsize 0
[ 1229.065153] libertas scan: SCAN_RESP: scan results 0
[ 1229.065206] libertas scan: scanning 4 of 5 channels
scan.fail.2
[ 1503.809944] libertas scan: SCAN_RESP: scan results 0
[ 1503.809944] libertas scan: scanning 1 of 1 channels
[ 1503.930637] libertas scan: SCAN_RESP: bssdescriptsize 0
[ 1503.930637] libertas scan: SCAN_RESP: scan results 0
[ 1503.930637] libertas scan: scan table:
[ 1503.930637] ADDRCONF(NETDEV_CHANGE): msh0: link becomes ready
scan.fail.3
[ 1573.899593] libertas scan: SCAN_RESP: scan results 0
[ 1573.899593] libertas scan: scanning 1 of 1 channels
[ 1574.020817] libertas scan: SCAN_RESP: bssdescriptsize 0
[ 1574.020817] libertas scan: SCAN_RESP: scan results 0
[ 1574.020817] libertas scan: scan table:
[ 1574.020817] ADDRCONF(NETDEV_CHANGE): msh0: link becomes ready
scan.fail.4
[ 2101.752710] libertas scan: scanning 1 of 1 channels
[ 2101.874709] libertas scan: SCAN_RESP: bssdescriptsize 0
[ 2101.874709] libertas scan: SCAN_RESP: scan results 0
[ 2101.874709] libertas scan: scan table:
[ 2101.874709] ADDRCONF(NETDEV_CHANGE): msh0: link becomes ready
[ 2112.895701] msh0: no IPv6 routers present
scan.fail.5
[ 2149.298615] libertas scan: scanning 1 of 1 channels
[ 2149.422407] libertas scan: SCAN_RESP: bssdescriptsize 0
[ 2149.422440] libertas scan: SCAN_RESP: scan results 0
[ 2149.422516] libertas scan: scan table:
[ 2149.423202] ADDRCONF(NETDEV_CHANGE): msh0: link becomes ready
[ 2160.405694] msh0: no IPv6 routers present
scan.fail.6
[ 2172.706742] libertas scan: scanning 1 of 1 channels
[ 2172.823242] libertas scan: SCAN_RESP: bssdescriptsize 0
[ 2172.823654] libertas scan: SCAN_RESP: scan results 0
[ 2172.825204] libertas scan: scan table:
[ 2172.825416] ADDRCONF(NETDEV_CHANGE): msh0: link becomes ready
[ 2183.093512] msh0: no IPv6 routers present
scan.fail.7
[ 2218.281645] libertas scan: SCAN_RESP: scan results 0
[ 2218.281645] libertas scan: scanning 1 of 1 channels
[ 2218.399022] libertas scan: SCAN_RESP: bssdescriptsize 0
[ 2218.399055] libertas scan: SCAN_RESP: scan results 0
[ 2218.404603] libertas scan: scan table:
[ 2218.405089] ADDRCONF(NETDEV_CHANGE): msh0: link becomes ready

  Changed 6 years ago by carrano

Ashish,

Would you please take a look at it and check if the firmware is playing a role here?

Thank you! Ricardo

  Changed 6 years ago by carrano

As suggested by Michail, I repeated the tests described above here with mesh disabled:

echo 0 > /sys/class/net/eth1/lbs_mesh

The repeatable results are:

  • Association still fails as in the previous test, only this time it keeps trying longer (see assoc-fail-nomesh attached file)
  • Scanning after the failed assoc will also fail, but only if performed immediately after the failed association attempt.
  • Scanning will not fail two times in a row. (but they are being performed with intervals longer than 20 seconds)
  • All the times scanning fails, the first line of dmesg output will be (see fails logs in scan.nomesh.tgz):
    ADDRCONF(NETDEV_UP): eth0: link is not ready
    

Changed 6 years ago by carrano

scans (fail and success examples) with mesh disabled

Changed 6 years ago by carrano

a failed association with mesh disabled

  Changed 6 years ago by ashish

Can someone give a try with debug firmware version 0.0.0.p20 http://dev.laptop.org/attachment/ticket/8666/usb8388_0_0_0_p20.tar.gz and post the dmesg log? It will help us to nail down the exact problem.

  Changed 6 years ago by ashish

I am sorry missed out the fact the driver no longer accept zero version string. Please use http://dev.laptop.org/attachment/ticket/8666/usb8388_5_110_22_p255.tar.gz to test. Thanks!

  Changed 6 years ago by carrano

Repeated the association and scanning tests again with the latest test kernel, but this time with the test firmware

I am attaching fail.Harumaki, which is a dmesg output (scanning and association debug on) for a failed association to SSID Harumaki (WPA AP)

From a user perspective, association hits seems more frequent now, but it is really hard to conclude anything from this observation now, since the variance in all the tests so far are still very high.

Also repeated the tests in which a scanning is performed immediately after a failed association. The only debug flag on was scan.

dmesg -c; sleep 1; iwlist eth0 scan; sleep 20; dmesg | tee <output-file>

Scanning ok, after a failed association: scan.22p255.ok.1

Scanning fails, after a failed association: scan.22p255.fail.1

Mesh disabled, scanning ok, after a failed association: scan.22p255.nomesh.ok.1

Mesh disabled, scanning ok, after a failed association: curiously dmesg register no output during this 20 seconds in this case (repeated this three times)

No firmware "events" were registered in any of the above tests.

Changed 6 years ago by carrano

failed association with test firmware and test kernel

Changed 6 years ago by carrano

failed scanning eith test firmware and test kernel

Changed 6 years ago by carrano

successful scanning with test firmware and test kernel

Changed 6 years ago by carrano

successful scanning with test firmware and test kernel (mesh disabled)

  Changed 6 years ago by ashish

Ricardo,

The missing debug events, which is a good sign, most likely indicate that firmware is not playing any role in failed scan/association. When scan fails do you see command timeout?

In the failed case Harumaki, is_network_compatible()failed and therefore association failed as well. Thanks

  Changed 6 years ago by mstone-xmlrpc

  • keywords cjbfor9.1.0 added
  • milestone changed from 8.2.1 to 9.1.0

Pushing out to 9.1.0, per edmcnierney's request.

  Changed 6 years ago by edmcnierney

  • milestone changed from 9.1.0 to 8.2.1

Resetting to 8.2.1 as this is a requested fix for that release.

  Changed 6 years ago by mbletsas

  • blockedby 9053 added

  Changed 6 years ago by edmcnierney

  • keywords 8.2.1:+ added; cjbfor9.1.0 removed

  Changed 6 years ago by cscott

Michalis has sent a kernel and firmware that claim to address this via email; that information should be added here. See also #8799.

  Changed 6 years ago by joe

  • cc joe added

  Changed 6 years ago by dsaxena

All the fixes for #8799, #8667, and #9048 should be in the following RPM, that also includes the Fedora logo update (#8767), psmouse tweaks (#9008), and some minor configuration and build changes that do not impact runtime.

http://dev.laptop.org/~dilinger/testing/kernel-2.6.25-20081217.24.olpc.31e049fce071030.i586.rpm

  Changed 6 years ago by cjb

  • next_action changed from package to test in release

Hard to know what the state of this bug is, but going on Deepak's last comment I'll move it to "test in release".

  Changed 6 years ago by cscott

  • next_action changed from test in release to add to build

Please put the appropriate RPMs in yr ~/public_rpms/staging on dev. (Also in joyride, if appropriate.)

  Changed 6 years ago by cscott

dsaxena: you need a ChangeLog or *.changes file in your public_rpms/staging directory, see http://wiki.laptop.org/go/Build_system#Instructions_for_Use

  Changed 6 years ago by cjb

  • next_action changed from add to build to test in release

Verified that kernel-2.6.25-20081217.24 is present in staging-9.

  Changed 6 years ago by cjb

  • next_action changed from test in release to test in build

  Changed 4 years ago by cjl

  • cc cjl, cjb added

#9053 claims to have fixed this. Can this ticket be closed?

  Changed 6 months ago by Quozl

  • status changed from new to closed
  • next_action changed from test in build to unknown
  • resolution set to fixed

Not seen since, closing.

Note: See TracTickets for help on using tickets.