Opened 6 years ago

Last modified 4 years ago

#7825 assigned defect

can't complete WPA handshake with D-Link WBR-2310

Reported by: dsd Owned by: jcardona
Priority: blocker Milestone: 8.2.0 (was Update.2)
Component: wireless Version: not specified
Keywords: 8.2.0:? relnote Cc: jcardona, carrano, gregorio, holt, sph0lt0n, dsaxena, ashishs@…, mbletsas, cjl
Blocked By: Blocking:
Deployments affected: Action Needed: test in release
Verified: yes

Description

Testing a D-Link WBR-2310 in WPA mode. Authentication completes, association too, but WPA PTK/GTK handshake fails.

There are 4 handshake messages as usual (supplicant, authenticator, supplicant, authenticator) at which point the PTK is installed. It then enters group handshake mode but wpa_supplicant does not see the next EAPOL.

I ran a system in monitor mode next to it and confirmed that the AP did send the 5th EAPOL. It did not retransmit it, suggesting that it also received acknowledgement (my monitor does not show acks). So the message is being lost somewhere.

Interestingly, it is timing-sensitive too. If I run wpa_supplicant in verbose mode on the XO console, the extra delay imposed by printing a lot of messages results in WPA PTK+GTK completing. Odd.

Attachments (10)

lost_gtk_ticket_7825.cap (839.7 KB) - added by jcardona 6 years ago.
wireshark capture showing a number of failed GTK handshakes between xo and WBR-2310. Passphrase: secretpassphrase
7825_test.log (6.9 KB) - added by jcardona 6 years ago.
Some improvement when key installation time is reduced.
0001-libertas-Reduce-the-WPA-key-installation-time.patch (7.1 KB) - added by jcardona 6 years ago.
7825-test2.txt (4.3 KB) - added by jcardona 6 years ago.
We are finally able to consistently complete the group key handshake!
0001-libertas-Restrict-fast-key-installation-to-WPA-mode.patch (1.8 KB) - added by jcardona 6 years ago.
Incremental patch based on feedback from upstream. See commit comments.
dmesg-res-temp-unavailable (16.1 KB) - added by carrano 6 years ago.
SIOCGIWSCAN: Resourse Temporarily unavailable (4 times) and them success.
zocos.fail (66.5 KB) - added by carrano 6 years ago.
fail to associate to ssid zoco (WPA AP)
0001-libertas-Execute-WPA-key-installation-in-ioctl-cont.patch (4.2 KB) - added by jcardona 6 years ago.
Applies to head of the testing branch
dmesg.failassoc (27.2 KB) - added by carrano 6 years ago.
failing to associate to zocos ssid (WPA Buffalo AP) via GUI
H.fail (50.9 KB) - added by carrano 6 years ago.
dmesg output for failing association (lbs_debug is 0x4120)

Download all attachments as: .zip

Change History (78)

comment:1 Changed 6 years ago by dsd

  • Cc sholton added
  • Owner changed from dwmw2 to jcardona

comment:2 Changed 6 years ago by dsaxena

  • Cc dsaxena added

comment:3 Changed 6 years ago by dsd

Actually my monitor does show ACKs, I just had the filters wrong. The XO does acknowledge the GTK frame, it just gets swallowed up somewhere along the way and does not make it to wpa_supplicant.

comment:4 Changed 6 years ago by sph0lt0n

  • Cc sph0lt0n added; sholton removed

comment:5 follow-ups: Changed 6 years ago by dsd

My theory is that we are actually hitting a flaw in WPA combined with some slowness of the XO (slow processor, fullmac wifi hardware) and this AP being particularly fast.

The process is normally:

  1. AP sends EAPOL
  2. STA sends EAPOL
  3. AP sends EAPOL
  4. STA sends EAPOL
  5. STA installs PTK
  6. AP sends EAPOL encrypted by PTK

Steps 1 to 4 are for PTK negotiation. The station actually knows the PTK after step 3, but it cannot install it until after (4) because that EAPOL is required to be transmitted unencrypted. (this order of operations is imposed by wpa_supplicant)

Here we meet the flaw in WPA: the AP is due to send the the message in (6) (first part of GTK transfer) which is encrypted by the PTK, but it does not consider if the STA has actually had time to install the PTK - i.e. it does not and can not confirm that the STA has had time to complete step (5).

My theory is that the XO then receives the message in (6) but has not had time to install the PTK, so decryption fails, and it throws the message away. I do not have a way to verify this though, any ideas? Can we ask the firmware to provide undecryptable frames to the driver anyway? This would at least allow for diagnosis.

The reason that this happens on this AP and not others is that this AP is particularly fast. It takes just 1.83ms to respond to the 4th EAPOL (by sending the first GTK EAPOL), whereas other AP's I have looked at take over 3ms.

Possible workarounds:

  1. Modify the firmware to not acknowledge EAPOLs that it can't decrypt. The AP would then retransmit the GTK EAPOL, buying us more time to install the guy. This is probably not realistic, because decryption probably takes too long (you have to send acks very quickly) and assuming any notion of design in the firmware, this would be a blatent layering violation...
  2. Make the firmware defer sending the (unencrypted) EAPOL in (4) until the PTK has been installed. (messy, but probably the most reliable solution if at all possible?)
  3. Make the hardware provide undecryptable EAPOLs to the driver, and let the driver fall back on software decrypt to interpret them.
  4. Get all G1G1 users to upgrade to RSN (WPA2), which does not have this flaw.

comment:6 Changed 6 years ago by cscott

  • Keywords 8.2.0:? added

comment:7 Changed 6 years ago by ashish

  • Cc ashishs@… added

What if you redirect all wpa_supplicant logs to some file (or may be run without any log i.e.,no -d option)?

comment:8 Changed 6 years ago by dsd

I don't understand the intent of the question. That's how it is run by default. It fails to connect to the AP because the 5th EAPOL is not received by wpa_supplicant.

comment:9 Changed 6 years ago by ashish

Sometimes when debug logs are enabled (-d option) association fails. Just wanted to confirm if it fails even in the absence of this option?

comment:10 Changed 6 years ago by dsd

Yes, it does.

comment:11 Changed 6 years ago by kimquirk

  • Keywords relnote added

Not a blocker for 8.2. Release note.

comment:12 Changed 6 years ago by mbletsas

  • Cc jcardona carrano added

comment:13 Changed 6 years ago by jcardona

We were able to replicate on joyride-2346.

  1. Configure the WBR-2310 security as WPA-Personal, TKIP+PSK
  2. On the xo, associate with wpa_supplicant
  3. Observe that wpa_status never transitions into the 'COMPLETED' state:
# while true; do wpa_cli status | grep state; sleep 1; done
wpa_state=SCANNING
wpa_state=SCANNING
wpa_state=ASSOCIATING
wpa_state=ASSOCIATING
wpa_state=ASSOCIATED
wpa_state=ASSOCIATED
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=SCANNING
wpa_state=SCANNING
wpa_state=ASSOCIATING
wpa_state=ASSOCIATED
wpa_state=ASSOCIATED
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE
wpa_state=GROUP_HANDSHAKE

We were unable to capture the EAPOL-GTK with our sniffer and so cannot confirm dsd's hypothesis.
But we could pass encrypted unicast traffic to the AP which confirms that the PTK was properly installed. Unicast traffic is interrupted periodically when wpa_supplicant restarts the wpa_state machine:

64 bytes from 192.168.4.1: icmp_seq=8 ttl=127 time=2.35 ms
64 bytes from 192.168.4.1: icmp_seq=9 ttl=127 time=2.31 ms
64 bytes from 192.168.4.1: icmp_seq=10 ttl=127 time=2.50 ms
64 bytes from 192.168.4.1: icmp_seq=11 ttl=127 time=2.45 ms
64 bytes from 192.168.4.1: icmp_seq=12 ttl=127 time=2.39 ms
64 bytes from 192.168.4.1: icmp_seq=13 ttl=127 time=4.31 ms
64 bytes from 192.168.4.1: icmp_seq=14 ttl=127 time=2.88 ms
64 bytes from 192.168.4.1: icmp_seq=15 ttl=127 time=2.27 ms
64 bytes from 192.168.4.1: icmp_seq=16 ttl=127 time=2.39 ms
64 bytes from 192.168.4.1: icmp_seq=17 ttl=127 time=2.38 ms
64 bytes from 192.168.4.1: icmp_seq=23 ttl=127 time=2.29 ms
....................................^^.....................                                   
64 bytes from 192.168.4.1: icmp_seq=24 ttl=127 time=2.26 ms
64 bytes from 192.168.4.1: icmp_seq=25 ttl=127 time=2.34 ms
64 bytes from 192.168.4.1: icmp_seq=26 ttl=127 time=2.66 ms
64 bytes from 192.168.4.1: icmp_seq=27 ttl=127 time=21.8 ms
64 bytes from 192.168.4.1: icmp_seq=28 ttl=127 time=3.50 ms
64 bytes from 192.168.4.1: icmp_seq=29 ttl=127 time=2.24 ms
64 bytes from 192.168.4.1: icmp_seq=30 ttl=127 time=2.32 ms
64 bytes from 192.168.4.1: icmp_seq=31 ttl=127 time=2.35 ms
64 bytes from 192.168.4.1: icmp_seq=32 ttl=127 time=2.33 ms
64 bytes from 192.168.4.1: icmp_seq=39 ttl=127 time=3.46 ms
....................................^^.....................                                   
64 bytes from 192.168.4.1: icmp_seq=40 ttl=127 time=3.25 ms
64 bytes from 192.168.4.1: icmp_seq=41 ttl=127 time=3.03 ms
64 bytes from 192.168.4.1: icmp_seq=42 ttl=127 time=2.46 ms
64 bytes from 192.168.4.1: icmp_seq=43 ttl=127 time=3.52 ms
64 bytes from 192.168.4.1: icmp_seq=44 ttl=127 time=2.30 ms
64 bytes from 192.168.4.1: icmp_seq=45 ttl=127 time=2.87 ms
64 bytes from 192.168.4.1: icmp_seq=46 ttl=127 time=4.71 ms
64 bytes from 192.168.4.1: icmp_seq=47 ttl=127 time=2.52 ms
64 bytes from 192.168.4.1: icmp_seq=48 ttl=127 time=2.60 ms
64 bytes from 192.168.4.1: icmp_seq=55 ttl=127 time=7.75 ms
....................................^^.....................                                   

comment:14 Changed 6 years ago by gregorio

Hi Javier and team,

Is there any workaround where you can configure the Wireless AP differently and make it work (e.g. WPA v2)?

Thanks,

Greg S

comment:15 Changed 6 years ago by dsd

yes, WPA2/RSN is a suitable workaround for this bug

comment:16 Changed 6 years ago by sph0lt0n

Negative.

It is not acceptable to recommend changing the configuration of the wireless access point as a workaround to a software fault in the wireless driver.

Not everyone needing to connect to this brand of router will be in a position to administer the router or choose the AP configuration (or even understand what an AP is...)

Fix the problem, peoples.

comment:17 in reply to: ↑ 5 Changed 6 years ago by jcardona

Replying to dsd:

My theory is that we are actually hitting a flaw in WPA combined with some slowness of the XO (slow processor, fullmac wifi hardware) and this AP being particularly fast.
(...)
My theory is that the XO then receives the message in (6) but has not had time to install the PTK, so decryption fails, and it throws the message away. I do not have a way to verify this though, any ideas?

I could finally consistently capture the 5th EAPOL, the one with the group key. The EAPOL-GTK is acknowledged by the station (802.11 ack) but no EAPOL response is sent to complete the two-way GTK handshake. So it looks like your hypothesis is dead on.

The reason that this happens on this AP and not others is that this AP is particularly fast.
It takes just 1.83ms to respond to the 4th EAPOL (by sending the first GTK EAPOL), whereas
other AP's I have looked at take over 3ms.

I confirmed that as well. In my captures the time between the 4th and 5th EAPOL is between 1.8 and 2.0 ms. On a Linksys WRT310N, this time is ~13 ms, and the WPA handshake is successful.

Changed 6 years ago by jcardona

wireshark capture showing a number of failed GTK handshakes between xo and WBR-2310. Passphrase: secretpassphrase

comment:18 Changed 6 years ago by mbletsas

  • Cc mbletsas added
  • Priority changed from normal to blocker

comment:19 Changed 6 years ago by mbletsas

  • Verified set

comment:20 Changed 6 years ago by jcardona

  • Status changed from new to assigned

I agree with dsd's assessment that this a WPA flaw, which was resolved in WPA2 (in the latter, the GTK is sent during the 4-way handshake, and not immediately after). We could improve the time it takes to install the key, which is currently in the order of 4-5 ms. The current execution path for the SIOCSIWENCODEEXT ioctl() involves a delayed workqueue lbs_postpone_association_work() who then wakes up the main driver thread lbs_thread().

If we make this a direct command (following the work initiated by dcbw for other commands) we could probably get down to ~1 ms, which might be sufficient for *this AP*.
In the absence of a better alternative we've started working on trimming down the key installation time.

comment:21 Changed 6 years ago by jcardona

Just for the record, other ideas that were considered...

dsd:1 Don't ack undecipherable frames: not feasible, we only know it's an undecipherable frame after it's been acknowledged.

dsd:2 Defer transmission of (unencrypted) 4th EAPOL until key is installed: not feasible, can't install key on rx path only.

dsd:3 Pass undecipherable EAPOLs to driver: Possible, but really hard. (Need an alternative way way to mark encrypted frames as 802.11 header is gone, add decryption capability to the driver...)

dsd4: Use WPA2: I recommend this, but it's probably not acceptable...

jc5: Buffer undecipherable frames in firmware and pass to host once PSK is installed: would work, but requires use of precious buffers in firmware.

jc6: Turn off RX path during the race window (i.e. immediately before sending the 4th EAPOL, and back on once the key has been installed). Not sure how the AP would behave if it runs out of retries. Ugly.

comment:22 follow-up: Changed 6 years ago by dsd

I made the same observation with the workqueue thing and made a simple modification to make the command be dispatched directly from the ioctl handler. Although the key is then programmed quicker, it was still not quick enough for this particular AP...

comment:23 in reply to: ↑ 22 Changed 6 years ago by jcardona

Replying to dsd:

I made the same observation with the workqueue thing and made a simple modification to make the command be dispatched directly from the ioctl handler. Although the key is then programmed quicker, it was still not quick enough for this particular AP...

Thanks for the info. Did you also modify the code to send the command directly to usb instead of queuing it and kicking lbs_thread()?

comment:24 Changed 6 years ago by jcardona

The attached patch reduces the key installation time by at least a factor of 4.
This seems to be sufficient to succeed in the GTK handshake on my setup, but the installation time is still on the margin and usually the supplicant will have to retry a few times.

I was able to complete the GTK handshake in 20 consecutive test iterations (test results attached), but sometimes it takes almost a minute to successfully complete WPA key exchange.

Changed 6 years ago by jcardona

Some improvement when key installation time is reduced.

Changed 6 years ago by jcardona

We are finally able to consistently complete the group key handshake!

comment:25 Changed 6 years ago by jcardona

  • Action Needed changed from never set to review

The new version of the patch further reduces the key installation time by eliminating redundant commands sent to firmware prior to PTK installation. With it we can successfully complete the WPA handshake.

Please retest with the attached patch. I will prepare another version of the patch for upstream submission.

comment:26 Changed 6 years ago by jcardona

Patch submitted upstream. Turns out that some of the simplifications I introduced were already there, resulting in a much smaller patch http://thread.gmane.org/gmane.linux.kernel.wireless.general/20547

comment:27 Changed 6 years ago by gregorio

  • Blocking 6191 added

comment:28 Changed 6 years ago by mbletsas

  • Blocking 6191 removed

comment:29 Changed 6 years ago by carrano

Netgear WNR834M with WPA: XO failed 100% of the times. After the patch, it connects 100% of the times.

A test kernel can be found at: http://www.laptop.org/teamwiki/index.php/Tech:Wireless_Test_Kernels#Test_kernel_C7
(I am sure Deepak is a better source of kernels)

comment:30 in reply to: ↑ 5 Changed 6 years ago by carrano

Replying to dsd:

My theory is that we are actually hitting a flaw in WPA combined with some slowness of the XO (slow processor, fullmac wifi hardware) and this AP being particularly fast.

The process is normally:

  1. AP sends EAPOL
  2. STA sends EAPOL
  3. AP sends EAPOL
  4. STA sends EAPOL
  5. STA installs PTK
  6. AP sends EAPOL encrypted by PTK

Steps 1 to 4 are for PTK negotiation. The station actually knows the PTK after step 3, but it cannot install it until after (4) because that EAPOL is required to be transmitted unencrypted. (this order of operations is imposed by wpa_supplicant)

Here we meet the flaw in WPA: the AP is due to send the the message in (6) (first part of GTK transfer) which is encrypted by the PTK, but it does not consider if the STA has actually had time to install the PTK - i.e. it does not and can not confirm that the STA has had time to complete step (5).

My theory is that the XO then receives the message in (6) but has not had time to install the PTK, so decryption fails, and it throws the message away. I do not have a way to verify this though, any ideas? Can we ask the firmware to provide undecryptable frames to the driver anyway? This would at least allow for diagnosis.

It all makes perfect sense and seems to explain what is happening. What I don't get is why 4 and 5 are inverted. Doesn't the standard (802.11i) states that the fourth element of the PTK handshake should be sent *after* the key is installed? (item "f" bellow)?

The 4-Way Handshake consists of the following steps:
a) The Authenticator sends an EAPOL-Key frame containing an ANonce.
b) The Supplicant derives a PTK from ANonce and SNonce.
c) The Supplicant sends an EAPOL-Key frame containing SNonce, the RSN information element from
the (Re)Association Request frame, and a MIC.
d) The Authenticator derives PTK from ANonce and SNonce and validates the MIC in the EAPOLKey
frame.
e) The Authenticator sends an EAPOL-Key frame containing ANonce, the RSN information element
from its Beacon or Probe Response messages, MIC, whether to install the temporal keys, and the
encapsulated GTK.
f) The Supplicant sends an EAPOL-Key frame to confirm that the temporal keys are installed.

Unless...

I couldn't find the draft 3 of 802.11i (on which TKIP is based) but it may be the case that for TKIP, the fourth element is just a regular 802.11 ack (type:00/subtype:1101) as some random pages around (including wikipedia) seem to imply. If this is the case, this is really an early flaw of that draft and will cause this kind of timing issue.

But...

If this is not the case, it seems that the supplicant should send the fourth frame only after the installation of the key.

Or...

Is it the case the once the key is installed the supplicant cannot send a non encrypted frame? It doesn't make sense to me.

comment:31 Changed 6 years ago by dsd

I think you hit it at the end. (f) is sent unencrypted, and we do not have a Linux interface to send unencrypted frames once keys have been installed. And in some cases where crypto is done in hardware, I would expect that the same problem exists on the hardware level too; cannot send unencrypted frames when keys are installed.

comment:32 follow-up: Changed 6 years ago by carrano

I compared results with Deepak's patched kernel (based on 2.6.25) and stock 8.2-760.

Though there is a general improvement, there is also an unexpected correlation with restarting sugar and the success rate which I can access my Netgear WNR834M.

With the patched kernel, if I cannot associate the XO to my Netgear access points more than 3 consecutive times, I restart the X (ctrl-alt-backspace) and this will fix the problem more than 90% of the times.

As I don't have other APs that are so difficult to connect in WPA as this Netgear (note that regular notebooks also have trouble with this AP model in WPA), I am not sure if this is a very specific issue with this AP. If it is not very specific there is more to the WPA issue than the key installation time.

As noted above, even with 2.6.22 (not patched) I could not connect to this AP in WPA (but I could if I patched 2.6.22 with Javier's patch).

Does anyone has other test results with other APs?

With a Buffalo WHR-G54S AP the improvement of the success rate was from 50% to about 80%. With a Linksys WRT54G AP, there were no problems in connecting at all.

In short, no doubt that reducing key installation time improved matters, the question is what other issues may prevent us to associate to WPA APs.

comment:33 in reply to: ↑ 32 ; follow-ups: Changed 6 years ago by carrano

Replying to carrano:

In short, no doubt that reducing key installation time improved matters, the question is what other issues may prevent us to associate to WPA APs.

Actually, I'll step back a little and rephrase that: "What is preventing us to associate", since this issue I am facing may not be related to WPA. Still, I'll post this here for now, and ask for your opinions whether this is material for another ticket.

What I notice is that during repetitions of the associate to WPA test there are some failures that do not seem related to WPA. I've been trying to capture logs. For now, what I have is a lot of nonsensical scanning information in dmesg output.

  • APs listed with wrong SSIDs
  • APs listed with wrong mac addresses
  • APs with 00:00:00:00:00:00 mac address
  • bogus hidden SSIDs
  • Deauthentications due to "New BSSID"

One interesting example:

[ 1514.829213] libertas scan: 00: BSSID 02:2d:3d:c5:45:07, RSSI 32, SSID 'olpc-mesh'
[ 1514.829213] libertas scan: 01: BSSID 00:00:00:00:00:00, RSSI 0, SSID '<hidden>'
[ 1514.829213] libertas scan: 02: BSSID 00:14:6c:df:aa:00, RSSI 60, SSID 'Harumaki'
[ 1514.829213] libertas scan: 03: BSSID 00:16:01:84:2b:0f, RSSI 64, SSID 'zocos'
[ 1514.829213] ADDRCONF(NETDEV_CHANGE): msh0: link becomes ready
[ 1514.831117] libertas scan: is_network_compatible() FAILED: wpa_ie 0xdd wpa2_ie 0x0 WEP d WPA d WPA2 d privacy 0x10
[ 1514.831153] libertas scan: is_network_compatible() FAILED: wpa_ie 0xdd wpa2_ie 0x0 WEP d WPA d WPA2 d privacy 0x10
[ 1514.831181] libertas assoc: Could not find best network
[ 1514.831201] libertas assoc: ASSOC: reconfiguration attempt unsuccessful: -101
[ 1514.831250] libertas assoc: Association Request:
[ 1514.831261]     flags:     0x00000770
[ 1514.831271]     SSID:      'zocos'
[ 1514.831279]     chann:     11
[ 1514.831288]     band:      0
[ 1514.831296]     mode:      2
[ 1514.831304]     BSSID:     00:00:00:00:00:00
[ 1514.831314]     secinfo:  
[ 1514.831322]     auth_mode: 1
[ 1514.831348] libertas assoc: Deauthenticating due to new BSSID

It seems that scanning is generating garbage and, as this is the initial phase of an association it will have a domino effect over the process.

Note: this tests are using the kernel Deepak patched for us and 8.2-760.

What do you think? Should we spin off a scanning issue ticket or am I chasing ghosts?

comment:34 Changed 6 years ago by mbletsas

Ricardo,

The obvious thing missing from your report is whether this behavior changes with Javier's patch.

What do you see with an unpatched laptop?

M.

comment:35 Changed 6 years ago by mbletsas

I haven't been able to replicate the behavior Ricardo observed at 1CC.
We do know that the scan handling code has been problematic in the past, so we should probably track this on a different ticket.

M

Changed 6 years ago by jcardona

Incremental patch based on feedback from upstream. See commit comments.

comment:36 in reply to: ↑ 33 Changed 6 years ago by jcardona

Replying to carrano:

Replying to carrano:

In short, no doubt that reducing key installation time improved matters, the question is what other issues may prevent us to associate to WPA APs.

Actually, I'll step back a little and rephrase that: "What is preventing us to associate", since this issue I am facing may not be related to WPA. Still, I'll post this here for now, and ask for your opinions whether this is material for another ticket.

What I notice is that during repetitions of the associate to WPA test there are some failures that do not seem related to WPA. I've been trying to capture logs. For now, what I have is a lot of nonsensical scanning information in dmesg output.

  • APs listed with wrong SSIDs
  • APs listed with wrong mac addresses
  • APs with 00:00:00:00:00:00 mac address
  • bogus hidden SSIDs
  • Deauthentications due to "New BSSID"

One interesting example:

[ 1514.829213] libertas scan: 00: BSSID 02:2d:3d:c5:45:07, RSSI 32, SSID 'olpc-mesh'
[ 1514.829213] libertas scan: 01: BSSID 00:00:00:00:00:00, RSSI 0, SSID '<hidden>'
[ 1514.829213] libertas scan: 02: BSSID 00:14:6c:df:aa:00, RSSI 60, SSID 'Harumaki'
[ 1514.829213] libertas scan: 03: BSSID 00:16:01:84:2b:0f, RSSI 64, SSID 'zocos'
[ 1514.829213] ADDRCONF(NETDEV_CHANGE): msh0: link becomes ready
[ 1514.831117] libertas scan: is_network_compatible() FAILED: wpa_ie 0xdd wpa2_ie 0x0 WEP d WPA d WPA2 d privacy 0x10
[ 1514.831153] libertas scan: is_network_compatible() FAILED: wpa_ie 0xdd wpa2_ie 0x0 WEP d WPA d WPA2 d privacy 0x10
[ 1514.831181] libertas assoc: Could not find best network
[ 1514.831201] libertas assoc: ASSOC: reconfiguration attempt unsuccessful: -101
[ 1514.831250] libertas assoc: Association Request:
[ 1514.831261]     flags:     0x00000770
[ 1514.831271]     SSID:      'zocos'
[ 1514.831279]     chann:     11
[ 1514.831288]     band:      0
[ 1514.831296]     mode:      2
[ 1514.831304]     BSSID:     00:00:00:00:00:00
[ 1514.831314]     secinfo:  
[ 1514.831322]     auth_mode: 1
[ 1514.831348] libertas assoc: Deauthenticating due to new BSSID

It seems that scanning is generating garbage and, as this is the initial phase of an association it will have a domino effect over the process.

Note: this tests are using the kernel Deepak patched for us and 8.2-760.

What do you think? Should we spin off a scanning issue ticket or am I chasing ghosts?

A while ago the association related tasks were moved to a deferred work queue. One of the delayed tasks was the installation of keys. My patch removes the artificial delays in key installation. It may have affected how association requests are constructed and the command sequence sent to the firmware.

So cannot have affected scanning, but it may have affected association.

  1. Could you run your test with 0x4121 > /sys/modules/libertas/parameters/libertas_debug? That would show us the command sequence sent to firmware.
  1. Can you post your testing procedure to the ticket?

Thanks!

comment:37 follow-up: Changed 6 years ago by jcardona

This debug messasage:

[ 1514.831250] libertas assoc: Association Request:
[ 1514.831261]     flags:     0x00000770
[ 1514.831271]     SSID:      'zocos'
[ 1514.831279]     chann:     11
[ 1514.831288]     band:      0
[ 1514.831296]     mode:      2
[ 1514.831304]     BSSID:     00:00:00:00:00:00
[ 1514.831314]     secinfo:  
[ 1514.831322]     auth_mode: 1

is generated when the WPA key is installed. No association request is transmitted when that happens. (I'll prepare a patch to omit this debug message in this situation)

comment:38 in reply to: ↑ 33 Changed 6 years ago by jcardona

Replying to carrano:

Replying to carrano:

  • Deauthentications due to "New BSSID"
[ 1514.831348] libertas assoc: Deauthenticating due to new BSSID

Aha! When we install the WPA key we disassociate from the current BSS, I see.
Let me ruminate what would be the best way to address this...

comment:39 follow-ups: Changed 6 years ago by cscott

Is this in the testing kernel? If so, what was the git commit? Thanks.

comment:40 in reply to: ↑ 39 Changed 6 years ago by dsaxena

Replying to cscott:

Is this in the testing kernel? If so, what was the git commit? Thanks.

abe7a66b36f344a822a5fa17f87c49a25aa05790 is the first version of this patch, before feedback from community. It is what Ricardo has been testing.

comment:41 in reply to: ↑ 39 Changed 6 years ago by carrano

Replying to cscott:

Is this in the testing kernel? If so, what was the git commit? Thanks.

Scott,
All the tests I performed since comment:32 used the kernel provided by Deepak (http://dev.laptop.org/~dilinger/testing/kernel-2.6.25-20080915.5.olpc.abe7a66b36f344a.i586.rpm)

comment:42 in reply to: ↑ 37 ; follow-up: Changed 6 years ago by carrano

Replying to jcardona:

This debug messasage:

[ 1514.831250] libertas assoc: Association Request:
[ 1514.831261]     flags:     0x00000770
[ 1514.831271]     SSID:      'zocos'
[ 1514.831279]     chann:     11
[ 1514.831288]     band:      0
[ 1514.831296]     mode:      2
[ 1514.831304]     BSSID:     00:00:00:00:00:00
[ 1514.831314]     secinfo:  
[ 1514.831322]     auth_mode: 1

is generated when the WPA key is installed. No association request is transmitted when that happens. (I'll prepare a patch to omit this debug message in this situation)

Yes, I confirm the following:

What I was investigating were these void attempts to associate as
registered in:

[ 4625.171741] libertas assoc: Association Request:
[ 4625.171754]     flags:     0x00000752
[ 4625.171763]     SSID:      '<hidden>'
[ 4625.171773]     chann:     6
[ 4625.171781]     band:      0
[ 4625.171789]     mode:      2
[ 4625.171798]     BSSID:     00:00:00:00:00:00
[ 4625.171808]     secinfo:
[ 4625.171816]     auth_mode: 1

These attemps to not generate equivalent assoc request frames.

For the record, a test was performed with no NM or GUI (using wpa_supplicant on the
command line):

While observing the sniffer and the XO, I noticed the following timeline:

  1. I start the wpa_supplicant,
  1. the sniffer shows nothing while XO reports

ioctl[SIOCGIWSCAN]:Resource temporary unavailable" 6 times

  1. dmesg registers:
[ 8062.259465] ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 8063.289970] libertas assoc: Association Request:
[ 8063.289983]     flags:     0x00000750
[ 8063.289992]     SSID:      '<hidden>'
[ 8063.290001]     chann:     6
[ 8063.290010]     band:      0
[ 8063.290018]     mode:      2
[ 8063.290026]     BSSID:     00:00:00:00:00:00
[ 8063.290036]     secinfo:
[ 8063.290044]     auth_mode: 1
  1. a assoc req frame is finally detected (and it is correctly formed).
  1. assoc completes on the XO ("CTRL-EVENT-CONNECTED").
  1. dmesg registers:
    [ 8066.351088] libertas assoc: Association Request:
    [ 8066.351088]     flags:     0x00000c36
    [ 8066.351088]     SSID:      'Harumaki'
    [ 8066.351088]     chann:     6
    [ 8066.351088]     band:      0
    [ 8066.351088]     mode:      2
    [ 8066.351088]     BSSID:     00:14:6c:df:aa:00
    [ 8066.351088]     secinfo:   WPA
    [ 8066.351088]     auth_mode: 1
    [ 8066.351089] libertas scan: is_network_compatible() WPA: wpa_ie 0xdd
    wpa2_ie 0x0 WEP d WPA e WPA2 d privacy 0x10
    [ 8066.351089] libertas assoc: ASSOC_CMD: num rates 12
    [ 8066.351089] libertas assoc: ASSOC_CMD: capability 0x0431
    [ 8066.396652] libertas assoc: ASSOC_RESP: mac_control is 0x23
    [ 8066.396747] libertas assoc: ASSOC: lbs_associate(bssid) returned 0
    [ 8066.396781] libertas assoc: ASSOC: associated to 'Harumaki',
    00:14:6c:df:aa:00
    

comment:43 Changed 6 years ago by jcardona

Replying to jcardona:

Aha! When we install the WPA key we disassociate from the current BSS, I see.
Let me ruminate what would be the best way to address this...

Nonsense, javier. Installing the WPA key cannot trigger that disassociation.

Just dumping my notes on this problem here...

The association thread is passed an association request structure (assoc_req). The assoc_req contains information about what tasks should be executed when the thread is executed.

Up until the "reduce-WPA-installation" patch
(abe7a66b36f344a822a5fa17f87c49a25aa05790) a burst of ioctls would be
collected into a single assoc_req and passed to the association thread for
later execution. One of those ioctls (SIOCSIWENCODEEXT) is the one that
installs the WPA keys. This deferred execution will cause the WPA problem
that's reported in this ticket.

The patch forces the association thread to be scheduled without delay if an
SIOCSIWENCODEEXT ioctl is received. That could result in the execution of an
incomplete assoc_req if the SIOCSIWENCODEEXT ioctl is received in the middle
of ioctl burst. Ricardo debug logs will reveal if this is what he is seeing.

comment:44 in reply to: ↑ 42 Changed 6 years ago by jcardona

Replying to carrano:

So, other than the misleading debug message, do you have any other problems associating to WPA APs?

Changed 6 years ago by carrano

SIOCGIWSCAN: Resourse Temporarily unavailable (4 times) and them success.

comment:45 Changed 6 years ago by carrano

For this dmesg output, the test performed was:

  • Kernel commit is abe7a66b36f344a822a5fa17f87c49a25aa05790
  • Command line wpa_supplicant. NM not running.
  • Association was eventually successful after SIOCGIWSCAN: Recourse Temporarily unavailable (4 times)

comment:46 Changed 6 years ago by jcardona

BTW, apparently the WPA specification (version 3.1, 2004) states that the AP should reply "if the EAPOL_Key packet sent out fails to obtain a response from the Supplicant". Not that we can argue with widely deployed Access Points, though...

(Thanks to Rajesh for that info)

comment:47 follow-up: Changed 6 years ago by cscott

  • Action Needed changed from review to approve for release

Requesting kernel-2.6.25-20080922.1.olpc.f2c3fe6773a1021.src.rpm for inclusion in stable. This commit has both the original abe7a6 patch, and the more recent one by Javier which matches the patch submitted upstream.

comment:48 in reply to: ↑ 47 Changed 6 years ago by dsaxena

Replying to cscott:

Requesting kernel-2.6.25-20080922.1.olpc.f2c3fe6773a1021.src.rpm for inclusion in stable. This commit has both the original abe7a6 patch, and the more recent one by Javier which matches the patch submitted upstream.

NACK.

As mentioned in my response to the email thread on this topic, the testing
kernel branch includes a change to the mouse driver 93970329291a9ff73301a96a27c4af274f096835) and having now looked at this, I don't really feel comfortable pulling it into a release candidate as the comments in the commit do not explain why this commit was done and where the bits come from. I will revert, push, and we can then pull the new RPM.

comment:49 Changed 6 years ago by dsaxena

I have pushed the following commit to testing. Please use the resulting RPM for the 8.2 build.

commit 38b5fedf917fc36d68f9712541d46dbccb4d7cb1
Author: Deepak Saxena <dsaxena@laptop.org>
Date:   Mon Sep 22 15:19:35 2008 -0700

    Revert "psmouse: upstream sync, fix low power toggle bugs"

    This reverts commit 93970329291a9ff73301a96a27c4af274f096835.

    The original commit contains changes to psmouse-base.c and
    olpc.c that mention an upstream fix and fixing a low power
    toggle bug but the commit does not point to an upstream source
    or describe what the bug is. Upon searching lkml, it looks
    like this commit is a merge of several patches sent upstream.

    As this has not received wide-spread testing yet, I am
    reverting for now and we can add the changes back in
    properly as separate commits after the 8.2 release.

    Signed-off-by: Deepak Saxena <dsaxena@laptop.org>

comment:50 Changed 6 years ago by cscott

kernel-2.6.25-20080922.2.olpc.38b5fedf917fc36.src.rpm is the package in question.

comment:51 Changed 6 years ago by mstone

  • Action Needed changed from approve for release to add to release

Approved.

comment:52 Changed 6 years ago by cscott

  • Action Needed changed from add to release to test in release

Please test build 761 resulting from build-repo commit d3a00642.

comment:53 Changed 6 years ago by carrano

With the new kernel (38b5fedf917fc36), two batteries of tests (separated by reboot) showed mixed results:

1 - tests alternating association with two WPA APs succeeded in more than 30 interactions (15 each AP).

2 - tests alternating association with two WPA APs (same APs as in "1") succeeded 100% for AP-1 and 0% for AP-2.

Note: association was commanded in the GUI

dmesg output for failing to associate to AP-2 (ssid zocos) is attached.

Changed 6 years ago by carrano

fail to associate to ssid zoco (WPA AP)

Changed 6 years ago by jcardona

Applies to head of the testing branch

comment:54 follow-up: Changed 6 years ago by jcardona

Ricardo,

I pulled out the key installation from the association task. Could you see if that resolves the problems that you currently see with association?

Thanks,

comment:55 in reply to: ↑ 54 Changed 6 years ago by carrano

Replying to jcardona:

Ricardo,

I pulled out the key installation from the association task. Could you see if that resolves the problems that you currently see with association?

Thanks,

Deepak,
Could you please provide a new kernel with that patch?
Thank you!

comment:56 Changed 6 years ago by dsaxena

Pushed as f10b654367d7065d50e7d5e3649933453cea3102 to testing repo, RPM soon.

comment:57 follow-up: Changed 6 years ago by jcardona

I need a good way to replicate the new problems reported in this ticket. My original unit test against the WBR-2310 was always failing, and now it always passes, so is not useful anymore.

I've now set up 3 APs, one with WPA, one with WPA2 and one with WEP40 security protocols. And I wrote this test script:

#! /bin/bash

function wait_for_assoc()
{
        WPA_STATUS=`wpa_cli status | grep 'wpa_state' | cut -d= -f2`
        retries=0
        while [ "$WPA_STATUS" != COMPLETED ]
        do
                WPA_STATUS=`wpa_cli status | grep 'wpa_state' | cut -d= -f2`
                retries=$((retries+1))
                [ $retries -lt 10000 ] || exit 1
        done
        echo $WPA_STATUS
}

for i in `seq 0 20`
do
        killall wpa_supplicant
        sleep 1
        n=$((i%3))
        SEC_TYPE=`cat conf/$n.conf  | grep info | cut -d: -f2`
        echo -n "Trying AP$n ($SEC_TYPE)..."
        wpa_supplicant -B -Dwext -ieth1 -cconf/$n.conf 2> /dev/null 
        wait_for_assoc 
done

echo PASS
exit 0

(note that the test requires wpa_supplicant configuration files named conf/0.conf, conf/1.conf and conf/2.conf)

I get perfect results:

Trying AP0 (WPA2)...COMPLETED
Trying AP1 (WPA)...COMPLETED     # AP1 is the WBR-2310
Trying AP2 (WEP40)...COMPLETED
Trying AP0 (WPA2)...COMPLETED
Trying AP1 (WPA)...COMPLETED
Trying AP2 (WEP40)...COMPLETED
Trying AP0 (WPA2)...COMPLETED
Trying AP1 (WPA)...COMPLETED
Trying AP2 (WEP40)...COMPLETED
Trying AP0 (WPA2)...COMPLETED
Trying AP1 (WPA)...COMPLETED
Trying AP2 (WEP40)...COMPLETED
Trying AP0 (WPA2)...COMPLETED
Trying AP1 (WPA)...COMPLETED
Trying AP2 (WEP40)...COMPLETED
Trying AP0 (WPA2)...COMPLETED
Trying AP1 (WPA)...COMPLETED
Trying AP2 (WEP40)...COMPLETED
Trying AP0 (WPA2)...COMPLETED
Trying AP1 (WPA)...COMPLETED
Trying AP2 (WEP40)...COMPLETED
PASS

If anyone out there wants to run the provided test script I would be very grateful.

comment:59 in reply to: ↑ 57 Changed 6 years ago by carrano

Replying to jcardona:

I've now set up 3 APs, one with WPA, one with WPA2 and one with WEP40 security protocols. And I wrote this test script:

Javier,

Not sure why the script is not working for me:

  • Switched to eth0 (you use eth1)
  • Switched to modulo2 (since I want to test with 2 WPA APs)
  • Removed white space after #!

when wpa_cli is evoked it does not find the wpa_supplicant running and the script returns:

Failed to connect to wpa_supplicant - wpa_ctrl_open: No such file or directory

My 0.conf supplicant config file:

network={
	ssid="zocos"
	#psk="xxxxxxxx"
	psk=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
}

It seems that your conf file has something like a "info" line, though this is obviously not the problem. Can you provide a sample?

Anyway, I am not associating with the latest kernel, please see next comment.

comment:60 Changed 6 years ago by carrano

Brief testing with http://dev.laptop.org/~dilinger/testing/kernel-2.6.25-20080923.3.olpc.f10b654367d7065.i586.rpm

  • Association with wpa_supplicant (CLI), with no NetworkManger running succeeds for both wpa test aps. (3 tests with each, all successful)
  • Association via GUI fails most of the times. I'll attach a dmesg output.

Changed 6 years ago by carrano

failing to associate to zocos ssid (WPA Buffalo AP) via GUI

comment:61 Changed 6 years ago by carrano

for this dmesg output an attempt was made via GUI. After some time the association was not completed and another AP started blinking. Curiously:

  • the dmesg do not register the command 0x0050.
  • the three "libertas assoc: Association Request:" blocks bring bogus info.
  • no association frame was captured by a monitoring station.

It seems that the clicking didn't result in an actual (command 0x50 and association request frame) attempt to connect.

comment:62 Changed 6 years ago by carrano

Testing the possible interference of setting /sys/modules/libertas/parameters/libertas_debug flags on the testing results.

(please read through since there are some twists in the way)

Z and H are two WPA access points. First the tests seemed to confirm a strong correlation, but later they seem to not change anything.

These are GUI-only tests

libertas_debug: 0x4120
H: 5/5 (five hits in five runs)
Z: 1/5 (only one success in five runs)

libertas_debug: 0x0100
H: 5/5
Z: 5/5

libertas_debug: 0x4120
H: 5/5
Z: 5/5

libertas_debug: 0x0100
H: 5/5
Z: 5/5

libertas_debug: 0x4301
H: 5/5
Z: 5/5

libertas_debug: 0xfffffff
H: 5/5
Z: 5/5
Note: this will make the XO very slow:
load: 1.80 1.00 0.95

libertas_debug: 0x0
H: 5/5
Z: 5/5

Everything seemed fine and in contrast with my yesterday results (when fails prevailed) with the same controlled test (same APs, same config, same XO running same software (build: 760+ kernel: f10b654367d7065)). Until I restarted the XO and association started failing again. Please see H,fail attached file.

It is probably too early to correlate the problem to an initializing condition, but I'll keep investigating what steps in the test setup might interfere.

I suggest that if facing total success or failure in these tests, another test battery be started after reboot, This may change the results dramatically.

At first I am ruling out spectrum conditions, since I am monitoring them to be very similar during the test batteries.

Changed 6 years ago by carrano

dmesg output for failing association (lbs_debug is 0x4120)

comment:63 follow-up: Changed 6 years ago by mbletsas

Ricardo,

I have seen a strong correlation between failure and attemts to associate immediately after the laptop has booted (as opposed to few minutes after that when NM has finished going through its connection state machine)

M.

comment:64 follow-up: Changed 6 years ago by jcardona

I believe that the patches attached to this ticket resolve the original problem described in the summary line. We still have to combine the patches and submit upstream.

I have created ticket #8667 to track the scanning problems that Ricardo reports in the ticket.

comment:65 in reply to: ↑ 63 Changed 6 years ago by carrano

Replying to mbletsas:

Ricardo,

I have seen a strong correlation between failure and attemts to associate immediately after the laptop has booted (as opposed to few minutes after that when NM has finished going through its connection state machine)

M.

M,

In the tests I reported, I wait about five minutes before starting. It used to take a little less than 4 minutes for NM to end his cycle. Any way I'll recheck this time to see if it increased in recent builds.

comment:66 in reply to: ↑ 64 Changed 6 years ago by carrano

Replying to jcardona:

I believe that the patches attached to this ticket resolve the original problem described in the summary line. We still have to combine the patches and submit upstream.

I have created ticket #8667 to track the scanning problems that Ricardo reports in the ticket.

I agree that the original issue reported in this ticket is fixed and also that we have other association issues. So, maybe not close already, but at least switch to other tickets is a good idea. I'll start another one for bogus entries on the dmesg output (though lesser priority) and maybe other one after I review the whole thread.

comment:67 Changed 4 years ago by cjl

  • Cc cjl added

dsd, while this is a long and hairy ticket and it has spun off children, do you agree with comment 65 above that the headlined issue "can't complete WPA handshake with D-Link WBR-2310" has been fixed?

comment:68 Changed 4 years ago by dsd

no idea

it was fixed for a time, then we regressed, and i suspect its broken again now

Note: See TracTickets for help on using tickets.