Ticket #8799 (closed defect: fixed)

Opened 6 years ago

Last modified 2 years ago

WPA association issue when attempted from GUI: XO do not respond to first EAPOL frame

Reported by: carrano Owned by: jcardona
Priority: high Milestone: 8.2.1
Component: wireless Version: not specified
Keywords: cjbfor9.1.0 8.2.1:+ Cc: mbletsas, dsaxena, mstone, joe
Action Needed: no action Verified: no
Deployments affected: Blocked By:
Blocking:

Description

When analysing traffic captures to investigate #8667, Javier Cardona (Cobybit) noted that the XO is not responding to the first EAPOL frame from the Access Point.

The above was noted in fails to associate to two different models of Access Points in WPA mode: Buffalo WHR-G54S and Netgear WNR834M.

The traffic captures are attached.

Attachments

netgear-fail.pcap (227.7 kB) - added by carrano 6 years ago.
failing to associate to Harumaki ssid (WPA Netgear AP) via GUI
buffalo-wpa-fail.pcap (189.9 kB) - added by carrano 6 years ago.
failing to associate to zocos ssid (WPA Buffalo AP) via GUI
wpa_supplicant.failed.log (16.6 kB) - added by colin 5 years ago.
wpa_supplicant.log from an unsuccessful attempt to connect to aloha-test with the GUI
wpa_supplicant.success.log (12.8 kB) - added by colin 5 years ago.
wpa_supplicant.log from a successful attempt to connect to aloha-test from the command line
nm-patch-1.patch (325 bytes) - added by colin 5 years ago.
workaround patch which improves success rate

Change History

Changed 6 years ago by carrano

failing to associate to Harumaki ssid (WPA Netgear AP) via GUI

Changed 6 years ago by carrano

failing to associate to zocos ssid (WPA Buffalo AP) via GUI

follow-up: ↓ 2   Changed 6 years ago by colin

I have been trying to replace this problem. I have a bash script which switches between two access points repeatedly, waiting until the OLPC has associated to one to switch to the other.

I thought that maybe high system load could be a factor in causing this condition. So I have been using a program called "stress," available at http://weather.ou.edu/~apw/projects/stress/, to generate various different kinds of load.

However, so far, I have not seen any failures.

The main difference between my test setup and the one described in this bug is probably that I am not running the GUI. Also, the brands of AP that I am using are different; I do not have a Buffalo or Netgear AP to use.

I'm going to try associating through the GUI to see if that works as well.

in reply to: ↑ 1   Changed 5 years ago by carrano

The main difference between my test setup and the one described in this bug is probably that I am not running the GUI. Also, the brands of AP that I am using are different; I do not have a Buffalo or Netgear AP to use.

I'm going to try associating through the GUI to see if that works as well.

In CLI mode I don't get this bug either. This seems normal in wpa timing related issues. So, if you can repeat this in GUI and you can connect normally this would be also a useful test. Btw, what is the AP model you are testing with?

  Changed 5 years ago by mbletsas

The reports are consistent with what I observe. My theory at this point is that when NM is running it issues commands to the radio while the association is still in progress, making the association fail.

Empirically, I have had 100% association success lately via the GUI by clicking on the AP when the radio is idle (judged by observing the neighborhood view and the activity lights)

M.

  Changed 5 years ago by colin

I have reproduced the bug.

I did some manual testing with the GUI, and the bug appeared clearly. I can always associate with open APs, but seldom with WPA ones.

Unfortunately, we are not able to see the wpa_supplicant logs to see what is going wrong, because that program's standard out and standard error are being redirected to /dev/null.

NetworkManager spawns wpa_supplicant. I examined the source for nm-olpc. It seems like some attempt was made to provide access to the wpa_supplicant logs. If the file "/etc/NetworkManager/log-supplicant" exists, NetworkManager will supply the -f option to wpa_supplicant. However, the version of wpa_supplicant on the OLPC can't accept just an -f; it needs a logfile parameter for -f. The end result is that after you do "touch /etc/NetworkManager/log-supplicant", your wifi won't work at all, until you delete that file :[

I am working on getting the wpa_supplicant logs out of there... hopefully they will be helpful.

Changed 5 years ago by colin

wpa_supplicant.log from an unsuccessful attempt to connect to aloha-test with the GUI

Changed 5 years ago by colin

wpa_supplicant.log from a successful attempt to connect to aloha-test from the command line

  Changed 5 years ago by colin

I have posted the two wpa_supplicant.log files (failure and success.)

It seems to be a libertas issue. I'll check it out.

  Changed 5 years ago by jcardona

  • spec_stage set to unknown
  • spec_reviewed set to 0

We currently suspect that the security info configuration is overwritten/trashed between the point it is set via ioctls and the moment the association worker picks it up.

The following log shows a failure with an WPA-enabled AP:

# Assoc Request is asking to update the security info field, but WPAenabled is unset
# This will turn off WPA in this association request 
[  520.445124] libertas assoc: Association Request:
[  520.445135]     flags:     0x00000f32
[  520.445144]     flags text:[SSID][MODE][BSSID][WPA_MCAST_KEY][WPA_UCAST_KEY][SECINFO][WPA_IE]
[  520.445159]     SSID:      'aloha-test'
[  520.445168]     chann:     6
[  520.445176]     band:      0
[  520.445184]     mode:      2
[  520.445192]     BSSID:     00:00:00:00:00:00
[  520.445201]     secinfo:
[  520.445209]     auth_mode: 1
...
# indeed, association fails because of security configuration mismatch
[  521.853208] libertas enter: is_network_compatible():204
[  521.853233] libertas scan: is_network_compatible() FAILED: wpa_ie 0xdd wpa2_ie 0x0 WEP d WPA d WPA2 d privacy 0x10
[  521.853260] libertas leave: is_network_compatible():253, matched: 0
[  521.853282] libertas leave: lbs_find_ssid_in_list():1048, found_bss 00000000
[  521.853303] libertas assoc: SSID not found; cannot associate
[  521.853323] libertas leave: assoc_helper_essid():75, ret 0
[  521.853343] libertas leave: assoc_helper_associate():134, ret 0

No solution yet, but getting closer.

  Changed 5 years ago by jcardona

One of the failure modes that we see ("bad-secinfo") is characterized by using an invalid security configuration to associate with a WPA AP. This only happens when the association is requested from the GUI. A log showing this failure is included in my previous comment:

[  520.445201]     secinfo:             #  <---- should be WPA

In my last comment I suspected the driver clearing secinfo. Turns out that the security info is cleared by wpa_supplicant via a SIOCSIWAUTH ioctl prior to associating:

[70110.612569] libertas wext: IW_AUTH_WPA_VERSION_DISABLED

In addition to this, we've observed that this failure always happens when NetworkManager invokes wpa_supplicant with a ap_scan mode set to 2 (see /usr/share/doc/wpasupplicant/examples/README.wpa_supplicant.conf for details about scan modes). In those instances when NM invokes wpa_supplicant with ap_scan mode set to 1 we do not observe this failure.

We can successfully associate with our AP by invoking wpa_supplicant in both modes, 1 and 2. But looking at NM source code we see that AP_SCAN 2 is only used if:

if (!nm_ap_get_broadcast (ap) || is_adhoc || !supports_wpa)
                ap_scan = "AP_SCAN 2";

And none of these conditions should be true:

  • !nm_ap_get_broadcast (ap): true for hidden APs that do not broadcast their SSID (not our case)
  • is_adhoc: if the network is ad_hoc
  • !supports_wpa: if the driver reports that the hardware can't do WPA

Our conclusion at this time is that this failure occurs because NM gets invalid information about the AP and invokes wpa_supplicant with invalid security parameters.

  Changed 5 years ago by jcardona

  • next_action changed from never set to communicate
  • summary changed from WPA association issue: XO do not respond to first EAPOL frame to WPA association issue when attempted from GUI: XO do not respond to first EAPOL frame

We can work around this problem by deleting nminfo.py persistent list of preferred networks.

If we issue the following command on the shell before trying to associate from the GUI:

init 3 && rm /home/olpc/.sugar/default/nm/networks.cfg  && init 5

we have not seen it fail (in over 20 attempts). This workaround only works if the correct password is entered on the first attempt. If you enter and invalid password you'll have to try again.

  Changed 5 years ago by jcardona

If you enter and invalid password you'll have to try again.

meant:

If you enter an invalid password you'll have to issue the shell commands again.

follow-up: ↓ 11   Changed 5 years ago by mbletsas

Using Ubuntu 8.04 on an XO (with the OLPC kernel from build 767), I observe that Network Manager doesn't always report the encryption on our WPA-PSK ESSIDs properly (networks show in the list without the encryption padlock). An attempt to associate with the WPA-PSK network fails in that case.

The problem is intermittent. If the network is listed as encrypted, attempting to associate with it succeeds.

The problem seems to be in NM's parsing of the encryption info since iwlist scan reports the correct information for the WPA-PSK ESSIDs

in reply to: ↑ 10 ; follow-up: ↓ 12   Changed 5 years ago by colin

Hi,

Try deleting /home/olpc/.sugar/default/nm/networks.cfg. Do you still see WPA networks show up without the padlock?

P.S. I have a patch to NetworkManager which improves the frequency with which association works. It's not perfect yet, though. There is still apparently some mechanism by which NM can become convinced that a WPA network is an unprotected network.

Colin

Replying to mbletsas:

Using Ubuntu 8.04 on an XO (with the OLPC kernel from build 767), I observe that Network Manager doesn't always report the encryption on our WPA-PSK ESSIDs properly (networks show in the list without the encryption padlock). An attempt to associate with the WPA-PSK network fails in that case. The problem is intermittent. If the network is listed as encrypted, attempting to associate with it succeeds. The problem seems to be in NM's parsing of the encryption info since iwlist scan reports the correct information for the WPA-PSK ESSIDs

in reply to: ↑ 11 ; follow-ups: ↓ 13 ↓ 15   Changed 5 years ago by mbletsas

Replying to colin:

Hi, Try deleting /home/olpc/.sugar/default/nm/networks.cfg. Do you still see WPA networks show up without the padlock? P.S. I have a patch to NetworkManager which improves the frequency with which association works. It's not perfect yet, though. There is still apparently some mechanism by which NM can become convinced that a WPA network is an unprotected network. Colin

Colin,

There is no sugar networks default file to delete in my setup (vanilla ubuntu with just the OLPC kernel). It is interesting that the padlock comes back and forth and it seems to only happen in WPA networks (not WPA2 or WEP)

M.

Changed 5 years ago by colin

workaround patch which improves success rate

in reply to: ↑ 12   Changed 5 years ago by colin

Replying to mbletsas:

Replying to colin:

Hi, Try deleting /home/olpc/.sugar/default/nm/networks.cfg. Do you still see WPA networks show up without the padlock? P.S. I have a patch to NetworkManager which improves the frequency with which association works. It's not perfect yet, though. There is still apparently some mechanism by which NM can become convinced that a WPA network is an unprotected network. Colin

Colin, There is no sugar networks default file to delete in my setup (vanilla ubuntu with just the OLPC kernel). It is interesting that the padlock comes back and forth and it seems to only happen in WPA networks (not WPA2 or WEP) M.

Hi,

This is a long shot, but... Do you have any access points with the same SSID? There is a known flaw in NetworkManager 0.6.xx where different access points with different security settings but the same SSID can be merged. The flaw is in the public API of NetworkManager and is not corrected until the 0.7 series.

Also... can you try my patch and see if the problem still appears? I posted it here. I can re-generate it for the slightly newer version of NM that you have, if you want.

Colin

  Changed 5 years ago by mbletsas

Colin,

Can you post some NM binaries also?

B.t.w. we do have 3 APs on the same ESSID here, however they all use the same security settings.

M.

in reply to: ↑ 12   Changed 5 years ago by jcardona

Replying to mbletsas:

Replying to colin:

Try deleting /home/olpc/.sugar/default/nm/networks.cfg.

There is no sugar networks default file to delete in my setup (vanilla ubuntu with just the OLPC kernel).

In ubuntu and others, the NM stores the list of preferred networks in GConf. You can probably access it by launching gconf-editor from the shell.

follow-up: ↓ 17   Changed 5 years ago by jcardona

OK, we believe we've found the root cause for this.

The sequence that leads to the failure is the following:

1. The NetworkManager applet will parse networks.cfg and pass the list of preferred networks to NetworkManager.

2. NM creates instances of network objects from the previous list. The newly created network objects are initialized with BSSID set to all zeros. (The BSSID passed by the applet is actually saved in a different field, user_address).

3. A scan result comes in which includes a null-BSSID network (this will happen around mesh nodes, as mesh management frames have null-BSSID).

4. NM matches the scan results with the preferred network lists by BSSID. The properties of the null-BSSID network are assigned to all the networks in the preferred list. If the AP we are trying to associate has some form of security, association will fail.

We are working on a fix in NM.

in reply to: ↑ 16 ; follow-up: ↓ 19   Changed 5 years ago by jcardona

Replying to jcardona:

We are working on a fix in NM.

In fact, it is possible to fix this in the driver by eliminating networks with all-zero BSSID in scan results. This is implemented in this patch: http://dev.laptop.org/~javier/patches/8799-patches/0006-libertas-Exclude-mesh-networks-from-main-interface.patch

But because the all-zero BSSIDs is a valid BSSID, I would recommend assigning an invalid BSSID (e.g. all ones) to uninitialized networks instead of all-zeros.

The patch that worked for us in NM was:

Index: src/nm-dbus-nmi.c
===================================================================
--- src/nm-dbus-nmi.c	(revision 4233)
+++ src/nm-dbus-nmi.c	(working copy)
@@ -458,6 +458,10 @@
 	/* Construct the new access point */
 	ap = nm_ap_new ();
 	nm_ap_set_essid (ap, essid);
+
+	struct ether_addr invalid_bssid = { {0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF} };
+	nm_ap_set_address (ap, &invalid_bssid);
+
 	nm_ap_set_security (ap, security);
 	g_object_unref (G_OBJECT (security));	/* set_security copies the object */

So either patch (or both) will address the problem we were seeing of Association Requests for WPA access points being issued with invalid security information.

While working on this ticket, we uncovered a number of other issues:

  1. redundant WPA key removals
  2. under some conditions the association work queue would be executed with nothing to do
  3. WPA key installation coupled with delayed association work queue
  4. wpa_supplicant clearing the BSSID would result in association requests to 'any' AP, adding unnecessary delays to WPA associations.

The patches posted here address these issues: http://dev.laptop.org/~javier/patches/8799-patches

With this patches things improve a lot: we are now always able to associate via the GUI, but sometimes it requires a few retries (more about this in a separate post).

  Changed 5 years ago by jcardona

The wpa_supplicant logs provide good information about what's going on. Let me document here the steps that we had to follow to get proper logs, as it's not straightforward.

wpa_supplicant is launched by NetworkManager, with hard coded parameters¹. In order to get logs, you must write a wrapper to set the right arguments:

# mv /usr/sbin/wpa_supplicant /usr/sbin/wpa_supplicant.real

Then create /usr/sbin/wpa_supplicant as:

#! /bin/bash

exec /usr/sbin/wpa_supplicant.real -g /var/run/wpa_supplicant-global -d -t \
        -f /var/log/wpa_supplicant.log

and make it executable

chmod a+x /usr/sbin/wpa_supplicant

Once you've done this, you should be able to monitor wpa_supplicant state transitions:

# tail -f /var/log/wpa_supplicant.log | grep State
1226112022.535501: State: DISCONNECTED -> ASSOCIATED
1226112022.541054: State: ASSOCIATED -> 4WAY_HANDSHAKE
1226112022.560642: State: 4WAY_HANDSHAKE -> 4WAY_HANDSHAKE
1226112022.562814: State: 4WAY_HANDSHAKE -> GROUP_HANDSHAKE
1226112022.575862: State: GROUP_HANDSHAKE -> GROUP_HANDSHAKE
1226112022.579278: State: GROUP_HANDSHAKE -> COMPLETED    # <-- this is good!
1226112028.712669: State: COMPLETED -> DISCONNECTED
...
1226112151.722583: State: DISCONNECTED -> SCANNING
1226112151.736786: State: SCANNING -> ASSOCIATING
1226112152.293060: State: ASSOCIATING -> 4WAY_HANDSHAKE
1226112152.309323: State: 4WAY_HANDSHAKE -> ASSOCIATED  # <-- this is bad!
1226112153.339174: State: ASSOCIATED -> DISCONNECTED

The second part of the log shows an instance where wpa_supplicant is notified that an association request succeeded while the 4WAY_HANDSHAKE was already under way.

We still do not have a solution for this.

--

[1] Apparently creating a file called /etc/NetworkManager/log-supplicant instructs NM to save the wpa_supplicant logs. But that doesn't work: we see wpa_supplicant invoked with the '-f' option, but without the required filename parameter.

in reply to: ↑ 17   Changed 5 years ago by mbletsas

  • cc deepak added

Deepak,

Can we get a test kernel with Javier's collection of patches?

Thanks,

M.

While working on this ticket, we uncovered a number of other issues: 1. redundant WPA key removals 2. under some conditions the association work queue would be executed with nothing to do 3. WPA key installation coupled with delayed association work queue 4. wpa_supplicant clearing the BSSID would result in association requests to 'any' AP, adding unnecessary delays to WPA associations. The patches posted here address these issues: http://dev.laptop.org/~javier/patches/8799-patches With this patches things improve a lot: we are now always able to associate via the GUI, but sometimes it requires a few retries (more about this in a separate post).

  Changed 5 years ago by mbletsas

  • cc dsaxena added; deepak removed

  Changed 5 years ago by dsaxena

I've pushed the patches to testing repo and we should see a build soon. Commit ID 011dd3f2d6fbd75ae2aacdcf8f99ebeddf814227. Javier, there are two 0004 and 0005 patches in that directory and I took the ones dated a few minutes later.

follow-up: ↓ 23   Changed 5 years ago by jcardona

These are the remaining failure modes that we currently see:

FM#1: keys do not get installed, AP sends deauth frame, firmware triggers deauth event (8) FM#2: sta sends deauth frame to the AP, reason code "STA leaving" (3) FM#3: there is no association command ever sent to the firmware (as if AP did not show up in scan results, but the AP appears in the scan logs) FM#4: transition from 4-WAY_HANDSHAKE -> ASSOCIATED

This is with firmware version 5.110.22.p22 and latest driver from testing branch. As previously reported, these failures are only observed when associating via the GUI to a WAP AP.

One hypothesis: we see scans taking place while associating. The association command is handled atomically by the firmware, but not the WPA authentication. If we scan in other channels during WPA authentication, we could miss EAPOL frames, which would have the same symptoms as FM#1 and possibly FM#2.

in reply to: ↑ 22   Changed 5 years ago by jcardona

Replying to jcardona:

One hypothesis: we see scans taking place while associating. The association command is handled atomically by the firmware, but not the WPA authentication. If we scan in other channels during WPA authentication, we could miss EAPOL frames, which would have the same symptoms as FM#1 and possibly FM#2.

Ruled out the above hypothesis by canceling queued up scan requests before associating. This did stop the scans that were taking place immediately after association, but did not improve the success rate.

  Changed 5 years ago by jcardona

I believe we've identified one more issue that is responsible for wpa_supplicant aborting associations.

The driver notifies the upper layers of certain MAC events, in particular, when association is lost. On some failures we see two such notifications from the driver, the second one resulting in wpa_supplicant aborting the association. This is how it looks from the driver side:

[362855.108231] libertas cmd: DNLD_CMD: command DEAUTH, seq 98, size 16, jiffies 36144169
[362855.112841]  Send DISCONNECT Event to Supplicant!! (response to deauth command)
[362856.096990] libertas cmd: DNLD_CMD: command SET CHANNEL, seq 99, size 48, jiffies 36144270
[362856.098844] libertas cmd: DNLD_CMD: command RSN ON-OFF, seq 101, size 12, jiffies 36144270
[362856.109423] libertas cmd: DNLD_CMD: command RSN ON-OFF, seq 102, size 12, jiffies 36144272
[362856.110431] libertas cmd: DNLD_CMD: command AUTHENTICATE, seq 103, size 25, jiffies 36144272
[362856.111397] libertas cmd: DNLD_CMD: command RADIO ON, seq 104, size 12, jiffies 36144272
[362856.112321] libertas cmd: DNLD_CMD: command ASSOCIATE, seq 105, size 92, jiffies 36144272
[362856.225899]  Send DISCONNECT Event to Supplicant!!  (firmware event)
[362857.347657] libertas cmd: DNLD_CMD: command SCAN, seq 110, size 63, jiffies 36144400
[362858.050860] libertas cmd: DNLD_CMD: command SCAN, seq 111, size 63, jiffies 36144475
[362858.717535] libertas cmd: DNLD_CMD: command SCAN, seq 114, size 63, jiffies 36144550

Why would the driver send a second event if we are already disconnected? Turns out the function that is responsible for sending the notification waits for one second before doing so (see code below). If the disconnect event from the firmware arrives in that window, two event notifications will be sent, one second apart from each other. (See the two 'Send DISCONNECT...' lines in the log above)

--- a/drivers/net/wireless/libertas/cmdresp.c
+++ b/drivers/net/wireless/libertas/cmdresp.c
@@ -31,18 +31,10 @@ void lbs_mac_event_disconnected(struct lbs_private *priv)
                return;
 
        lbs_deb_enter(LBS_DEB_ASSOC);
 
        memset(wrqu.ap_addr.sa_data, 0x00, ETH_ALEN);
        wrqu.ap_addr.sa_family = ARPHRD_ETHER;
 
-       /*
-        * Cisco AP sends EAP failure and de-auth in less than 0.5 ms.
-        * It causes problem in the Supplicant
-        */
-
-       msleep_interruptible(1000);
-       wireless_send_event(priv->dev, SIOCGIWAP, &wrqu, NULL);
-
        /* report disconnect to upper layer */
        netif_stop_queue(priv->dev);
        netif_carrier_off(priv->dev);
@@ -62,6 +55,8 @@ void lbs_mac_event_disconnected(struct lbs_private *priv)
        priv->numSNRNF = 0;
        priv->connect_status = LBS_DISCONNECTED;
 
+       wireless_send_event(priv->dev, SIOCGIWAP, &wrqu, NULL);
+
        /* Clear out associated SSID and BSSID since connection is
         * no longer valid.
         */

Fixing this does not resolve all the WPA failures but at least we don't see the double notification anymore.

  Changed 5 years ago by mstone-xmlrpc

  • keywords cjbfor9.1.0 added
  • milestone changed from 8.2.1 to 9.1.0

Pushing out to 9.1.0, per edmcnierney's request.

  Changed 5 years ago by cscott

  • keywords 8.2.1:+ added
  • milestone changed from 9.1.0 to 8.2.1

According to Michalis, this bug should be included in 8.2.1.

  Changed 5 years ago by cscott

Michalis has sent a kernel and firmware that claim to address this via email; that information should be added here. See also #8667.

  Changed 5 years ago by joe

  • cc mstone added

  Changed 5 years ago by joe

  • cc joe added

follow-up: ↓ 31   Changed 5 years ago by dsaxena

All the fixes for #8799, #8667, and #9048 should be in the following RPM, that also includes the Fedora logo update (#8767), psmouse tweaks (#9008), and some minor configuration and build changes that do not impact runtime.

http://dev.laptop.org/~dilinger/testing/kernel-2.6.25-20081217.24.olpc.31e049fce071030.i586.rpm

in reply to: ↑ 30   Changed 5 years ago by joe

Replying to dsaxena:

All the fixes for #8799, #8667, and #9048 should be in the following RPM, that also includes the Fedora logo update (#8767), psmouse tweaks (#9008), and some minor configuration and build changes that do not impact runtime. http://dev.laptop.org/~dilinger/testing/kernel-2.6.25-20081217.24.olpc.31e049fce071030.i586.rpm


Is it the same RPM as http://dev.laptop.org/~dsaxena/kernels/test/kernel-2.6.25-20081208.1.olpc.txpowertest.i586.rpm ?

  Changed 5 years ago by cjb

  • next_action changed from communicate to test in release

  Changed 5 years ago by cscott

  • next_action changed from test in release to add to build

Please put the appropriate RPMs in yr ~/public_rpms/staging on dev. (Also in joyride, if appropriate.)

  Changed 5 years ago by cjb

  • next_action changed from add to build to test in build

Verified that kernel-2.6.25-20081217.24 is present in staging-9.

  Changed 2 years ago by greenfeld

  • status changed from new to closed
  • next_action changed from test in build to no action
  • resolution set to fixed

I have tested 10.1.3 and 11.2/3.0 with WPA-PSK and WPA2-PSK, and noticed no encryption-related association issues on any XO platforms.

(To the best of my knowledge, we don't support modes other than PSK in Sugar).

Note: See TracTickets for help on using tickets.