Ticket #10694 (closed defect: fixed)

Opened 3 years ago

Last modified 2 months ago

After XO-1 sleep, there is a 3-4min delay before wireless access points appear

Reported by: erikos Owned by: dsd
Priority: normal Milestone: Future Release
Component: network manager Version: Development build as of this date
Keywords: Cc: sascha_silbe
Action Needed: no action Verified: no
Deployments affected: Blocked By:
Blocking:

Description

See the detailed description at: http://bugs.sugarlabs.org/ticket/2421

Attachments

lbs-scan-assoc-wait.patch (3.0 kB) - added by dsd 3 years ago.

Change History

Changed 3 years ago by dsd

I can't reproduce this, can you?

Changed 3 years ago by dsd

I moved house a few times and now I can reproduce this!

This is quite a tangled issue. And it's racy, so it doesn't *always* happen. You are also less likely to see it if your AP is on a channel lower than 5.

When the system goes into suspend, the device goes away as the hardware gets powered down. However NetworkManager does not get to act on this until system resume time.

The system resumes, the new network device appears and NetworkManager immediately kicks off a scan.

Then, NM and wpa_supplicant take care of the old device going away. NM tells wpa_supplicant to remove the interface, and wpa_supplicant tries to disconnect and stuff. We get to:

static void wpa_driver_wext_disconnect(struct wpa_driver_wext_data *drv)
{
	const u8 null_bssid[ETH_ALEN] = { 0, 0, 0, 0, 0, 0 };
	u8 ssid[32];
	int i;

	/*
	 * Clear the BSSID selection and set a random SSID to make sure the
	 * driver will not be trying to associate with something even if it
	 * does not understand SIOCSIWMLME commands (or tries to associate
	 * automatically after deauth/disassoc).
	 */
	wpa_driver_wext_set_bssid(drv, null_bssid);

	for (i = 0; i < 32; i++)
		ssid[i] = rand() & 0xFF;
	wpa_driver_wext_set_ssid(drv, ssid, 32);
}

Unfortunately, wext is quite badly designed here. The wext ioctls take a string to identify which network interface you want to act upon, e.g. "eth0".

wpa_supplicant is trying to disconnect the eth0 that went away as we went into suspend, so we would want these operations to be a no-op or an error (they act on a dead device). But by this point eth0 has gone away and come back as a new device, so the operations take effect on the new eth0.

As the function shows, to disconnect, wpa_supplicant asks for association to a random SSID. This request reaches libertas, which decides to abort the current all-channel all-SSID scan (which by this time is only partially complete and has usually covered only about one third of the channels), and replace it with an SSID-specific scan for the random SSID that wpa_supplicant requested. This SSID-specific scan completes with no results (surprise surprise), so the driver then presents the results of the first scan, which are incomplete. The next scan won't happen for a few minutes, leading to this experience where most APs only appear after waiting a while.
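The interrupted scan described above can be modelled in a few lines. This is only an illustrative sketch, not libertas code: the struct, the function names and the choice of 11 channels are all assumptions made for the example.

```c
#include <string.h>

#define NUM_CHANNELS 11	/* assumption: 2.4 GHz channels 1..11 */

/* Hypothetical model of the driver's scan state. */
struct fake_scanner {
	int next_channel;		/* next channel to scan, 0 = idle */
	int seen[NUM_CHANNELS + 1];	/* seen[ch] = 1 once ch was scanned */
};

/* Begin an all-channel scan, discarding old results. */
static void full_scan_start(struct fake_scanner *s)
{
	memset(s, 0, sizeof(*s));
	s->next_channel = 1;
}

/* Scan one more channel, if a full scan is still running. */
static void full_scan_step(struct fake_scanner *s)
{
	if (s->next_channel == 0)
		return;
	s->seen[s->next_channel] = 1;
	if (++s->next_channel > NUM_CHANNELS)
		s->next_channel = 0;
}

/*
 * Association to a specific SSID aborts the running full scan.
 * The SSID-specific scan for a random SSID finds nothing, so the
 * partial results gathered so far are all that userspace will see.
 */
static void assoc_request(struct fake_scanner *s)
{
	s->next_channel = 0;
}

/* An AP shows up only if its channel was scanned before the abort. */
static int ap_visible(const struct fake_scanner *s, int ap_channel)
{
	return ap_channel >= 1 && ap_channel <= NUM_CHANNELS &&
	       s->seen[ap_channel];
}

/*
 * Run a full scan for `steps` channels, then let the stray
 * association request arrive. Returns 1 if an AP on `ap_channel`
 * is in the reported results, 0 otherwise.
 */
int demo_interrupted_scan(int steps, int ap_channel)
{
	struct fake_scanner s;

	full_scan_start(&s);
	for (int i = 0; i < steps; i++)
		full_scan_step(&s);
	assoc_request(&s);
	return ap_visible(&s, ap_channel);
}
```

In this model, an abort after four channels still reports an AP on channel 3 but misses one on channel 11, which matches the observation that APs on low channels are less affected.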

This problem would go away with F15+ and a newer kernel, where nl80211 would be used instead of wext inside wpa_supplicant. nl80211 sensibly uses ifindex, not interface name, to identify which device you want to talk to.
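To illustrate why addressing by name is fragile while addressing by ifindex is not, here is a small self-contained model. It does not use the real wext or nl80211 interfaces; the device table and every function name in it are hypothetical, and it assumes (as the kernel normally does) that a re-created device gets a fresh ifindex.

```c
#include <string.h>

#define MAX_DEVS 8

/* Hypothetical stand-in for a kernel network device. */
struct fake_dev {
	int in_use;
	int ifindex;		/* never reused in this model */
	char name[16];		/* reused when a device comes back */
	int ssid_requests;	/* "set SSID" requests that landed here */
};

static struct fake_dev devs[MAX_DEVS];
static int next_ifindex = 1;

static struct fake_dev *dev_by_name(const char *name)
{
	for (int i = 0; i < MAX_DEVS; i++)
		if (devs[i].in_use && strcmp(devs[i].name, name) == 0)
			return &devs[i];
	return NULL;
}

static struct fake_dev *dev_by_index(int ifindex)
{
	for (int i = 0; i < MAX_DEVS; i++)
		if (devs[i].in_use && devs[i].ifindex == ifindex)
			return &devs[i];
	return NULL;
}

/* Register a device; returns its new ifindex, or -1 if the table is full. */
static int dev_add(const char *name)
{
	for (int i = 0; i < MAX_DEVS; i++) {
		if (devs[i].in_use)
			continue;
		memset(&devs[i], 0, sizeof(devs[i]));
		devs[i].in_use = 1;
		devs[i].ifindex = next_ifindex++;
		strncpy(devs[i].name, name, sizeof(devs[i].name) - 1);
		return devs[i].ifindex;
	}
	return -1;
}

static void dev_remove(int ifindex)
{
	struct fake_dev *d = dev_by_index(ifindex);

	if (d)
		d->in_use = 0;
}

/* wext-style request: the target is looked up by name. */
static int set_ssid_by_name(const char *name)
{
	struct fake_dev *d = dev_by_name(name);

	if (!d)
		return -1;
	d->ssid_requests++;
	return 0;
}

/* nl80211-style request: the target is looked up by ifindex. */
static int set_ssid_by_index(int ifindex)
{
	struct fake_dev *d = dev_by_index(ifindex);

	if (!d)
		return -1;
	d->ssid_requests++;
	return 0;
}

/*
 * Replay suspend/resume, then send the stale disconnect request that
 * was meant for the old eth0. Returns how many requests reached the
 * NEW eth0.
 */
int demo_stale_request(int by_name)
{
	memset(devs, 0, sizeof(devs));
	next_ifindex = 1;

	int old_index = dev_add("eth0");	/* device before suspend */
	dev_remove(old_index);			/* hardware powered down */
	int new_index = dev_add("eth0");	/* device after resume */

	if (by_name)
		set_ssid_by_name("eth0");
	else
		set_ssid_by_index(old_index);

	return dev_by_index(new_index)->ssid_requests;
}
```

With name-based lookup the stale request lands on the new device; with index-based lookup it fails harmlessly because the old ifindex no longer exists.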

Nevertheless, I'll see if I can untangle this inside of libertas.

Changed 3 years ago by erikos

Thanks a lot Daniel for debugging this in that detail, great work!

Changed 3 years ago by dsd

In the upstream kernel, this is already working well. There are some corner cases that are not handled well when a scan is started and association is started immediately afterwards, but this bug is not present, and the use of nl80211 avoids the odd behaviour described above anyway. I'll work on making this behave perfectly upstream.

For olpc-2.6.35, this is hard to fix. I'm attaching my attempt at making association defer itself until ongoing scans have finished. The problem is that association inherently involves scanning too, and the get_scan handler won't send any results to userspace while a scan (or an association) is ongoing. Userspace gives up waiting for the scan results. This is fixed upstream thanks to cfg80211, which has a nicer separation of these concepts. It would be a lot of effort to fix this in 2.6.35, so I think I'll probably push it off to a later cycle (where we'll have a new kernel).
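The difficulty with deferring association can be sketched as follows. This is a toy model and not the attached patch: the states, the tick-based timing, the use of -EAGAIN as the "busy" reply and the fixed number of userspace polls are all assumptions chosen to show the shape of the problem.

```c
#include <errno.h>

enum fake_state { FAKE_IDLE, FAKE_SCANNING, FAKE_ASSOCIATING };

/* Hypothetical model of the radio; time advances in abstract ticks. */
struct fake_radio {
	enum fake_state state;
	int work_left;		/* ticks until the current operation ends */
	int assoc_pending;	/* ticks of association deferred behind a scan */
};

static void radio_start_scan(struct fake_radio *r, int ticks)
{
	r->state = ticks > 0 ? FAKE_SCANNING : FAKE_IDLE;
	r->work_left = ticks > 0 ? ticks : 0;
	r->assoc_pending = 0;
}

/* Defer association while a scan is running, otherwise start it now. */
static void radio_request_assoc(struct fake_radio *r, int ticks)
{
	if (r->state == FAKE_SCANNING) {
		r->assoc_pending = ticks;
	} else {
		r->state = FAKE_ASSOCIATING;
		r->work_left = ticks;
	}
}

/* Advance time by one tick and handle completion of the current work. */
static void radio_tick(struct fake_radio *r)
{
	if (r->work_left == 0 || --r->work_left > 0)
		return;
	if (r->state == FAKE_SCANNING && r->assoc_pending > 0) {
		/* Scan finished: now run the deferred association. */
		r->state = FAKE_ASSOCIATING;
		r->work_left = r->assoc_pending;
		r->assoc_pending = 0;
	} else {
		r->state = FAKE_IDLE;
	}
}

/* Results are withheld while a scan or an association is ongoing. */
static int radio_get_scan(const struct fake_radio *r)
{
	return r->state == FAKE_IDLE ? 0 : -EAGAIN;
}

/*
 * Userspace polls for results at most max_polls times, one tick apart.
 * Returns the poll number on which results arrived, or -1 if userspace
 * gave up. assoc_ticks == 0 means no association was requested.
 */
int demo_deferred_assoc(int scan_ticks, int assoc_ticks, int max_polls)
{
	struct fake_radio r;

	radio_start_scan(&r, scan_ticks);
	if (assoc_ticks > 0)
		radio_request_assoc(&r, assoc_ticks);

	for (int poll = 1; poll <= max_polls; poll++) {
		if (radio_get_scan(&r) == 0)
			return poll;
		radio_tick(&r);
	}
	return -1;
}
```

In this model the scan itself now runs to completion, but the deferred association keeps the radio busy for longer than userspace is prepared to wait, so the complete results are never collected.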

Changed 3 years ago by dsd

  • milestone changed from 11.3.0 to Future Release

Decided that this isn't worth fixing for 2.6.35.

As mentioned above, the upstream kernel behaviour is much much improved here. A patch has just gone upstream, "libertas: scan behaviour consistency improvements", which fixes the only inconsistency I could find on that setup.

Changed 2 years ago by sascha_silbe

  • cc sascha_silbe added

Changed 2 months ago by Quozl

  • status changed from new to closed
  • next_action changed from diagnose to no action
  • resolution set to fixed

Review: #12757 is very likely to have been a cause of some of the behaviour in #10694. The patch described above is in our kernel now as cc02681. Closing #10694.
