Opened 6 years ago

Closed 4 years ago

Last modified 22 months ago

#9048 closed defect (fixed)

Wireless scanning causes network pauses

Reported by: johnf Owned by: dsaxena
Priority: high Milestone: 8.2.1
Component: wireless Version: not specified
Keywords: scan packet loss cjbfor9.1.0 Cc: dsaxena, jcardona, ashish, morgs, carrano, mtd, cjl
Blocked By: Blocking:
Deployments affected: au Action Needed: finalize
Verified: no

Description

Firmware: Q2E18
Build: 776

Packet loss occurs when the XO is performing wireless scans. This seems to happen automatically in the background but can also be replicated manually.

Start a fast ping

sudo ping -i 0.1 GATEWAY_IP

sudo iwlist eth0 scan

You will notice a drop of somewhere between 4-8 ICMP packets and the antennae light on the XO will flash.

While not normally noticable this causes problems for applications like VideoChat.

Attachments (4)

dmesg.656.out (121.9 KB) - added by morgs 6 years ago.
dmesg output on 656
dmesg.767.out (122.6 KB) - added by morgs 6 years ago.
dmesg output on 767
ping.767.out (5.0 KB) - added by morgs 6 years ago.
Output of the ping on 767 showing packet loss
0001-libertas-Update-Scanning-to-not-bring-down-interfac.patch (1.4 KB) - added by annaneal 6 years ago.

Download all attachments as: .zip

Change History (35)

comment:1 Changed 6 years ago by johnf

Temporary work around can be used to disable Network Manager wireless scanning

wpa_cli

ap_scan 0

comment:2 Changed 6 years ago by dsaxena

  • Cc dsaxena added

comment:3 Changed 6 years ago by mbletsas

  • Cc jcardona added

comment:4 Changed 6 years ago by ashish

  • Cc ashish added

comment:5 Changed 6 years ago by morgs

  • Cc morgs added

I tried this on 767 and did reproduce it. I tried it twice on 656 and could not reproduce it - no packets dropped - although I did see the ping times go up to around 200ms for several packets.

The 656 build has wireless firmware 5.110.20.p47, and the 767 build has 5.110.22.p18.

comment:6 Changed 6 years ago by ashish

Could you please attach dmesg log with 656 and 767 builds while you are reproducing this problem, with the following debug flag?
echo 0x206000 > /sys/module/libertas/parameter/libertas_debug

Changed 6 years ago by morgs

dmesg output on 656

Changed 6 years ago by morgs

dmesg output on 767

Changed 6 years ago by morgs

Output of the ping on 767 showing packet loss

comment:7 Changed 6 years ago by morgs

ashish: attached.

comment:8 Changed 6 years ago by jcardona

During scans the radio moves through all channels and so we should expect some losses in ingress traffic. There should be no losses for egress traffic, though, as the outgoing queue should be throttled by the driver/firmware.

Is it easy to retest and see where the losses are (e.g. by running tcpdump on the gateway)?

comment:9 follow-up: Changed 6 years ago by ashish

I believe during scan, firmware sends NULL frame with PM bit set to 1 to AP before moving off-channel to prevent frame loss. Therefore, I expect there should not be frame loss in ingress as well unless scan is taking too large and AP fails to buffer traffic.

One difference I could see in the attached dmesg log is the number of channels that are sent by the driver to firmware while doing progressive scan. In build 656, where it seems there is no frame loss, driver sends 1 or 2 channels per scan (command 0x6), whereas in build 767, driver sends 4-5 channels per scan.

comment:10 in reply to: ↑ 9 Changed 6 years ago by jcardona

Replying to ashish:

I believe during scan, firmware sends NULL frame with PM bit set to 1 to AP before moving off-channel to prevent frame loss. Therefore, I expect there should not be frame loss in ingress as well unless scan is taking too large and AP fails to buffer traffic.

Thanks Ashish. We've confirmed that with the sniffer, and yes, we see no loss of ingress traffic. Of course this only applies to infra traffic, as there is no way to prevent ingress losses on the mesh interface while scanning.

One difference I could see in the attached dmesg log is the number of channels that are sent
by the driver to firmware while doing progressive scan. In build 656, where it seems there is > no frame loss, driver sends 1 or 2 channels per scan (command 0x6), whereas in build 767,
driver sends 4-5 channels per scan.

The driver stops the interface queues before scanning. The purpose of that is to inform the networking stack that sits above the driver cannot currently process new packets. It would be
entirely valid behavior if the IP stack just dropped those packets: after all this is just best effort delivery.

We'll try to confirm that the packets are indeed dropped (and not buffered) at the network stack layer.

comment:11 in reply to: ↑ description Changed 6 years ago by johnf

I've just tested on build 767 with the firmware from 656 ie firmware 5.110.20.p47 and the issue is still there. So maybe this narrows it down to a driver issue

comment:12 Changed 6 years ago by johnf

Joyride 2567 also has the same problem

comment:13 Changed 6 years ago by greebo

Found that normally the XOs drop 4 packets every 130 seconds. This isn't a problem for most of the XO applications, however when running videochat between two XOs you get a complete dropout of audio/video everytime either laptops scans. This isn't a problem on an AP network when scanning is turned off.

When connecting XOs to a normal access point, you can stop the scans by running "wpa_cli ap_scan 0" (however if you put ap_scan=0 into /etc/wpa_supplicant/wpa_supplicant.conf it doesn't work).

However, this doesn't work when laptops are connecting to an Active Antennae. We could not determine how to stop the scanning when an AA is in use.

Although videchat isn't widely used atm, it is an extremely important tool for distance education, connecting classrooms, delivering special education to children without local resources and much more.

comment:14 Changed 6 years ago by greebo

Please note that ultimately the scanning process should drop packets (it doesn't on any of the other non-XO machines we've looked at) so ultimately this needs to be fixed rather than worked around :)

comment:15 Changed 6 years ago by carrano

  • Cc carrano added

As already noted, there is loss of egress traffic while the interface is scanning.

As expected, in traffic captures the loss can be associated to the sending of the two null frames (type/subtype 0x24) - the one with PWR MGT flag set (going to sleep) and the one with the flag unset (coming back) - that "protect" the scanning period (but only for ingress traffic).

The typical interval between these two frames is 0.45 seconds, hence the 4 to 5 frames lost in ping -i .1 <...>.

The pattern for the null frames is very consistent, two groups separated by 10 seconds, each of these groups formed of 4 pairs of null frames (going to sleep and returning), each pair separated by 0.75 seconds, and with sleep time of 450ms (but the last one, which lasts for 120ms). The pattern is:

0.00 (450ms)

0.75 (450ms)

1.50 (450ms)

2.25 (120ms)

(repeat after 10 seconds)

This is consistent with the scanning commands sent to the firmware. In each group, four scanning commands are sent to the firmware. For the first three commands, four channels are scanned (hence the 450ms), and one channel is scanned in the last command (hence the shorter period of 120ms), accounting for 13 channels.

After 110 seconds the process restarts, because NetworkManager demands.

It is interesting to note that the losses does not occur at every time the interface scans. (if this was the case, at every 110 seconds the losses would be of about 29 to 30 packets, not 4 to 5.

When there is scanning and no loss of egress traffic is noted there is high associated jitter in the icmp traffic, indicating that the frame was buffered for long enough (.5 seconds) before being sent.

This all points to insufficient buffer capabilities to keep frames during this .45 seconds *under high traffic conditions*.

It is not clear to me if (honest questions, there is probably a good reason):

  • we need NetworkManager to scan for the interface at every 110 seconds.
  • we need to do it twice for each request (separated by 10 seconds).
  • we need to scan 4 channels at a time (taking a long time to do so). Why we don't scan 4-3-3-3, channels, instead of 4-4-4-1, or maybe we should scan 3-3-3-2-2?

comment:16 follow-up: Changed 6 years ago by annaneal

I have found that before the scan starts the driver turns off the carrier as
well as stops the transmit queue.

As a quick test I modified the code to leave the carrier on through out the scan and tried
the ping test again. The ping test with the iwlist scan in the background only looses
one packet on every other test instead of 8 packets consistently.

I'm currently investigating why and if turning the carrier off during the scan
is necessary.

comment:17 in reply to: ↑ 16 Changed 6 years ago by mbletsas

I'm currently investigating why and if turning the carrier off during the scan
is necessary.

Better receiver sensitivity.

M.

comment:18 Changed 6 years ago by annaneal

I have prepared a patch.

Upon investigation I found that netif_carrier_off/on is causing the
interface to be brought down during each scan. It is not necessary to bring
down the interface to stop it from transmitting during a scan. The
netif_stop_queue command is used for this. The netif_wake_queue command will
need to be called to make sure the queue wakes up and transmits anything that
has been waiting.

Removing the netif_carrier_off/on calls will leave the packet buffering up to
the kernel. This means that if the buffer fills up, some packets may still be
lost. Because the scan takes a short amount of time this should not cause too
much interruption.

In addition, when looking at other device drivers and the mac80211
stack the carrier is never turned off during a scan.

After removing the netif_carrier_off/on calls in the scan routine tests show
that only one packet is lost at every other scan. This is in contrast to a
consistent 8 packet loss on every scan.

I have attached a patch to this ticket. In this patch I removed the
carrier_off/on calls.

Anna

comment:19 Changed 6 years ago by ashish

I have a few comments.

  1. 767 build driver sends around 4 channels per scan command, which increases scan latencies. I guess, if you reduce this to 2/3 channel per scan it may further improve.
  2. The active scan spends around 100ms on each channel. I think 100ms time is too large for active scan and I guess smaller number e.g., 40ms should be fine.

comment:20 Changed 6 years ago by dsaxena

I will commit Anna's patch to our testing kernel repository so we can get some RPMs for widespread testing.

comment:21 Changed 6 years ago by mstone-xmlrpc

  • Keywords cjbfor9.1.0 added
  • Milestone changed from 8.2.1 to 9.1.0

Pushing out to 9.1.0, per edmcnierney's request.

comment:22 follow-up: Changed 6 years ago by cjb

  • Milestone changed from 9.1.0 to 8.2.1
  • Owner changed from mbletsas to dsaxena

I think this is going to end up in 8.2.1 by virtue of being pulled in with dsaxena's latest wireless changes. Deepak, does that sound right? Should I retarget this bug to 8.2.1, then?

comment:23 in reply to: ↑ 22 Changed 6 years ago by dsaxena

Replying to cjb:

I think this is going to end up in 8.2.1 by virtue of being pulled in with dsaxena's latest wireless changes. Deepak, does that sound right? Should I retarget this bug to 8.2.1, then?

ACK.

All the fixes for #8799, #8667, and #9048 should be in the following RPM, that also includes the Fedora logo update (#8767), psmouse tweaks (#9008), and some minor configuration and build changes that do not impact runtime.

http://dev.laptop.org/~dilinger/testing/kernel-2.6.25-20081217.24.olpc.31e049fce071030.i586.rpm

comment:24 Changed 6 years ago by cjb

  • Action Needed changed from never set to add to release

comment:25 Changed 6 years ago by cscott

  • Action Needed changed from add to release to add to build

Please put the appropriate RPMs in yr ~/public_rpms/staging on dev. (Also in joyride, if appropriate.)

comment:26 Changed 6 years ago by cjb

  • Action Needed changed from add to build to test in build

Verified that kernel-2.6.25-20081217.24 is present in staging-9.

comment:27 Changed 6 years ago by dsd

  • Action Needed changed from test in build to finalize

Confirmed fixed in staging-9. Ran the fast ping on 1 console, and a "while true; do sleep 0.5; iwlist eth0 scan" on another. I terminated the ping loop after 1503 iterations, and it reported 0% packet loss.

comment:28 Changed 6 years ago by mtd

  • Cc mtd added

comment:29 Changed 5 years ago by cjl

  • Cc cjl added

Per comment 23 by dsaxena and comment 27 by dsd, I think this ticket should probably be closed as fixed. Does anyone have any more recent input?

comment:30 Changed 4 years ago by dsd

  • Resolution set to fixed
  • Status changed from new to closed

comment:31 Changed 22 months ago by sridhar

  • Deployments affected changed from Australia to au
Note: See TracTickets for help on using tickets.