Ticket #1005 (closed defect: fixed)

Opened 8 years ago

Last modified 8 years ago

Ping failure in a 13-node mesh (ad-hoc) scenario.

Reported by: rchokshi Owned by: jcardona
Priority: blocker Milestone: Trial-1
Component: wireless Version:
Keywords: Cc: javier@…, Marcelo@…, luisca, mbletsas, GR-Wireless-OLPC@…
Action Needed: Verified: yes
Deployments affected: Blocked By:
Blocking:

Description

1. UUT info:

Hardware driver: libertas, version COMM-USB8388-320.p0-dbg; firmware version: 5.110.9.p10; Bus-info: Linux Fedora

2. Test bed info:

Four OLPC Laptops, Netgear AP

3. Test environment info:

Open air and shielded room

4. Bug reproduction info

Start one OLPC mesh node with a fixed SSID and check the SSID in the beacon. Configure the other twelve OLPCs in ad-hoc mode so that they join the same BSSID, and ensure that all the mesh points can ping each other. Pings from multiple (6 or 7) laptops fail within an hour.
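
For reference, a minimal sketch of this setup as shell commands, assuming the msh0 interface name and the mshy1..mshy13 hostnames that appear later in this ticket (the essid is arbitrary but must match on all nodes):

  # join the mesh (run on every node), mirroring the iwconfig command quoted later in this ticket
  iwconfig msh0 mode ad-hoc essid test-mesh
  # verify that every other node answers pings
  for i in `seq 1 13`; do ping -q -c 5 mshy$i | grep 'packet loss'; done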

Attachments

host_unreachable_mshy5.txt (10.7 KB) - added by jcardona 8 years ago.
State dump at one of the hosts where ping failed.
host_unreachable_mshy7.txt (9.2 KB) - added by jcardona 8 years ago.
State dump at the second host where ping failed.
1005-test-capture.png (49.7 KB) - added by jcardona 8 years ago.

Change History

  Changed 8 years ago by jcardona

  • owner changed from rchokshi to jcardona

Is this temporary or permanent?

What is the error message if any?

Is 13 the minimum number of laptops required to see this problem?

  Changed 8 years ago by jcardona

  • cc swagle@… added
  • status changed from new to assigned

-- Based on a conversation with Subir, at Marvell's QA lab. --

The exact scenario in which this bug has been observed is as follows:

Machines 1 to 13, all within direct range, no blinding.

1. Start the following pings (a sketch of how one node's pings could be started follows the list):

1 -> 2
2 -> 1,3
3 -> 1,4
4 -> 1
5 -> 12
6 -> 12
7 -> (none)
8 -> 9
9 -> (none)
10 -> 9
11 -> 8,12
12 -> (none)
13 -> (none)
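
A sketch of how the long-running pings could be started on one node, e.g. machine 2, assuming the mshy<N> hostnames used elsewhere in this ticket:

  # machine 2 pings machines 1 and 3 continuously, logging the output
  ping mshy1 > ping_mshy1.log 2>&1 &
  ping mshy3 > ping_mshy3.log 2>&1 &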

2. After 12 hours, ping was failing on machines 1, 2, 3 and 4. The error seen was first

Destination Host Unreachable

then

ping: No buffer space available

All were running firmware 5.220.9p10, but machines 5, 6, 8, 9, 11 and 12 had an older kernel (2.6.19) without the fix for ticket #915 (Subir was hit by ticket #991).

Suggesting a retest with the latest kernel.

  Changed 8 years ago by jcardona

The Destination Host Unreachable error appears to be a temporary problem: retrying the same node after some time works. Also, the UUT can still ping other nodes when one destination becomes unreachable.

This seems to depend on (1) the dense radio environment or (2) the number of nodes in the mesh. The fact that this has also been observed on the OLPC testbed (13 nodes) during daytime seems to indicate that (1) is a necessary condition to trigger this behavior. Suggesting a retest in a shielded environment with a large mesh to confirm this.

  Changed 8 years ago by jcardona

  • cc luisca, mbletsas added

New tests are planned. The goal is to determine whether the mesh protocol is too sensitive to a noisy environment or whether the problem is related to mesh size. These tests are:

  1. repeat the test with a 13-node mesh in a shielded environment.
  2. repeat the test with a 5-node mesh in a noisy environment.

For test 2, we are modifying the blinding table so that it can be negated, i.e. so that a node only accepts traffic from the hosts in the bt_list. This will allow us to easily create small, isolated mesh networks on a larger test bed.
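
A hypothetical sketch of how a negated blinding table could be used to carve out a 5-node sub-mesh; bt_add.sh is an imagined helper in the style of the fwt_list.sh script used later in this ticket (not an existing tool), and the MAC addresses are placeholders:

  # on each of the five nodes, whitelist only the other four peers; with the
  # negated table, traffic from every other host on the test bed is dropped
  for mac in 00:17:c4:xx:xx:01 00:17:c4:xx:xx:02 00:17:c4:xx:xx:03 00:17:c4:xx:xx:04; do
      ./bt_add.sh msh0 $mac    # hypothetical helper script
  done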

Changed 8 years ago by jcardona

State dump at one of the hosts where ping failed.

Changed 8 years ago by jcardona

State dump at the second host where ping failed.

follow-up: ↓ 10   Changed 8 years ago by jcardona

Able to reproduce this at OLPC testbed at 7:30 PM Boston time (can that be considered a noisy hour?).

Captured the state on the two nodes that could not ping each other. Confirmed that pinging other nodes still works. The nodes that could not ping each other were mshy5 and mshy7. Looking at the forwarding tables, one can see that there is no direct route from mshy7 to mshy5. Adding a direct route manually restored the pings.

From mshy5 (10.2.51.50) icmp_seq=4427 Destination Host Unreachable
From mshy5 (10.2.51.50) icmp_seq=4429 Destination Host Unreachable
From mshy5 (10.2.51.50) icmp_seq=4430 Destination Host Unreachable
From mshy5 (10.2.51.50) icmp_seq=4431 Destination Host Unreachable
64 bytes from mshy7 (10.2.48.17): icmp_seq=4432 ttl=64 time=1772 ms
64 bytes from mshy7 (10.2.48.17): icmp_seq=4433 ttl=64 time=772 ms
64 bytes from mshy7 (10.2.48.17): icmp_seq=4434 ttl=64 time=14.8 ms
64 bytes from mshy7 (10.2.48.17): icmp_seq=4435 ttl=64 time=7.24 ms

Resetting the forwarding table at meshy7. Able to ping mshy1 (route rediscovered), but not mshy5.

follow-up: ↓ 7   Changed 8 years ago by jcardona

luisca has a good theory for this:

0. Consider that the lowest rate at which RREQs are sent is 11 Mbps.

If

  1. a link degrades to the point that it only works at R Mbps < 11 Mbps

AND

  2. at some point in time, a frame over that link fails to be transmitted

then, because of (2), the route is deleted; and because of (0) and (1), the route is never re-discovered. Adding the route manually will restore the link, because rate adaptation will lower the tx rate to R.

Suggested fix: modify route discovery to generate RREQs that go as low as 1 Mbps.

in reply to: ↑ 6 ; follow-ups: ↓ 8 ↓ 12   Changed 8 years ago by mbletsas

This is a good theory and we should make sure that we can control the transmission rate. Also, given that RREQs don't get ack'ed, it is a good idea in principle to transmit them at lower rates (I thought that was the case to begin with).

However, I don't believe it is the only source of our problems.

Here's my experience today using meshy9 and meshy10, which are across the room from each other (4m apart).

I was doing scp transfers between them (successfully). After a few idle minutes, I went back and couldn't ping mshy9 from mshy10. I checked the forwarding table and there was a direct route from mshy10 to mshy9, but I still couldn't ping it.

M.

in reply to: ↑ 7 ; follow-up: ↓ 9   Changed 8 years ago by mbletsas

I was doing scp transfers between them (successfully). After a few idle minutes, I went back and couldn't ping mshy9 from mshy10. I checked the forwarding table and there was a direct route from mshy10 to mshy9, but I still couldn't ping it.

I guess I should have checked for a route from mshy9 to mshy10 also...

M.

in reply to: ↑ 8   Changed 8 years ago by mbletsas

I guess I should have checked for a route from mshy9 to mshy10 also...

There are routes, am I missing something?

M.

[root@mshy10 ~]# ./fwt_list.sh msh0|grep -i 00:17:C4:01:19:2A
00:17:c4:02:30:10  00:17:c4:01:19:2a    28    0    487      0      3           7   159807340    0   66  
00:17:c4:02:30:11  00:17:c4:01:19:2a    48    0  10058      0      2           8   160393040    0   66  
00:17:c4:02:30:11  00:17:c4:01:19:2a    41    1   1313   9950      2           0   159804640    0   53  
'''00:17:c4:01:19:2a  00:17:c4:01:19:2a    13    1   1319   1409      1           0   160353490    0   53'''  
00:17:c4:02:33:32  00:17:c4:01:19:2a    17    0    402      0      3           7   156665260    0   53  
00:17:c4:02:33:32  00:17:c4:01:19:2a    39    1   1314    407      3           0   159805740    0   53  
00:17:c4:01:19:43  00:17:c4:01:19:2a    28    0  30279      0      3           7   160403720    0   53  
00:17:c4:01:19:43  00:17:c4:01:19:2a    56    1   1288  26451      4           0   153062580    0   55  
00:17:c4:01:19:85  00:17:c4:01:19:2a    26    0  11867      0      2           8   160387210    0   55  
00:17:c4:01:18:88  00:17:c4:01:19:2a    28    0  24758      0      3           7   160240500    0   55  
00:17:c4:02:30:92  00:17:c4:01:19:2a    28    0    943      0      3           7   156966350    0   55  
00:17:c4:02:2f:af  00:17:c4:01:19:2a    45    1  11810  28990      4           0   158986530    0   55

[root@mshy9 ~]# ./fwt_list.sh msh0|grep -i 00:17:C4:02:2D:EE
00:17:c4:02:30:11  00:17:c4:02:2d:ee    17    0  10111      0      3           7   160564450    0   51  
00:17:c4:02:30:11  00:17:c4:02:2d:ee    41    1   1389    893      2           0   149949480    0   51  
00:17:c4:02:33:32  00:17:c4:02:2d:ee    28    0    401      0      3           7   156620520    0   51  
00:17:c4:01:19:43  00:17:c4:02:2d:ee    59    0  30473      0      2           8   160750220    0   51  
00:17:c4:01:19:85  00:17:c4:02:2d:ee    26    0  11810      0      2           8   158976640    0   51  
00:17:c4:01:18:88  00:17:c4:02:2d:ee    30    0  24975      0      4           6   160719470    0   51  
00:17:c4:02:2f:af  00:17:c4:02:2d:ee    43    1  11840  29028      3           0   159676210    0   50  
00:17:c4:02:2d:ee  00:17:c4:02:2d:ee    13    0   1319      0      1           9   160343830    0   50  
'''00:17:c4:02:2d:ee  00:17:c4:02:2d:ee    13    1    943   1306      1           0   156956720    0   50'''  
00:17:c4:02:2d:ee  00:17:c4:00:cc:b2    26    0   1172      0      2           8   149910830    0   48  
00:17:c4:02:2d:ee  00:17:c4:01:19:85    26    1   1395   1194      2           0   149958270    0   40  
[root@mshy9 ~]# ping mshy10
PING mshy10 (10.2.45.238) 56(84) bytes of data.
From mshy9 (10.1.25.42) icmp_seq=1 Destination Host Unreachable
From mshy9 (10.1.25.42) icmp_seq=2 Destination Host Unreachable
From mshy9 (10.1.25.42) icmp_seq=3 Destination Host Unreachable

--- mshy10 ping statistics ---
5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4020ms
, pipe 3

in reply to: ↑ 5 ; follow-up: ↓ 11   Changed 8 years ago by mbletsas

Replying to jcardona:

Able to reproduce this at OLPC testbed at 7:30 PM Boston time (can that be considered a noisy hour?).

7:30p is probably half way between very noisy (10am-5pm) and very quiet (midnight-8am)

M.

in reply to: ↑ 10   Changed 8 years ago by jcardona

Replying to mbletsas:

Replying to jcardona:

Able to reproduce this at OLPC testbed at 7:30 PM Boston time (can that be considered a noisy hour?).

7:30p is probably half way between very noisy (10am-5pm) and very quiet (midnight-8am) M.

Yes, there are routes in this case. There was no return route on the failure reported in my comment @3/07/07 20:05:50, as can be seen in the attached files. We are currently testing a new firmware build in which RREQs go all the way down to 1 Mbps.

in reply to: ↑ 7   Changed 8 years ago by jcardona

Replying to mbletsas:

This is a good theory and we should make sure that we can control the transmission rate. Also, given that RREQs don't get ack'ed, it is a good idea in principle to transmit them at lower rates (I thought that was the case to begin with).

RREQs are sent at decreasing rates. The best rate heard will be used as the data transfer rate for the link between the RREQ sender and recipient.

  Changed 8 years ago by jcardona

  • component changed from distro to wireless

  Changed 8 years ago by jcardona

We tested a version of the firmware (5.220.9.p254) that implements the change described in http://dev.laptop.org/ticket/1005#comment:6. After overnight testing on the OLPC testbed the logs looked very good: some brief instances of Destination Unreachable were logged, but all nodes were still pinging.

We ran the script below on all nodes, and they could all ping all other nodes. (The grep matches the ping summary line whenever packet loss is below 100%, so the number printed is the count of hosts that answered at least one ping.)

for i in `seq 1 14`; do ping -q -c 5 mshy$i; done | grep -e '[^0]0% packet loss' | wc -l
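
A slightly more verbose variant (a sketch, under the same assumptions about hostnames) that reports the loss for each host instead of only counting the reachable ones:

for i in `seq 1 14`; do
    # prints e.g. "mshy3: 0% packet loss"; 100% (or a missing line) means the host is unreachable
    echo -n "mshy$i: "; ping -q -c 5 mshy$i | grep -o '[0-9]*% packet loss'
done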

This unreleased firmware is installed on all the XOs on the OLPC testbed. The next step is to have Marvell's QA team retest.

Changed 8 years ago by jcardona

  Changed 8 years ago by swagle@…

We (Andy and myself) reproduced the same problem. We did not see much of a change. We did notice clients moving out of ad-hoc mode into managed mode this time, though. Out of 10, three did.

The easiest way of encountering the buffer overflow problem is to associate two nodes in mesh mode and then ping a non-existent IP in the same subnet. The buffer overflow problem should happen within 10 minutes or so.
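
A sketch of that repro, assuming both nodes are already joined on msh0 as described earlier; the target is just an example of an address that no node on the testbed owns:

  # route discovery to the non-existent host keeps failing; the
  # "No buffer space available" error shows up within roughly 10 minutes
  ping 10.2.0.250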

  Changed 8 years ago by luisca

Subir,

we need you to be more descriptive about your test results. Were all nodes failing to ping, or did only the three nodes that transitioned into managed mode stop pinging? In the latter case, this would be just a duplicate of bug #972: transitioning to managed mode will most likely change the RF channel and will prevent the nodes from communicating with the others. Setting a new non-existent ad-hoc essid should fix it. It is also possible that the nodes entered frozen mode (#914) or crashed (like #526, maybe due to a different cause), which are separate bugs too.

It would also be useful if you described what the nodes can or cannot do when they do not work: whether the node resolves IPs to MACs, whether other nodes can resolve the IP of the broken node, whether the ping succeeds if you manually add a direct route, and so on.
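
A sketch of those checks as shell commands, run from the broken node; the peer address and MAC below are examples taken from earlier in this ticket, and fwt_list.sh is the helper script already used above:

  ping -c 3 10.2.48.17                              # does the ping fail, and with which error?
  arp -n | grep 10.2.48.17                          # did the broken node resolve the peer's IP to a MAC?
  ./fwt_list.sh msh0 | grep -i 00:17:c4:01:19:2a    # is there a forwarding-table entry towards the peer?

Then repeat the ping from the other direction, and again after adding a direct route by hand, as was done for mshy7/mshy5 above.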

  Changed 8 years ago by swagle@…

Javier,

Only the three nodes transitioning into managed mode stopped pinging. These nodes displayed the error Destination Host Unreachable for a while and later on moved into the ping buffer overflow mode. From this point it is not possible to recover the node, and the only way is to restart the mesh mode on the node using the command

iwconfig msh0 mode ad-hoc essid a

One interesting point that I saw when I restarted the node is that it still fails to ping the node it was earlier pinging. I have to manually delete the ARP entry from the table, and only after that does the ping resume. The restarted node can ping any other node in the mesh, since the ARP entries for those nodes are not present.

I did not see the problem of the node staying in ad-hoc mode and failing on the ping, as we used to see earlier. I'm planning to increase the number of nodes and repeat this test. Please let me know if you need any other information.
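
For reference, the recovery sequence described above as a sketch; the peer IP is an example from earlier in this ticket, and the essid must match the mesh being rejoined:

  iwconfig msh0 mode ad-hoc essid a    # restart mesh mode on the stuck node
  arp -d 10.2.48.17                    # drop the stale ARP entry for the node it was pinging
  ping -c 3 10.2.48.17                 # pings resume once the entry has been re-resolved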

  Changed 8 years ago by rchokshi

  • cc GR-Wireless-OLPC@… added; swagle@… removed

  Changed 8 years ago by luisca

Subir,

in that case the problem you are having now is bug #972.

About the ARP problem, it sounds strange. Having the IP already resolved should do no harm in any case, as long as you are not changing IPs on the nodes. Even in that case, an incorrect ARP entry would time out and be refreshed. Maybe it was just a timing issue.

Anyway, I propose to close this bug, as the problems you see now are due to #972, and to keep working on figuring out what is triggering the transitions. Do you agree?

  Changed 8 years ago by jg

  • priority changed from normal to blocker
  • milestone changed from Untriaged to Trial-1

follow-up: ↓ 22   Changed 8 years ago by jg

Luisca? can this be closed?

in reply to: ↑ 21   Changed 8 years ago by luisca

Replying to jg:

Luisca? can this be closed?

The original bug was fixed, but Michail saw a similar problem in the testbed yesterday and we were able to reproduce it today. We are right now simultaneously testing a new firmware on two different setups, one with 18 XOs and the other with 10 XOs. Will close tomorrow if tests go smoothly overnight.

  Changed 8 years ago by luisca

The problem did not show up after 14h testing on both test beds. We will keep the tests running and close the bug at the end of the day if everything goes fine.

  Changed 8 years ago by luisca

Tests keep running fine and we did some tests to confirm that the bugfix really addresses the source of the problem (an unhandled error when running out of timers). Will make a new release within the next few hours.

  Changed 8 years ago by luisca

  • status changed from assigned to closed
  • resolution set to fixed

Fixed in 5.220.10p4.
