Ticket #5848 (new defect)

Opened 6 years ago

Last modified 6 years ago

losing WLAN circle after failing to connect to anything

Reported by: chihyu
Owned by: carrano
Priority: high
Milestone: 8.2.0 (was Update.2)
Component: wireless
Version:
Keywords: 8.2.0:?
Cc: kim, carrano, kimquirk, dsd, wad, mbletsas, yani, mtd
Action Needed: diagnose
Verified: no
Deployments affected:
Blocked By:
Blocking:

Description

build: joyride-1489, clean install

firmware: Q2D07

serial number: CSN7440007D

This is a follow up of 5847.

* when the machine was trying to connect to the school server, the WLAN circle showed up in the home view

* when the machine failed to connect to simple mesh, the grey circle disappeared completely; please see the attached images

Attachments

failed_simple_mesh.png (45.5 kB) - added by chihyu 6 years ago.
failed to connect to simple mesh
circle_gone.png (42.8 kB) - added by chihyu 6 years ago.
grey circle disappeared completely
messages (59.0 kB) - added by wad 6 years ago.
/var/log/messages from a machine that is missing its network circle.
messages2 (56.5 kB) - added by wad 6 years ago.
This is a log file from another laptop that failed to join a simple mesh after failing to find any mesh portal points.
706.connections.log (1.2 kB) - added by yani 6 years ago.
706.logs.CSN74400034.2008-05-28.23-01-44.tar.bz2 (66.7 kB) - added by yani 6 years ago.

Change History

Changed 6 years ago by chihyu

failed to connect to simple mesh

Changed 6 years ago by chihyu

grey circle disappeared completely

  Changed 6 years ago by dsd

  • cc dsd added

Same here on joyride 1516 using machine A05 in the testing area.

I had just updated from ship2.2 build 656 where things were working fine. I used the "autoreinstallation" USB update method, and held the square button to do a clean install. This seems to be a joyride regression.

Upon booting into the new joyride, the wireless circle comes and goes a few times (tries to connect to mesh 1, looks for a school server, tries mesh 6), and then disappears. Same every time on 4 reboots. In this state I obviously cannot share/join any activities or access the internet.

If I select one of the mesh channels on the neighbourhood view, it connects successfully to a "simple mesh" i.e. no internet access.

If I select the media lab 802.11 infrastructure network, it connects successfully and I can go online.

  Changed 6 years ago by dsd

Ah, apparently I misinterpreted this ticket. This bug report is about a UI bug where the grey circle completely disappears from home view -- apparently it should always be present even when there is no connectivity.

My connectivity problems obviously belong somewhere else, although they seem to be fixed in joyride 1520.

  Changed 6 years ago by dsd

...and the bug is still present in joyride 1526 -- earlier today, sugar got in a confused state and no WLAN circle appeared on the home screen.

  Changed 6 years ago by jg

  • owner changed from jg to marco
  • component changed from distro to sugar
  • milestone changed from Never Assigned to Update.1

Hmmmm.... The next question marco will have is to get some sugar logs....

  Changed 6 years ago by marco

  • status changed from new to closed
  • resolution set to invalid

Yeah without a way to reproduce and without logs we can't do much about it. Please reopen if you can reproduce on the latest build and provide logs.

  Changed 6 years ago by wad

  • status changed from closed to reopened
  • resolution deleted

Reopening, as this happened to me with build 699.

This one is going to be fun to find...

Start with a fresh install of 699 on MP laptops. I had set up several simple meshes, but never associated them with an access point. The laptops were then associated with an access point (a Mac laptop sharing an internet connection, possibly in ad-hoc mode?). When they were later turned on in an environment without that access point, all six laptops booted up without a simple mesh (missing network circle in home view).

Clicking on mesh channel 1 in the neighborhood view blinked the circle for a while, but didn't create a simple mesh. The circle didn't reappear.

Rebooting didn't fix the problem.

Removing /home/olpc/.sugar/default/nm/networks.cfg (which had a single entry for the mac laptop) appeared to fix the problem. But later, both rebooting and clicking on mesh 1 also fixed the problem (on other laptops).
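When repeating this experiment, it may be safer to move the file aside rather than delete it, so that its contents can still be attached to the ticket. The following is a minimal sketch under that assumption; the helper name `move_aside` and the timestamped `.bak` suffix are hypothetical, and only the path comes from the comment above.

```python
import os
import shutil
import time

# Path quoted in this ticket; adjust if the Sugar profile lives elsewhere.
NETWORKS_CFG = "/home/olpc/.sugar/default/nm/networks.cfg"


def move_aside(path=NETWORKS_CFG):
    """Rename the file with a timestamp suffix.

    Returns the backup path, or None if the file did not exist.
    (Hypothetical helper, not part of Sugar or NetworkManager.)
    """
    if not os.path.exists(path):
        return None
    backup = "%s.%s.bak" % (path, time.strftime("%Y%m%d-%H%M%S"))
    shutil.move(path, backup)
    return backup
```

Calling `move_aside()` before a reboot has the same effect on the next boot as removing the file, while keeping the old entries for later inspection.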

I tried unsuccessfully to recreate the problem by deleting /home/olpc/.sugar/default/nm/networks.cfg, accessing a regular access point, and then rebooting while the access point was off; a simple mesh was created fine.

The /var/log/messages file for one of the laptops is attached, the others were lost.

Changed 6 years ago by wad

/var/log/messages from a machine that is missing its network circle.

  Changed 6 years ago by wad

I should have mentioned that if it is a UI bug, it is quite complete. Not only was the network circle missing, but no other laptops were shown in the neighborhood view (not even ones that thought they had a simple mesh on that channel).

The single access point in the environment was shown correctly.

  Changed 6 years ago by wad

  • cc wad added

Changed 6 years ago by wad

This is a log file from another laptop that failed to join a simple mesh after failing to find any mesh portal points.

  Changed 6 years ago by wad

In testing simple mesh operation using build 699 on 25 laptops, I saw this behavior happen on two laptops. The /var/log/messages log from one is attached as messages2.

In both cases, there was a zero length /home/olpc/.sugar/default/nm/networks.cfg file. In both cases, rebooting the machine restored proper operation.
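Since the state of networks.cfg keeps coming up as a possible factor, a quick check on each affected laptop could record whether the file is missing, zero length, or populated. This is only a sketch: the function name `classify_networks_cfg` is hypothetical, and it looks at file size alone without assuming anything about the file's internal format.

```python
import os

# Path quoted in this ticket; adjust if the Sugar profile lives elsewhere.
NETWORKS_CFG = "/home/olpc/.sugar/default/nm/networks.cfg"


def classify_networks_cfg(path=NETWORKS_CFG):
    """Return 'missing', 'empty', or 'populated' for the given file.

    (Hypothetical helper for collecting test notes; it does not parse
    the file, it only checks existence and size.)
    """
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    return "populated"
```

Recording this value next to each /var/log/messages capture would make it easier to see whether the zero-length case correlates with the missing circle.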

  Changed 6 years ago by Blaketh

  • keywords release? added

  Changed 6 years ago by marco

  Changed 6 years ago by wad

Sugar logs and packet traces of this happening are available at:

http://wiki.laptop.org/go/Collaboration_Network_Testbed#Test_0321C

  Changed 6 years ago by wad

More logs and packet traces of this happening at:

http://wiki.laptop.org/go/Collab_Network_School_Wifi_Tests#Test_0408F

  Changed 6 years ago by wad

See also:

School WiFi cases:

Here we see two cases when overloading the AP at startup: http://wiki.laptop.org/go/Collab_Network_School_Wifi_Tests#Test_0410A

And then three tests in a row where the AP was not as loaded and no cases: http://wiki.laptop.org/go/Collab_Network_School_Wifi_Tests#Test_0410B http://wiki.laptop.org/go/Collab_Network_School_Wifi_Tests#Test_0410C

But when I add artificial loading, we see another case occurring! http://wiki.laptop.org/go/Collab_Network_School_Wifi_Tests#Test_0410D

School mesh cases (made worse by firmware 22.p8 ?):

http://wiki.laptop.org/go/Collab_Network_School_Mesh_Tests#Test_0410G http://wiki.laptop.org/go/Collab_Network_School_Mesh_Tests#Test_0410H

Lots of logs and packet traces now...

  Changed 6 years ago by yani

This also occurred with 706.

The XO was running its first boot after the upgrade.

One hour later, and without any intervention, the circle disappeared (check connections.log for the msh0/eth0 status at 18:40).

Once in this state, it couldn't associate with an AP (it blinked once and stopped, several times), and when a mesh was clicked, it jumped between several channels and finally settled. It worked fine after that.

** The significance here is that the bug occurred at a random point in time, without user intervention or reboot.

check the 706.** logs

  Changed 6 years ago by yani

  • cc kim, carrano added; kimquirk, dsd, wad removed

  Changed 6 years ago by yani

  • cc kimquirk, dsd, wad added

Changed 6 years ago by yani

Changed 6 years ago by yani

  Changed 6 years ago by carrano

There is something that is easily reproducible and may help understand what's going on.

(1) If you connect to an AP, the networks.cfg is populated and the next time you reboot, the XO will try to connect to this very AP.

(2) If this AP is not available (you turned it off or just moved to other location) and there is no other AP in the networks.cfg, the result will be no connection to anything and the symptom will be the absence of the circle as described here.

This seems related to the initialization routine of NetworkManager and will happen to any build.

  Changed 6 years ago by wad

Ricardo,

It is not so simple (at least in builds before 704). Take a look at the tests listed earlier, and you will see that some number of laptops that failed to connect to a server formed a simple mesh. Others failed to start the simple mesh after looking for a previously seen AP. They had all previously been associated with an AP, and had an entry in networks.cfg. This is not consistent, but appears to be a race condition in NM that is sometimes triggered.

  Changed 6 years ago by carrano

Replying to wad:

Ricardo, It is not so simple (at least in builds before 704). Take a look at the tests listed earlier, and you will see that some number of laptops that failed to connect to a server formed a simple mesh. Others failed to start the simple mesh after looking for a previously seen AP. They had all previously been associated with an AP, and had an entry in networks.cfg. This is not consistent, but appears to be a race condition in NM that is sometimes triggered.

Wad, I am just describing a scenario (which is easily reproducible) that causes the same symptom. Just a piece to the puzzle.

But do you mean "They had all previously been associated with an AP..." or They had *not* all previously been associated with an AP?

  Changed 6 years ago by wad

All laptops in those tests had at one time been associated with an AP, and would have had an entry in networks.cfg. As you pointed out, there is definitely a connection there, just not a direct one (unless 704+ are more broken than earlier builds).

Note that difficulty associating with the AP/mesh, due to congestion, is also one of the necessary conditions (probably because NM never gets to the end of its state machine if it succeeds in associating).

  Changed 6 years ago by yani

  • cc mbletsas added

Replying to wad and Ricardo:

In peabody most of the laptops were associated to an access point in the past.

But only a few of them showed the bug.

Also, Wad do you recall that some of the XOs showed in the mesh view and/or in iwlist an old linksys AP, which was turned down at the time?

Do you believe there is some type of correlation?

Maybe the XOs that "saw" this obsolete AP for some reason tried to connect to it, and eventually caused the situation described by Ricardo.

Also, we had speculated about an explanation for this "ghost AP". There was a beacon from an AP with the FF:FF:FF...FF address, and the XO might have linked this to a random AP from networks.cfg. If this is indeed true, it could be the source of many problems, failed associations, etc.

  Changed 6 years ago by yani

  • cc yani added

  Changed 6 years ago by carrano

Ok, so there is something we know for sure.

If the networks.cfg is populated, but the XO fails to connect to all of the access points listed there, it will not connect to anything (unless manually commanded) resulting in the no circle situation. We would expect it to at least connect to a simple mesh.

This behavior is something to fix in NetworkManager.

This does not account for all of the events described in this ticket, but it accounts for most of them. It is reproducible and it is definitely an unwanted behavior. So I suggest we start by checking NM and changing this behavior (failsafe should always be simple mesh). Right?
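The proposed failsafe can be sketched as follows. This is an illustrative model only, not NetworkManager's actual code: the function `choose_connection` and its callback parameters are hypothetical stand-ins for whatever NM does to associate with an AP or start a simple mesh.

```python
def choose_connection(saved_aps, try_associate, start_simple_mesh):
    """Try each saved AP in order; fall back to a simple mesh if all fail.

    saved_aps: list of AP names (hypothetical stand-in for the entries
        in networks.cfg)
    try_associate: callable(ap) -> bool, True on successful association
    start_simple_mesh: callable() -> bool, True if the mesh came up
    Returns a (kind, name) tuple; ("none", None) only if the mesh
    fallback itself also failed.
    """
    for ap in saved_aps:
        if try_associate(ap):
            return ("ap", ap)
    # Failsafe: a failed AP search always leads to a simple mesh attempt.
    if start_simple_mesh():
        return ("simple-mesh", None)
    return ("none", None)


# Example: the only saved AP is switched off, so the mesh is used.
result = choose_connection(["saved-ap"], lambda ap: False, lambda: True)
print(result)
```

In this model the "no circle" outcome corresponds to the final `("none", None)` return, which should only be reachable when starting the simple mesh itself fails.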

  Changed 6 years ago by wad

Ricardo, we were not seeing this problem 100% with build 703. It sounds like you are stating that with a more recent build you are never getting to simple mesh if there is anything in networks.cfg.

This is new behavior, because we had many more laptops go to simple mesh than lose their network completely (this bug), when they failed to associate with a school mesh or AP, and had previously been associated with an AP.

I agree completely that this is NetworkManager, and had isolated it to one of two code paths, but couldn't find any glaring errors.

  Changed 6 years ago by carrano

Replying to wad:

Ricardo, we were not seeing this problem 100% with build 703. It sounds like you are stating that with a more recent build you are never getting to simple mesh if there is anything in networks.cfg.

Not exactly. What I am saying is that if you fail to associate to the AP listed in networks.cfg you won't get to simple mesh. In my test bed of only 10 XOs, the only way I can make the association fail is to change the configuration or turn off the AP.

This is new behavior, because we had many more laptops go to simple mesh than lose their network completely (this bug), when they failed to associate with a school mesh or AP, and had previously been associated with an AP.

In my tests, this happens the same with 703 or 706.

I agree completely that this is NetworkManager, and had isolated it to one of two code paths, but couldn't find any glaring errors.

My guess is that the "state machine" does not have a fallback to simple mesh in case it cannot connect to the APs listed in networks.cfg.

  Changed 6 years ago by mtd

  • cc mtd added

  Changed 6 years ago by carrano

I was *totally* wrong and Wad is correct (there is more to it).

Running Update.1-706 I could not reproduce the bug. I tried to:

1. Create a bogus entry in networks.cfg.

2. Change the security configuration of the AP (to WEP, then back to open).

3. MAC-filter the XO on the AP.

All in order to have a failed connection to the AP and as a consequence reproduce this bug and collect logs. But *every* time the XO eventually and correctly connected to simple mesh.

There was clearly another condition that I was not aware of (and cannot find out now) that resulted in this no circle scenario.

Apologies for the misinformation. I am still trying to reproduce the bug.

  Changed 6 years ago by marco

  • keywords 8.2.0:? needs-testing added; release? removed

  Changed 6 years ago by mstone

  • keywords needs-testing removed
  • owner changed from marco to carrano
  • next_action set to diagnose
  • component changed from sugar to wireless
  • status changed from reopened to new

Any success reproducing this?

  Changed 6 years ago by mstone

(Incidentally, I picked the wireless component not because I blame it but simply because the wireless folks are the ones with the expertise to try to recreate this.)
