Ticket #7415 (closed defect: fixed)

Opened 6 years ago

Last modified 6 years ago

Neighborhood view shows my AP twice

Reported by: morgs Owned by: dsd
Priority: high Milestone: 8.2.0 (was Update.2)
Component: sugar Version: Development build as of this date
Keywords: blocks:8.2.0 r+ Cc: sjoerd, mstone, mtd, gregorio, erikg, dcbw, marco, erikos, wad, kimquirk, joe
Action Needed: approve for release Verified: no
Deployments affected: Blocked By:
Blocking:

Description

Joyride 2097, on reboot after upgrading from much earlier joyride:

I see my AP twice in Neighborhood view. One of them shows Connected, and the other shows Connect on the secondary palette.

Perhaps this is the same issue as http://lists.laptop.org/pipermail/sugar/2008-June/006700.html - since my AP is in networks.cfg and I no longer see my neighbor's AP on my machines - he probably stopped broadcasting his ESSID.

Attachments

changes.txt (4.7 kB) - added by erikg 6 years ago.
changelog in from sugar 0.75.14 to 0.75.15, produced via git diff 0a7ad2022..v0.75.15 (thanks to dsd)
nm-tool.IIEE_doubled (1.7 kB) - added by erikg 6 years ago.
output of nm-tool when we see two ap icons on the neighborhood view
0001-Avoid-network_appeared-race.patch (1.7 kB) - added by dsd 6 years ago.
0001-Avoid-adding-already-appeared-networks-twice.patch (1.5 kB) - added by dsd 6 years ago.

Change History

Changed 6 years ago by morgs

This might be a general Network Manager issue: I have recently seen APs appear on my ubuntu laptop that I last used ages ago elsewhere in the world.

Changed 6 years ago by morgs

  • cc sjoerd, mstone added
  • owner changed from marco to dcbw
  • component changed from sugar to network manager

Martin Langhoff replied on sugar@:

I agree. My ubuntu hardy laptop has the same vice - there's an old ESSID (from an AP in Mexico DF!) that appears sometimes, probably when there are APs with empty ESSIDs. I call NM bug.

Reassigning to network mangler.

Changed 6 years ago by gregorio

  • milestone deleted

Milestone Never Assigned deleted

Changed 6 years ago by morgs

  • owner changed from dcbw to sjoerd

Changed 6 years ago by erikg

Build image per711-1 exhibits this issue, but per703-6 does not. I have verified via testing that this is the case. Peru would really not like to deploy many thousands of laptops with the issue, so I started investigating. The issue appears between sugar-0.75.14-1.olpc2 and sugar-0.75.15-1.olpc2, the only version bump between per703-6 and per711-1 in apparently related software. (NetworkManager is the same version on both builds).

Changed 6 years ago by erikg

changelog in from sugar 0.75.14 to 0.75.15, produced via git diff 0a7ad2022..v0.75.15 (thanks to dsd)

Changed 6 years ago by morgs

erikg, did you have identical nm/networks.cfg on per706-3 and per711-1?

Changed 6 years ago by morgs

erikg, from git log the tickets fixed in that changeset are: #5904, #3486, #3611 and #4084. I don't see anything relevant to this...

Changed 6 years ago by erikg

output of nm-tool when we see two ap icons on the neighborhood view

Changed 6 years ago by kimquirk

  • keywords blocks?:8.2.0 added
  • priority changed from normal to high
  • milestone set to 8.2.0 (was Update.2)

I'm upgrading this item so we can connect it to the other trac items related to the change in neighborhood view, which are blocking.

Changed 6 years ago by mstone

For historical perspective, note that something like this issue was reported as early as 10 months ago in #4267.

Changed 6 years ago by mtd

  • cc mtd added

This may be related to #6944 or #2866. Solving those will probably make this easier to debug, as the information available to the GUI model/view classes will be better used.

Changed 6 years ago by gregorio

  • cc gregorio added
  • keywords blocks:8.2.0 added; blocks?:8.2.0 removed

Hi Guys,

I need to upgrade this to a blocker since Peru mentioned it as a deal breaker for them.

Sjoerd,

Can you work on this as your top priority? Let me know if you are working on a different blocker which will have to wait for this one.

Thanks,

Greg S

Changed 6 years ago by dsd

  • owner changed from sjoerd to dsd

I'll work on this

Changed 6 years ago by dsd

Changed 6 years ago by dsd

  • cc erikg, dcbw added
  • next_action changed from diagnose to review
  • component changed from network manager to sugar

This is a race. Under normal circumstances, here is what happens inside the Device class (src/hardware/nmclient.py):

  1. init asynchronously fires off a getProperties request over dbus
  2. Some time later, the getProperties request completes and _update_reply_cb is called with the results
  3. _update_reply_cb calls _update_networks which unconditionally creates Network objects for all networks that were returned

In the problematic case that we see here, the following happens:

  1. init asynchronously fires off a getProperties request over dbus
  2. NetworkManager generates NetworkAppeared messages for some networks, causing network_appeared() to be invoked and create a corresponding Network object
  3. Some time later, the getProperties request completes and _update_reply_cb is called with the results
  4. _update_reply_cb calls _update_networks which unconditionally creates Network objects for all networks that were returned

So, we end up with the a Network object for the same WLAN being created in steps 2 and 4.

We are not really sure why NetworkManager generates NetworkAppeared messages at this early stage. Does it always generate NetworkAppeared messages for all networks that are listed in the config file? Does it see a hidden network and assume that it corresponds to an entry in the config file? Given that we soon want to upgrade to NM-0.7 (which will shake up all this code anyway), we don't think it's worth investigating. The NM-0.6 source code did not offer any immediately obvious answers.

I see 2 easy ways of fixing this, patches attached:

  1. 0001-Avoid-network_appeared-race.patch: Ignore all NetworkAppeared messages until we've finished initializing
  2. 0001-Avoid-adding-already-appeared-networks-twice.patch: In _update_networks(), check that we didn't already create the Network earlier

I can't decide which one I like best. Possibly (1), except I'm not 100% sure on the assumption I detail in the comments.

Just to clarify, these patches are either-or: pick one or the other, not both. I tested both separately, both independently solve the bug.

Changed 6 years ago by cscott

  • cc marco, erikos added
  • keywords r? added

I prefer the first patch, fwiw.

Changed 6 years ago by marco

mtd, would be nice if you could review this one, since you are the only one familiar with that code.

Changed 6 years ago by mtd

  • keywords r+ added; r? removed
  • next_action changed from review to testcase

r+ on the first patch. The second patch would work just as well modulo s/return/continue/, but for some reason I like the first better.

Changed 6 years ago by dsd

  • next_action changed from testcase to package

pushed to master and sucrose-0.82

Changed 6 years ago by marco

  • next_action changed from package to test in build

Changed 6 years ago by marco

  • next_action changed from test in build to approve for release

The patch does not seem to have negative side effects in joyride. We have never seen the twice AP bug here, so we are unable to verify that it's actually fixed. I think we should get it into the stable build and then move to "qa signoff".

Changed 6 years ago by gregorio

  • cc wad, kimquirk added

Hi Guys,

We can't verify this one? Bummer. We are doing this because Peru specifically complained about this problem. Do we think we can ask them to try it?

It would be embarrassing if we ask them to do that and its not fixed. On the other hand if they can reliably reproduce the problem and can show that this resolves it they may appreciate that we are responsive to their input.

If the code is in a weekly release which passes smoke test, let's get Kim's OK to ask them to consider this test and to look at the whole image for comment.

Please take the extra step of synching us up internally on who will communicate and work on this with the deployment before reaching out to them. We don't want a flood of different people and messages going out in an uncoordinated way.

Thanks,

Greg S

Changed 6 years ago by marco

Gregorio, my idea was to ask the QA team to verify it, once it's in a stable build. Does that make sense?

Changed 6 years ago by gregorio

If they can reproduce it and prove its fixed that would be my first choice.

Thanks,

Greg S

Changed 6 years ago by dsd

To reproduce at 1cc: clean install, connect to media lab network, reboot. Seems to work every time during my testing, before the fix. You can verify the fix is working by adding a log message in the race-avoidance path.

Of course, we have no way of actually knowing that they hit this exact bug in peru - we just know they saw the same symptoms. And I'd say it's likely the same thing.

Changed 6 years ago by gregorio

Just to be clear I meant that if QA (Joe) can reproduce that's my first choice.

Thanks,

Greg S

Changed 6 years ago by joe

  • status changed from new to closed
  • resolution set to fixed

Tested with 8.2-759. No problem found, will close this ticket.

Joe

Changed 6 years ago by joe

  • cc joe added
Note: See TracTickets for help on using tickets.