Ticket #5501 (reopened defect)

Opened 6 years ago

Last modified 6 years ago

XOs vanishing/reappearing/flashing in the mesh view is related to avahi losing its cache of hosts

Reported by: yani Owned by: sjoerd
Priority: high Milestone: 8.2.0 (was Update.2)
Component: telepathy-salut Version:
Keywords: relnote 8.2-762:- blocks?:8.2.0 Cc: sjoerd, gdesmott, morgs, marco, jg, krstic, daf, yani, carrano, mbletsas, gregorio
Action Needed: reproduce Verified: no
Deployments affected: Blocked By:
Blocking:

Description

The host cache of the Avahi daemon is rechecked every 1-2 minutes. When a host in the cache cannot be resolved, the entry is not deleted but is instead marked as "Failed".

If a new host arrives while one or more other hosts (not counting the new arrival) are already reported as "Failed", Avahi loses this cache.

The result is that all "Failed" XOs disappear from the mesh view instantly.

The best and simplest way to recreate the effect in any environment (noisy or not) is to:

1. Successfully connect 3 XOs to the same mesh.
2. Move XO1 and XO2 to another channel, and verify that they show as "Failed" when running "avahi-browse" on XO3.
3. Reconnect XO1 and XO2 to the initial channel at the same time.
4. While the XOs are trying to connect (30 sec), check that they still show as "Failed" when running "avahi-browse" on XO3.
5. Observe the screen on XO3: the icons of XO1 and XO2 will jump at almost the same time.

To the best of my understanding, it is not related to a noisy environment, does not require a large number of laptops, and can be recreated 100% of the time with the steps above.

Attachments

avahi_poof.patch (2.3 kB) - added by sjoerd 6 years ago.
avahi.W01.5501.log (56.0 kB) - added by yani 6 years ago.
dbusW01.5501.log (0.6 MB) - added by yani 6 years ago.
tcpdump.W01.5501.out (12.1 kB) - added by yani 6 years ago.
logs.CSN7440001C.2008-01-31.08-36-10.tar.bz2 (52.2 kB) - added by yani 6 years ago.
screenshots.zip (123.5 kB) - added by yani 6 years ago.

Change History

  Changed 6 years ago by yani

  • cc sjoerd added

  Changed 6 years ago by jg

  • milestone changed from Never Assigned to Update.1

  Changed 6 years ago by gdesmott

  • cc gdesmott added

follow-up: ↓ 5   Changed 6 years ago by jg

  • cc morgs, marco, jg, krstic added

I wonder if this is exacerbating the presence service related bug that is causing our memory hemorrhage in the sugar-shell?

in reply to: ↑ 4   Changed 6 years ago by yani

Replying to jg:

I wonder if this is exacerbating the presence service related bug that is causing our memory hemorrhage in the sugar-shell?

No, not as far as I am aware. This bug has been present forever. The first time I observed it was around 608 or before. It is related to how the Avahi daemon handles new hosts. I believe it is a completely different issue.

  Changed 6 years ago by jg

Heh. We believe the sugar-shell #5501 bug has been around forever as well, according to Michailis; but before the firmware started working better and the presence code was fixed in various ways, we could never have very many icons on the screen at once. Or that is my current theory...

  Changed 6 years ago by sjoerd

  • cc daf added

The mDNS spec describes a technique called Passive Observation of Failures. This allows failures to be detected before the TTL of a record runs out. It works by observing the queries of others: if no responses are observed after several queries, the record can be assumed to have failed.

The way this was implemented in Avahi is that a record is expunged from the cache if there isn't a response within a second of the _second_ query for it. When you introduce new nodes running salut into a network (or, as in your example, let them reconnect), they query the _presence._tcp PTR record, which can easily cause other nodes to observe certain records as failed. This in turn explains why you see the nodes jump as soon as you introduce nodes into the network.

I'll attach a patch for Avahi that makes the code a bit more demanding in what it needs to see before it assumes it has observed a failure. Instead of two queries, there need to be at least 4 queries, and each query has to be at least one second apart. This should be enough to prevent most false positives. Please see if this helps on the OLPC network too.
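As a rough illustration of the two heuristics described above, here is a toy Python model. It is not Avahi's actual code: the class, field and method names are hypothetical, and it assumes that a query only counts if it arrives at least min_spacing seconds after the previously counted one, and that the record is treated as failed once min_queries counted queries have gone a further second without a response.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoofTracker:
    """Toy model of Passive Observation of Failures for one cached record.

    All names here are illustrative assumptions, not Avahi identifiers.
    """
    min_queries: int            # unanswered queries needed before failure
    min_spacing: float          # seconds required between counted queries
    grace: float = 1.0          # seconds to wait for a response afterwards
    counted: int = 0
    last_query: Optional[float] = None

    def on_query(self, now: float) -> None:
        # Count the query only if it is far enough from the last counted one.
        if self.last_query is None or now - self.last_query >= self.min_spacing:
            self.counted += 1
            self.last_query = now

    def on_response(self, now: float) -> None:
        # Any observed response proves the record is alive: start over.
        self.counted = 0
        self.last_query = None

    def has_failed(self, now: float) -> bool:
        return (
            self.last_query is not None
            and self.counted >= self.min_queries
            and now - self.last_query >= self.grace
        )


# Original behaviour as described: two queries, no spacing requirement.
old_rule = PoofTracker(min_queries=2, min_spacing=0.0)
# Patched behaviour as described: four queries, each at least 1 s apart.
new_rule = PoofTracker(min_queries=4, min_spacing=1.0)

# Three reconnecting nodes query the same record in a quick burst.
for t in (0.0, 0.2, 0.4):
    old_rule.on_query(t)
    new_rule.on_query(t)

print("old rule expunges:", old_rule.has_failed(1.5))
print("new rule expunges:", new_rule.has_failed(1.5))
```

In this model a burst of queries from reconnecting nodes is enough to trip the two-query rule, while the stricter rule counts the burst as a single query and keeps the record.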

Daf is currently working on getting the patch into the joyride avahi package.

  Changed 6 years ago by gdesmott

  • owner changed from sjoerd to yani

Patch is in Joyride since build 1474. Could you try with it and report how it improves things?

  Changed 6 years ago by daf

I've made an Avahi build with Sjoerd's patch:

http://koji.fedoraproject.org/koji/buildinfo?buildID=29084

Please test it and tell us if it helps with the problem.

  Changed 6 years ago by gdesmott

This patch is now in Update.1 688

  Changed 6 years ago by yani

  • cc yani, carrano, mbletsas added

The bug is still here even after the patch, but is a little different.

4 XOs (W01, W02, W03, W05) were connected to ch11, and W02, W03, W05 left. After some time they showed as "Failed to resolve" in avahi-browse. Only W02 was returned to ch11. Then W03 and W05 almost instantly vanished from the screen.

You might find this helpful:

08:03 W02, W03, W05 moved to another channel (I think W05 was moved a little later)

08:25 by that time all were gradually reported as "Failed" by avahi-browse

08:26 I moved W02 back to channel 11

08:27 avahi-browse crashed (it was running continuously from a script) + Avahi cache cleared + icons vanished from the screen

All W01 logs are in the tarball. The avahi-browse output is in avahi.W01.5501 (clean timestamped output) and the dbus-monitor output is in dbusW01.5501. There are also a couple of timestamped screenshots to show that the icons vanished exactly at 08:27.
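For anyone trying to reproduce this, a timestamped log of this kind could be captured with something like the Python sketch below. This is not the script used for the attached logs; the function names are hypothetical, and the command to run is left as a parameter.

```python
import subprocess
import time
from typing import Callable, Iterable, Iterator, List


def stamp_lines(lines: Iterable[str], clock: Callable[[], str]) -> Iterator[str]:
    """Prefix each line with the current value of clock()."""
    for line in lines:
        yield f"{clock()} {line.rstrip()}"


def wall_clock() -> str:
    """Local time as HH:MM:SS, matching the timeline format used above."""
    return time.strftime("%H:%M:%S")


def run_and_stamp(cmd: List[str]) -> None:
    """Run cmd and print its stdout line by line with a timestamp prefix."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for stamped in stamp_lines(proc.stdout, wall_clock):
            print(stamped, flush=True)
```

For example, calling run_and_stamp(["avahi-browse", "--all", "--resolve"]) on the observing XO and redirecting the output to a file would give one timestamped line per browse event, assuming avahi-browse is installed there.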

  Changed 6 years ago by gregorio

  • owner changed from yani to sjoerd
  • next_action set to finalize

Sjoerd,

Can you write the release notes on this?

http://wiki.laptop.org/go/Release_Notes/8.2.0#Network-related_issues

Thanks,

Greg S

  Changed 6 years ago by gregorio

  • keywords relnote added
  • status changed from new to closed
  • resolution set to invalid

  Changed 6 years ago by sjoerd

I've updated the release notes with a fairly short note. Please let me know if you'd like me to expand it further.

  Changed 6 years ago by mstone

  • keywords 8.2-762:- added
  • status changed from closed to reopened
  • next_action changed from finalize to reproduce
  • resolution deleted

The QA team reported seeing visual behavior on 8.2-762 similar to that described in this ticket. Could you please write up some testing notes so that we can start puzzling out what's actually going on here? Any other thoughts on how to approach this issue?

  Changed 6 years ago by mstone

  • keywords blocks?:8.2.0 added

  Changed 6 years ago by gregorio

  • cc gregorio added

Hi Sjoerd,

Is this the note you added? "it takes some time for the link-local presence information to stabilize."

I want to make it as user-friendly as possible. Something like: when first going to the Neighborhood view, the available XOs and activities may not appear correctly for a while (or may move around?). After nnn minutes all the available XOs and activities should be seen.

Let me know if you can explain it in terms like that, which will tell a non-technical user what to do.

Thanks,

Greg S

  Changed 6 years ago by gdesmott

What about something like this?

When you are connected to a network or on the simple mesh without a school/jabber server present, XOs and activities may not appear correctly for a while. After around 5 minutes all the available XOs and activities should be seen.
