Ticket #2412 (closed defect: fixed)

Opened 7 years ago

Last modified 6 years ago

Buddies not disappearing even when connection does

Reported by: Zack Owned by: morgs
Priority: blocker Milestone: Update.1
Component: presence-service Version:
Keywords: collaboration Update.1? review+ Cc: Zack, cjb, hughsie, morgs, smcv, gdesmott, jg, ypod, daf
Action Needed: Verified: no
Deployments affected: Blocked By:
Blocking:

Description

I was at a meeting at 1CC and had 20ish XOs in my neighborhood screen. I suspended the XO and went home. I have no network connection on the XO right now, but none of the buddies have disappeared. This could be the same as #2308, but I'm reporting it anyway just in case.

Change History

  Changed 7 years ago by Zack

build 530.

  Changed 7 years ago by smcv

  • owner changed from morgan to morgs

  Changed 7 years ago by coderanger

Adding reporter to CC list

  Changed 7 years ago by coderanger

  • cc Zack added

Adding reporter to CC list

  Changed 7 years ago by daf

Were you connected to the mesh or to an AP while at 1cc?

  Changed 7 years ago by Zack

When I left I think I was on an AP.

  Changed 7 years ago by jg

  • cc changed from , Zack to Zack
  • priority changed from high to blocker
  • milestone changed from Untriaged to Trial-2

This could be pretty bad, particularly in a school of 200 kids.

  Changed 7 years ago by jg

  • cc cjb, hughsie added
  • milestone changed from Trial-2 to Trial-3

Sounds like lidswitch needs to tell OHM to tell NM to reassociate and tell Sugar.

The ankle bone is connected to the thigh bone is connected to the knee bone is connected to the hip bone.

  Changed 7 years ago by marco

  • owner changed from morgs to dcbw
  • milestone changed from Trial-3 to Untriaged

Reassigning to dcbw since it seem like OHM/NM interaction. I'm not sure if the presence service would cleanup stuff correctly on NM reconnection though.

  Changed 7 years ago by jg

  • milestone changed from Untriaged to Trial-3

There should probably be some delay before telling NM to reassociate.

Here's the reasoning: a teacher in a classroom may tell the kids "close the laptops" to get their attention. Yet we want instant on to work when the kids resume.

I'd make the timeout just over an hour.

  Changed 7 years ago by smcv

  • keywords collaboration added

  Changed 7 years ago by daf

There's a patch somewhere to Loudmouth to add TCP keepalive/timeout stuff. We could try fishing this out and applying it to our builds.

  Changed 7 years ago by kimquirk

  • milestone changed from Trial-3 to First Deployment, V1.0

Let's try to get to this for FRS.

follow-up: ↓ 15   Changed 7 years ago by BigBaaadBob

I have a situation that presents similar symptoms with a different stimulus.

Procedure: Clean-flash an XO with 542.3 (using, say, X+Square game buttons). After the XO is authorized and booted up, turn it off (using shutdown). Then insert a USB Ethernet adapter (on a DHCP network if that matters) and power on the laptop. The neighborhood screen will show around 20 neighbors that don't exist and can't be gotten rid of.

Even before this gets fixed, it would be nice if there were a manual way to fix this by getting rid of the stale neighbors somehow.

in reply to: ↑ 14   Changed 7 years ago by BigBaaadBob

Replying to BigBaaadBob:

I have a situation that presents similar symptoms with a different stimulus.

Never mind. The sugar default is set to connect to jabber.laptop.org. Removing that makes this situation disappear. Sorry.

  Changed 7 years ago by gdesmott

  • cc mors, smcv, gdesmott added

  Changed 7 years ago by kimquirk

  • owner changed from dcbw to smcv

  Changed 6 years ago by smcv

  • cc morgs, jg added; mors removed

[Adding jg, kimquirk to Cc since we need an opinion on what the policy ought to be - please Cc UI people if needed]

Gabble has a TCP connection to the server. When that TCP connection times out, it'll signal disconnection. There are really several things going on here:

Issue 1: Does the Presence Service respond correctly to disconnections? - does it attempt to reconnect to Gabble? does it bring up Salut? does it signal that all buddies went offline? I'll have a look at the PS and see what it actually does, and whether we have a bug here.

Issue 2: In an ideal world, how soon should a loss of connectivity cause a disconnection to be signalled by the CMs (and hence the above behaviour), i.e. how hard should we try to detect timeouts? This is a policy question on which we need input from UI/platform designers - jg's comment "I'd make the timeout just over an hour" conflicts rather with any attempt to detect disconnections immediately!

Issue 3: How soon *does* a loss of connectivity cause a disconnection to be signalled by the CMs? Does it match (2)? If the conclusion of (2) is that we should try harder, possible things we could do include TCP keepalives as daf suggested, forcibly disconnecting Gabble in response to NetworkManager signals, etc.

  Changed 6 years ago by jg

  • cc ypod added

In general, you should *not* try to detect failures....

Reporting connection failures will, in fact, just cause more failures that *do not matter* to end users, as well as pollute the network with unneeded transactions. Do *not* use TCP keep-alives. They are *evil*. People should only get failures which matter; reporting or forcing failures in the face of an unreliable network in advance of actual need just causes trouble.

All we really care about is that when the network state changes, or a long time has passed, we will eventually clean up buddies. So I would pay attention to NetworkManager signals, and forcibly reconnect if NM tells you the network state has changed.

Note that we'll be able to get much better information from the wireless about presence in the future; the Marvell chip keeps quite a bit of information (which may or may not currently be available) that can help, along with algorithms that Polychronis has been working with and put into his activity (a good place to look for ideas and advice). He can give immediate advice of what is possible.

follow-up: ↓ 21   Changed 6 years ago by smcv

morgs, could you please check the Presence Service code? Don't necessarily fix anything right now (though that'd be good) but it'd be useful to know the status.

- When connections signal that they have become disconnected via the Telepathy API, ensure that we remove the relevant handle from all buddies on that connection, causing them to emit BuddyDisappeared if they are not visible on any other connection

- When Gabble becomes disconnected, ensure that we bring Salut back up

- When N-M reports a disconnection, do we forcibly close Gabble connections?

If the answer to the second point is "no", I'm not convinced we should. It's entirely possible that after a transient disconnection + reassociation we'd come back online with the same IP address and our connections would be unaffected, so perhaps instead of forcibly closing Telepathy connections when N-M says we went offline, we should set some arbitrary timer and forcibly close the connection when it expires, unless we've gone back online with the same address in the meantime.

It probably makes sense to close Gabble connections immediately if our IP address changes from one non-empty value to another, though, since that means we lose continuity anyway.

Salut has its own internal logic for presence advertisements (and hence buddies' presence) timing out (it uses the recommended mDNS TTLs) so it never makes sense to kill the Salut connection due to disconnection. It also makes sense to have Salut running even when we're not on any network - Avahi does the right thing in any case, and the Salut "connection" being "connected" just means Salut is talking to Avahi.

in reply to: ↑ 20 ; follow-up: ↓ 22   Changed 6 years ago by morgs

Replying to smcv:

morgs, could you please check the Presence Service code? Don't necessarily fix anything right now (though that'd be good) but it'd be useful to know the status. - When connections signal that they have become disconnected via the Telepathy API, ensure that we remove the relevant handle from all buddies on that connection, causing them to emit BuddyDisappeared if they are not visible on any other connection

Looks like we don't.

- When Gabble becomes disconnected, ensure that we bring Salut back up

We do.

- When N-M reports a disconnection, do we forcibly close Gabble connections?

We do self._conn[CONN_INTERFACE].Disconnect() (wrapped in an evil try: except: pass)

If the answer to the second point is "no", I'm not convinced we should. It's entirely possible that after a transient disconnection + reassociation we'd come back online with the same IP address and our connections would be unaffected, so perhaps instead of forcibly closing Telepathy connections when N-M says we went offline, we should set some arbitrary timer and forcibly close the connection when it expires, unless we've gone back online with the same address in the meantime. It probably makes sense to close Gabble connections immediately if our IP address changes from one non-empty value to another, though, since that means we lose continuity anyway.

I agree - but we don't do anything currently in this case.

Salut has its own internal logic for presence advertisements (and hence buddies' presence) timing out (it uses the recommended mDNS TTLs) so it never makes sense to kill the Salut connection due to disconnection. It also makes sense to have Salut running even when we're not on any network - Avahi does the right thing in any case, and the Salut "connection" being "connected" just means Salut is talking to Avahi.

AFAICT Salut always runs when gabble doesn't, assuming it can see avahi on the system bus. Only gabble starting stops salut.

in reply to: ↑ 21 ; follow-up: ↓ 23   Changed 6 years ago by smcv

  • owner changed from smcv to morgs

Replying to morgs:

Replying to smcv:

- When connections signal that they have become disconnected via the Telepathy API, ensure that we remove the relevant handle from all buddies on that connection, causing them to emit BuddyDisappeared if they are not visible on any other connection

Looks like we don't.

OK, please implement that on a branch. Ping me for review.

- When N-M reports a disconnection, do we forcibly close Gabble connections?

We do self._conn[CONN_INTERFACE].Disconnect() (wrapped in an evil try: except: pass)

except dbus.DBusException, e: some_logger_or_other.debug('%s Disconnect(): %s', conn.object_path, e) please :-)

It probably makes sense to close Gabble connections immediately if our IP address changes from one non-empty value to another, though, since that means we lose continuity anyway.

Lower priority but yes please.

AFAICT Salut always runs when gabble doesn't, assuming it can see avahi on the system bus. Only gabble starting stops salut.

Great, as long as Gabble stopping starts Salut, we're doing the right thing in all cases then.

While you're touching this code already, it might be useful to incorporate some more debug so we can get some idea what's happening in the various "I'm on an AP but Gabble isn't connected" bugs. Have a look at Yani's bug?

in reply to: ↑ 22 ; follow-up: ↓ 25   Changed 6 years ago by morgs

Replying to smcv:

Replying to morgs:

Replying to smcv:

- When connections signal that they have become disconnected via the Telepathy API, ensure that we remove the relevant handle from all buddies on that connection, causing them to emit BuddyDisappeared if they are not visible on any other connection

Looks like we don't.

OK, please implement that on a branch. Ping me for review.

Actually, _buddy_validity_changed_cb calls BuddyDisappeared so it does work.

follow-up: ↓ 26   Changed 6 years ago by smcv

It occurs to me that for completeness, we should watch each connection's well-known name using the Bus.watch_name_owner() method; if it becomes unowned (because the CM crashes), we should treat that just like DISCONNECTED.

in reply to: ↑ 23   Changed 6 years ago by smcv

Replying to morgs:

Actually, _buddy_validity_changed_cb calls BuddyDisappeared so it does work.

Yes, but when is _buddy_validity_changed_cb called? Does a disconnecting CM cause it to be called?

This can be tested by using the Terminal activity to call Disconnect() on the connection, using dbus-send. PS should notice the disconnection.

For the suggestion in my previous comment, you can test by simulating a crash: find out the pid of telepathy-gabble (e.g. 12345) and kill it with SIGSEGV or SIGABRT.

in reply to: ↑ 24 ; follow-up: ↓ 27   Changed 6 years ago by bert

Replying to smcv:

It occurs to me that for completeness, we should watch each connection's well-known name using the Bus.watch_name_owner() method; if it becomes unowned (because the CM crashes), we should treat that just like DISCONNECTED.

Does this affect an activity that does not use the Python API? I mean, who has to watch this? If this is the responsibility of the activity, where are all the responsibilities documented?

in reply to: ↑ 26   Changed 6 years ago by morgs

Replying to bert:

Replying to smcv:

It occurs to me that for completeness, we should watch each connection's well-known name using the Bus.watch_name_owner() method; if it becomes unowned (because the CM crashes), we should treat that just like DISCONNECTED.

Does this affect an activity that does not use the Python API? I mean, who has to watch this? If this is the responsibility of the activity, where are all the responsibilities documented?

No, this is Presence Service watching the Telepathy connection managers :)

  Changed 6 years ago by smcv

morgs: ping? How are you doing on this? Is there any code? Can I review it?

  Changed 6 years ago by daf

  • cc daf added
  • keywords Update.1? added

There's a patch in Loudmouth git which turns on TCP keepalives when available. It sends out a TCP keepalive packet when the connection is idle for 30 seconds. Keepalives are sent out every 30 seconds. After 3 keepalives time out, the connection is closed. This would mean that the connection would be closed after 2 minutes if the link goes down. I'm not sure how the timeouts interact with suspend/resume.

Watching network manager might still be a better solution, but adopting a newer Loudmouth version might ameliorate this bug.

(Loudmouth is a library that Gabble uses.)

  Changed 6 years ago by jg

keep-alives are evil.

Do the math.

Say we have 1000 students in a school, and everyone does this every 30 seconds.

That's lots of packets/many per second.

Until we have better mechanisms, (e.g. Polychonis' presence stuff), we have to tolerate much longer terms of absence, and cannot be probing the network like this.

What is more, on a lossy network, using keepalives will cause many "false positives"; detecting transient failures that don't matter just causes grief.

  Changed 6 years ago by morgs

  • keywords review? added
  • milestone changed from Update.2 to Update.1

  Changed 6 years ago by smcv

  • keywords review- added; review? removed

Patch improvements requested on IRC, morgs is working on it.

  Changed 6 years ago by smcv

  • keywords review+ added; review- removed

Approved, following IRC discussion and several revised patches.

  Changed 6 years ago by jg

I trust this is *not* using keep-laives. They will just get us into other trouble......

  Changed 6 years ago by smcv

  • status changed from new to closed
  • resolution set to fixed

jg: No, the keepalives are a Loudmouth feature (the XMPP library Gabble uses) and we haven't proposed changing the version of Loudmouth shipped on the OLPC.

daf: In the LM version that has keepalives, is there API to enable or disable them? We could expose that as a boolean ConnectionManager parameter, so they're enabled on Maemo (I assume that's where the patch came from) and disabled on OLPC.

The fix actually committed (to the Presence Service) was to track the life cycle of the connection managers (to spot crashes), and forcibly disconnect Gabble if our IP address changes (since the TCP connection can't possibly keep working in that case). This is now in a PS build in Koji, so it should land in Joyride soon, if it hasn't already.

  Changed 6 years ago by jg

Cool. Thanks for the explanation.... Just say after me: "keep alives are evil"...

http://www.rfc-archive.org/getrfc.php?rfc=1122 section 4.2.3.6, particularly the first paragraph of the discussion.

  Changed 6 years ago by daf

The current situation is as follows. Loudmouth has a facility for sending keepalives that operates by periodically sending a single space to the server. Gabble configures Loudmouth to send this keepalive every 30 seconds that we're idle.

The rationale for this approach was:

  • Presence is time-sensitive, so detecting network failure is important. In practice, this doesn't really work very well with the whitespace approach, because the TCP timeout is generally 2 hours anyway. Using TCP keepalives supports detecting disconnections better.
  • Gabble is sometimes used behind NATs with aggressive timeout policies. Periodically sending data keeps the connection mapping alive.

Note that a failure on the local link is not the same as a failure in the connection to the Jabber server. If the Jabber server falls over, it will still be possible for packets to get to it, but we'll never get any replies.

Adding a 'keepalive' parameter to Gabble could be a good compromise.

I don't think we should worry too much about this right now; if we are concerned about Gabble's bandwidth usage then there are much worse offenders to go after.

  Changed 6 years ago by daf

(By a keepalive parameter, I meant a boolean, so that it could be turned off entirely if we decide that it's not something that we want at all.)

  Changed 6 years ago by jg

daf,

This means you will detect failure even if the network event is transient. So you will end up, at an application layer, provoking failures and exceptions *much* *much* more often than if you do not.

In a busy network environment, (say 1000 machines to the server), this is 30 PPS. I don't know what the bandwidth works out to, but it is not just one byte/packet; it is the entire size of the packet plus the size of the wireless preamble/postamble.

We certainly don't need to know that someone goes away within 30 seconds.

Keepalives are evil....

Note: See TracTickets for help on using tickets.