Ticket #6299 (closed defect: fixed)

Opened 6 years ago

Last modified 6 years ago

presence service should disable salut in the presence of school servers on mesh

Reported by: robot101
Owned by: Collabora
Priority: high
Milestone: Update.1
Component: presence-service
Version:
Keywords: release?
Cc: jg, dwmw2, mbletsas, dcbw, cjb, mstone, wad, dgilmore, Collabora
Action Needed:
Verified: yes
Deployments affected:
Blocked By:
Blocking:

Description

Currently the algorithm in presence service starts Salut automatically when avahi is available, and then when we get an IPv4 address from Network Manager, we start trying to connect with gabble in parallel, and disable salut if we succeed at connecting with gabble.

Salut makes mDNS queries to discover presence records when it's running. Combined with flood-fill multicast, this clogs the network with mDNS resolution traffic and actively prevents other users from DHCPing, registering, or connecting to the XMPP server, which in turn keeps them on mDNS and exacerbates the problem further still.

However, given that DHCP on the XOs is modified to unicast to a well known "anycast" MAC address on the school server, we can look for this MAC address to determine that a school server is present, and we should then shut down salut to stop *any* mDNS activity.

dwmw2 can confirm a way to determine that this MAC address is reachable, but it seems to be something along the lines of: try to send traffic to it somehow, then inspect the mesh routing table to see if it appeared in there.

The algorithm would then look more like:

stop salut
if we're on mesh, and the school server is reachable by anycast:
    while 1:
        try to connect with gabble
        sleep
else:
    connect with salut
    while 1:
        try to connect with gabble
        if we succeed:
            stop salut
        else:
            sleep

We should run this algorithm every time we arrive on a new network, and when we resume from sleep.

We need to decide three timeouts:

1) start of day: how long do we look for the school server anycast MAC before we decide we're OK to enable salut?
2) kids wander off/server dies: how long after we're disconnected from the school server before we decide it's OK to enable salut?
3) kids wander back/server resurrected: how often do we look for the school server anycast address when we're using salut, to decide we should switch salut off and wait for gabble to work instead?

Attachments

6299.patch (11.6 kB) - added by gdesmott 6 years ago.

Change History

Changed 6 years ago by gdesmott

dwmw2: could you tell us more about this MAC address? How can we find it? How do we test the connectivity? etc.

Changed 6 years ago by dwmw2

  • cc mbletsas added

I think the anycast MAC address is something like 00:17:c4:00:00:01, with the final three octets being set from userspace in sysfs. It should be documented in the libertas firmware manual, but I don't believe it is -- you could look in the dhclient configuration file or just sniff the network as it tries to do DHCP on the mesh.

You should be able to send an Ethernet packet to that MAC address, then use the fwt_lookup ioctl to see if you found a route to it.

Changed 6 years ago by gdesmott

  • cc dcbw added

First, I need to know more about this anycast address.

I read nm-device-802-11-mesh-olpc.c from NetworkManager and discovered the following things. Please tell me if I'm wrong.

It defines 2 anycast addresses: school_mpp_anycast and xo_mpp_anycast. I guess that's the first one we are looking for, right?

The default address for school_mpp_anycast is c0:27:c0:27:c0:00. It can be overridden with this file: SYSCONFDIR "/NetworkManager/anycast.conf". I don't have it on my XOs; is it deployment specific?

Could I use the same algorithm to find the anycast address in the presence-service?

Changed 6 years ago by dcbw

The idea was that the anycast addresses could change if the local deployment had a need for them to change. The problem is that you don't necessarily know what the anycast address for the school server will be at the start, unless you want to parse SYSCONFDIR "/NetworkManager/anycast.conf", which seems pretty ugly.

A completely different solution would be to check if msh0 has a 169.254.*.* LL IPv4 address, and if _not_, stop doing any mDNS. NetworkManager will only fall back to a non-routable (well, LL) IPv4 address on the mesh interface when there is no school server found, which implies that the anycast stuff already failed. So a LL IPv4 address means "do mDNS", while a non-LL address means "don't do mDNS". With this solution you wouldn't have to do anything except run an SIOCGIFCONF ioctl, which is a *boatload* less code for you to write and a lot less stuff to go wrong. If you want to get really slick, you could either listen to D-Bus events from NM, or listen to netlink interface change events like Avahi does to find out when the interface's address changes, and when the interface goes up and down. While netlink isn't simple, it's pretty straightforward packet-parsing code once it's tied in with your event loop to read the netlink socket.
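A minimal sketch of that LL check in Python, assuming the msh0 address has already been obtained somehow (SIOCGIFCONF, NM over D-Bus, or netlink; none of that is shown here). The function names are made up for illustration and are not taken from the presence service.

```python
import ipaddress


def is_link_local(addr: str) -> bool:
    """Return True if addr is an IPv4 link-local address (169.254.0.0/16)."""
    return ipaddress.IPv4Address(addr).is_link_local


def should_run_mdns(mesh_addr: str) -> bool:
    """Heuristic from the comment above (hypothetical helper).

    LL address on the mesh interface -> "do mDNS";
    non-LL address                   -> "don't do mDNS".
    """
    return is_link_local(mesh_addr)
```

For example, should_run_mdns("169.254.10.20") is true, while should_run_mdns("172.18.11.250") is false.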

Changed 6 years ago by gdesmott

When my XOs are connected to my AP, the msh0 interface is still up and has an LL IP address. Is there a difference between an XO connected to a school server and one connected to an AP?

Changed 6 years ago by mbletsas

Replying to gdesmott:

First, I need to know more about this anycast address.

The anycast address is there to provide a mechanism to select the instance of a "standard" service that is "closer" (is accessible via the lowest cost path) to the mesh node looking for that service.

The idea is that all "standard" XO services (the only one currently in use is MPP) listen to a predefined anycast address. So when a node needs to find out whether there is an instance of the required service running on the mesh, it just tries to contact that service at the predefined anycast address.

How this is done is still a matter of programming convenience at this point. In the original implementation of the idea, we had all MPPs run a small daemon at a predefined IP. Then at the nodes looking for the MPP, we had a static ARP mapping between the predefined MAC address and the predefined IP. Thus, a simple ping to the predefined IP would a) reveal whether there is an MPP on the network and b) discover the lowest cost path to it (if there was more than one).

Then contacting a simple python server listening on the predefined IP would reveal the real IP address of the MPP as well as the DNS server info.

To do the same with the current incarnation of the school server MPP or XO MPP (they differ in that the former assigns the IP address via DHCP whereas the latter only serves GW and DNS info), we send the DHCP request to the anycast MAC address (as opposed to the broadcast one). In that manner, we ensure that it is the "closest" MPP that responds.

The same principle is applicable to every "standard" service that mesh nodes might want to offer to their peers. The presence services and its future ad-hoc supernodes are a prime example.

The original idea was to use just one anycast address and then return the higher level information via a "portmapper/rpc" type of info server running on the machine.

It doesn't take much thinking to see the obvious issues with that approach (multiple step process, lots of traffic), so the idea to have a range of configurable MAC addresses that are statically mapped to pre-defined services (that the node might be listening to, beyond its own address and the standard multicast/broadcast addresses) was adopted.

The way to set which of those anycast addresses are active is described in this post:

M

Changed 6 years ago by robot101

Thanks Dan, this makes things significantly easier if we can run with this method. Presence service already listens to Network Manager for the IPv4 address appearing as a way of determining whether we're online and whether it's even worth trying to connect on XMPP. So, we can just wait for this address to appear and inspect it to decide whether we kick salut off, or decide to go into XMPP-only mode.

Two remaining questions:

1) How do we distinguish between an address on msh0 obtained from a school server and one obtained from an XO mesh portal? Can NM export a property to give us a hint? Or does DHCP over msh0 /always/ mean School Server?

2) What happens to this heuristic if we turn off IPv4 and go IPv6 only?

Changed 6 years ago by dcbw

guillaume: when connected to an AP, no anycast mesh discovery is done (because the mesh interface is not the primary interface, and therefore your default route doesn't go through the mesh but through the AP instead). I'm not sure if people want mDNS in this case or not.

robot101: In reality, the PS itself is the only thing that knows when you're connected or not to the school server. So the PS is definitely in a position to decide whether or not to turn off Salut. I think the logical place for this functionality is probably in the PS itself.

There's a small race: if the Gabble plugin can't contact the server, what do we do about mDNS? You may want to keep Salut/mDNS enabled until Gabble connects, then turn it off. If the Gabble connection drops, keep a timer around for ~1m or ~2m or something and if Gabble doesn't reconnect within that time, start up Salut.

Changed 6 years ago by jg

OK, so far, so good. How do we handle the case of a classroom of students arriving in the morning and unsuspending?

An address may be left over on the mesh interface, and therefore we make the wrong choice.

It seems to me that either OHM, NM or the presence service should check for a school server if the machine has been suspended for a while (a while being open for discussion....).

Changed 6 years ago by morgs

  • cc cjb added

Changed 6 years ago by gdesmott

So basically what we have to do is keep the same policy as currently (turn Salut off when Gabble gets connected) and add the following rule:

if msh0 got a non-LL IP:
    turn off Salut
    if Gabble isn't connected in 2 minutes:
        restart Salut

Is that right?
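To make the rule concrete, here is a rough Python sketch of it as a small state holder. This is only an illustration under assumptions: the class and method names are hypothetical, time is passed in as plain seconds so the logic can be exercised without a main loop, and it is not the code from the attached patch.

```python
import ipaddress

GABBLE_GRACE_SECONDS = 120  # the "2 minutes" from the rule above


class SalutPolicy:
    """Hypothetical helper that tracks whether Salut should be running."""

    def __init__(self):
        self.salut_enabled = True      # current policy: Salut runs by default
        self.gabble_connected = False
        self._deadline = None          # when to restart Salut if Gabble is still down

    def on_mesh_address(self, addr, now):
        """msh0 got an address; a non-LL one turns Salut off and arms the timer."""
        if not ipaddress.IPv4Address(addr).is_link_local:
            self.salut_enabled = False
            if not self.gabble_connected:
                self._deadline = now + GABBLE_GRACE_SECONDS

    def on_gabble_connected(self):
        """Existing policy: Gabble takes precedence, so Salut stays off."""
        self.gabble_connected = True
        self.salut_enabled = False
        self._deadline = None

    def tick(self, now):
        """Call periodically; restarts Salut once the grace period has expired."""
        if (self._deadline is not None and now >= self._deadline
                and not self.gabble_connected):
            self.salut_enabled = True
            self._deadline = None
```

In a real service the tick() call would presumably be driven by a main-loop timeout rather than polled by hand.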

Changed 6 years ago by dcbw

Yeah, seems good.

Changed 6 years ago by jg

  • milestone changed from Never Assigned to Update.1

And what is causing the mesh to get a non-LL IP after a sleep (children arriving in the morning, and opening their laptop or pressing the power button)?

Changed 6 years ago by gdesmott

  • keywords Update.1?, review? added; Update.1? removed

Attached a fix proposal for this bug. Should be reviewed by Morgs soon. See my git branch for smaller/easier to review patches.

I didn't fully test this fix as I don't have access to a school server.

Changed 6 years ago by mstone

  • cc mstone added

Changed 6 years ago by morgs

  • keywords review+ added; review? removed

Changed 6 years ago by morgs

Tested in jhbuild and by directly patching PS on XOs on build 691 (to the extent that I can test this without an actual schoolserver).

sugar-presence-service-0.75.2-1.olpc2 currently in koji for the next joyride.

Changed 6 years ago by morgs

Pippy will fail to launch while salut and gabble are both off: #6475.

Changed 6 years ago by morgs

  • cc wad added

joyride-1701 has sugar-presence-service-0.75.2-1.olpc2 and the fixed Pippy.

We need someone at 1cc to test whether that build does indeed notice the schoolserver and keep salut off for 2 minutes while it tries to connect to the jabber server. You would need some laptops on salut on the appropriate mesh channel so you can see their absence running this build, and you would need a schoolserver running the jabber server. Also, try it with the jabber server not running so you can verify the salut buddies come back after 2 minutes (the time we disable salut).

Changed 6 years ago by daf

I'm not sure the two minute timeout is a good idea. The problem is that if the jabber server goes down for a couple of minutes, everybody will start up Salut and then not switch back to Gabble.

Changed 6 years ago by morgs

I see Dennis tagged this into Update.1:

On Wed, Feb 27, 2008 at 11:30 PM, Koji Build System <buildsys@…> wrote:

Package: sugar-presence-service NVR: sugar-presence-service-0.75.2-1.olpc2 User: ausil Status: complete Tag Operation: tagged Into Tag: olpc2-update1 sugar-presence-service-0.75.2-1.olpc2 successfully tagged into olpc2-update1 by ausil

Was this actually tested with an actual schoolserver yet?

Changed 6 years ago by daf

Not yet; we are planning to do it soon. We will post the results of our testing here.

Changed 6 years ago by morgs

  • cc dgilmore added

Dennis, why was this (sugar-presence-service.noarch 0:0.75.2-1.olpc2) tagged into Update.1, when (a) it wasn't tested yet with an actual schoolserver, (b) it breaks Pippy unless Pippy-19 is included (#6475) which wasn't, and (c) it was never approved for update?

Anyway, as a result, this is now in Update.1-695 so we need to either revert or test.

Changed 6 years ago by dgilmore

morgs it was tagged into update.1 because i was told to do so for the testing that is going on in 1cc this week. the build was also intentionally not announced for the same reasons.

Changed 6 years ago by cjb

  • owner changed from Collabora to dgilmore

Dennis: we'll still need to make a new build with Pippy-19 in before anyone can deploy these builds, so please kick off a new build including Pippy-19 (in case Wad and Walter need to use our latest Update.1 code in Peru).

Changed 6 years ago by daf

Perhaps a better approach than a timeout would be to enable Salut after Gabble has tried and failed to connect 3 times.
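A small sketch of what counting failures instead of using a timeout could look like, in Python. The class name and the limit constant are hypothetical and only illustrate the idea; nothing here is taken from the presence service.

```python
MAX_GABBLE_FAILURES = 3  # the "3 times" suggested above


class FailureCounter:
    """Hypothetical helper: decide when to fall back to Salut."""

    def __init__(self, limit=MAX_GABBLE_FAILURES):
        self.limit = limit
        self.failures = 0

    def record_failure(self) -> bool:
        """Count one failed Gabble attempt; return True once Salut should be enabled."""
        self.failures += 1
        return self.failures >= self.limit

    def record_success(self) -> None:
        """A successful connection resets the count."""
        self.failures = 0
```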

Changed 6 years ago by morgs

  • cc Collabora added

Changed 6 years ago by gdesmott

The backoff timeout is reset (to 5 seconds) when the IP changes. So, if the Gabble connection fails, it will wait 5 seconds, then 10 seconds, 20 seconds... before the next attempt.

Maybe that's enough (I guess that depends on the length of each connection attempt).
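For reference, the doubling schedule described here can be sketched in Python as below. The function name and parameters are made up for illustration, and no upper cap is modelled since none is mentioned in this ticket.

```python
from itertools import islice


def backoff_delays(initial=5, factor=2):
    """Yield the wait (in seconds) before each retry: 5, 10, 20, ..."""
    delay = initial
    while True:
        yield delay
        delay *= factor


# Waits before the first three retries, and the total time spent waiting
# between three consecutive attempts (two gaps).
first_three = list(islice(backoff_delays(), 3))
wait_between_three_attempts = sum(islice(backoff_delays(), 2))
```

With this schedule, three failed attempts in a row are separated by 5 + 10 = 15 seconds of waiting, plus however long each connection attempt itself takes.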

Changed 6 years ago by morgs

  • owner changed from dgilmore to ApprovalForUpdate

cscott wants to get this in another Update.1 build for testing at 1cc.

Please approve for dgilmore to tag in sugar-presence-service.noarch 0:0.75.2-1.olpc2.

Changed 6 years ago by kimquirk

  • owner changed from ApprovalForUpdate to dgilmore

Approved.

Changed 6 years ago by dgilmore

  • owner changed from dgilmore to robot101

Please Test Build 699

Changed 6 years ago by morgs

  • owner changed from robot101 to kimquirk

Kim, we need someone at 1CC with access to a schoolserver to test this fix. None of us at Collabora have access to a schoolserver. I've asked several times on the devel list but I don't know who to assign this to.

Please assign to somebody - this is now in Update.1 (again) despite it never having been tested.

Here are my testing instructions from a month ago:

"We need someone at 1cc to test whether that build does indeed notice the schoolserver and keep salut off for 2 minutes while it tries to connect to the jabber server. You would need some laptops on salut on the appropriate mesh channel so you can see their absence running this build, and you would need a schoolserver running the jabber server. Also, try it with the jabber server not running so you can verify the salut buddies come back after 2 minutes (the time we disable salut)."

We need at least that tested so we know if the fix does what is intended. Then we need this tested on a larger testbed so we know if it makes a significant difference to the laptops trying to connect to jabber. However the initial small-scale test is needed urgently.

Changed 6 years ago by wad

The information needed to verify this is almost available at: http://wiki.laptop.org/go/Collaboration_Network_Testbed#Test_0317E

This is a trace of 29 registered laptops turning on with a school server present. There is still mDNS traffic from each laptop, but very little.

Contrast this with: http://wiki.laptop.org/go/Collaboration_Network_Testbed#Test_0317A where there was no school server, and each laptop generates much more Salut traffic.

Can Collabora check these packet traces to verify?

I'm planning to get a deeper packet trace this afternoon, which will allow us to inspect the contents of the mDNS traffic in case 0317E.

Changed 6 years ago by Blaketh

  • keywords review? added

Changed 6 years ago by Blaketh

  • keywords release? added; review? removed

Changed 6 years ago by morgs

wad's tests do include some presenceservice logs. Many of them show an IP address of 169.x.x.x and do not therefore "detect" the school server - that's a DHCP issue out of the scope of this fix.

However, http://xs-dev.laptop.org/mesh/test0323/0323Bx52logs/logs/1206404683/presenceservice.log shows a log where the IP address is 172.18.11.250, we do see "DEBUG s-p-s.linklocal_plugin: Connected to a school server. Disable Salut" and "DEBUG s-p-s.presenceservice: Gabble takes precedence, disconnect Salut" - so this fix is working.

I think that's sufficient to close this now, any comments?

Changed 6 years ago by morgs

  • keywords Update.1?, review+ removed
  • owner changed from kimquirk to Collabora
  • verified set

Guillaume concurs, so closing.

Changed 6 years ago by morgs

  • status changed from new to closed
  • resolution set to fixed