Opened 7 years ago

Last modified 6 years ago

#6884 new defect

Incorrect number of laptops shown in neighborhood view

Reported by: wad Owned by: Collabora
Priority: high Milestone: 8.2.0 (was Update.2)
Component: presence-service Version:
Keywords: Cc: Collabora, martin.langhoff, mstone, sascha_silbe
Blocked By: Blocking:
Deployments affected: Action Needed: diagnose
Verified: yes

Description

Using normal WiFi through an access point to connect to a school server, a number of laptops running gabble and connected to the school presence service had different numbers of users displayed on their screens.

For example, in a recent test three adjacent laptops showed 44, 36, and 23 laptops in their respective neighborhood views. The correct number (according to both manual calculation and the ejabberd server) was 45.

We are not referring to the case where NO other laptops are shown on the screen (#6883), although it may be related.

Laptops running 703 build, school server running build 160.

Logs and packet traces, as well as more description of the test setup, can be found at:
http://wiki.laptop.org/go/Collab_Network_School_Wifi_Tests#Test_0410C

Change History (19)

comment:1 Changed 7 years ago by gdesmott

Are you sure all these laptops were properly connected to the jabber server (and so were running Gabble and not Salut)?
A good first step to help us debug this would be to check whether there are differences between the laptops displayed in the mesh view and the buddy list from the PS debugger (from the Analyze activity).
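The comparison suggested above amounts to a set difference between the two lists. A minimal sketch, assuming both lists have already been copied out by hand as plain nicknames (the function name and the sample nicknames are hypothetical, not part of Sugar or the PS):

```python
def compare_buddy_lists(mesh_view, ps_buddies):
    """Return (only_in_mesh, only_in_ps) as sorted lists of names."""
    mesh = set(mesh_view)
    ps = set(ps_buddies)
    return sorted(mesh - ps), sorted(ps - mesh)


# Hypothetical nicknames, for illustration only.
mesh_view = ["alice", "bob", "carol"]
ps_buddies = ["alice", "bob", "carol", "dave"]

only_in_mesh, only_in_ps = compare_buddy_lists(mesh_view, ps_buddies)
print("Shown in the mesh view but unknown to the PS:", only_in_mesh)
print("Known to the PS but missing from the mesh view:", only_in_ps)
```

If both results are empty, the mesh view is faithfully showing what the PS knows, and the missing buddies would have to be lost before they reach the PS.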

It could also be a problem on the Jabber server. Which "groups" are populated on it? There was a plan to add "Online" and "Random" groups, but I don't know whether these patches were applied.
Daf: Can you tell us more about this?

comment:2 Changed 7 years ago by gdesmott

I wrote a simple script that lists all the buddies known by the PS. It could help to debug this kind of bug. See #6918

comment:3 Changed 7 years ago by gdesmott

  • Cc Collabora added; olpc@… removed

I reproduced this kind of bug using sugar-jhbuild and hyperactivity, connecting 30 accounts. According to the logs, the Jabber server doesn't send us presence for many of the connected accounts (but it does send it to other XOs). So that's definitely an ejabberd bug. It would be good if someone from P1 could take a look at it.

comment:4 Changed 6 years ago by marco

  • Keywords 8.2.0:? added
  • Milestone changed from Never Assigned to 8.2.0 (was Update.2)

comment:5 Changed 6 years ago by marco

  • Action Needed set to diagnose
  • Keywords 8.2.0:? removed

comment:6 Changed 6 years ago by gregorio

  • Priority changed from normal to high

comment:7 Changed 6 years ago by martinlanghoff

  • Cc martin.langhoff added

Collabora - Any news on this?

comment:8 Changed 6 years ago by gdesmott

  • Cc mstone added

As I said, this problem is really an ejabberd bug. I requested help, but no one seems to know whether there are still ejabberd developers working with us.

comment:9 Changed 6 years ago by mstone

When this ticket was opened, the XS was based on ejabberd-1.x. F-7 now ships ejabberd-2.0.1 by default. (I can't find the package installation logs for the XS livecds, so I'm not sure which version of ejabberd they're going to install by default.)

Question 1: Has the version of ejabberd in the XS changed since this was last tested? (If yes, then we need to retest this. Packet traces should be collected.)

Question 2: Is ejabberd sending plausible results out through the ether? (If we lack adequate data to answer this question, then more data must be collected.)

Question 3: Is telepathy-gabble receiving enough results or not? Is it discarding any results that it receives?

Question 4: If ejabberd is faulty, what can be done to debug the issue? In particular, is an Erlang expert (from Process One or otherwise) available to help? If not, should we train someone to speak Erlang?

Question 5: Is ejabberd at fault?

comment:10 Changed 6 years ago by martin.langhoff

Hi Michael,

All the recent builds (from before I even joined) have ejabberd-2.0.0 with local patches. We are getting reports from the field with recent builds that fit this bug description to a T.

ISTR Wad mentioning that ejabberd was sending the wrong listing over the ether when he traced it, but we need to confirm that with him.

comment:11 Changed 6 years ago by gdesmott

Martin: If you could attach gabble logs and/or tcpdump that would be really useful.

comment:12 follow-up: Changed 6 years ago by robot101

This bug seems veeeeeeeeeery familiar. We had it very early on during our development cycle when we used the @all@ shared roster group, meaning every single registered user on the server. The reason was an interaction between in-band registration, and shared rosters. Each client thread inside ejabberd filled their roster with a /copy/ of all the registered users when they connected, but didn't get this list updated when new people registered on the server. So even though the newly-registered user had the currently signed-on people on their list, the client thread of the existing user doesn't think the new user is eligible to see the current user's presence, because they're not on their old copy of the list, so doesn't reply to the presence probe. The new user therefore saw all currently-connected users as offline, until the currently connected users signed out and in again. The fix was a little hack from one of the ejabberd guys to push out a new roster item to all applicable shared roster group members when a new user was registered.
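To make the stale-copy interaction concrete, here is a toy model of it. It is only a sketch of the behaviour described above, not ejabberd code: the class, its methods, and the user names "old" and "new" are all hypothetical.

```python
class ToyServer:
    """Toy model of the described behaviour; this is not ejabberd code."""

    def __init__(self, push_on_register):
        self.registered = set()
        self.rosters = {}  # connected user -> that session's roster copy
        self.push_on_register = push_on_register

    def register(self, user):
        self.registered.add(user)
        if self.push_on_register:
            # The described fix: push the new roster item to connected sessions.
            for owner, roster in self.rosters.items():
                if owner != user:
                    roster.add(user)

    def connect(self, user):
        # Each session takes a *copy* of the registered users at connect time.
        self.rosters[user] = self.registered - {user}

    def seen_online(self, user):
        # A peer answers the presence probe only if `user` is on the peer's copy.
        return sorted(
            peer for peer in self.rosters[user]
            if peer in self.rosters and user in self.rosters[peer]
        )


def scenario(push_on_register):
    server = ToyServer(push_on_register)
    server.register("old")
    server.connect("old")
    server.register("new")  # in-band registration while "old" is connected
    server.connect("new")
    before = server.seen_online("new")
    server.connect("old")   # "old" signs out and in again, refreshing its copy
    after = server.seen_online("new")
    return before, after


print("without push:", scenario(False))
print("with push:   ", scenario(True))
```

Without the push, "new" sees nobody until "old" reconnects; with the push, "old" is visible straight away.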

For the @online@ group, a similar failure would be if the people who just signed on weren't being correctly pushed onto the rosters of the already signed-on people. This means each person would see everyone who signed onto the server before them, but nobody who signed on after them. It's some time since I worked on it, but in my (inefficient) implementation of @online@, I definitely hooked the "newly connected user" and did some very crude production and updating of every other person's roster item for the newly connected user. It's possible that the Process One optimisations introduced a regression in this area.
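The @online@ failure mode sketched above can be modelled the same way. This is an illustrative simulation under the assumption that users sign on one at a time; the function and the laptop names are hypothetical and nothing here is taken from the actual patches.

```python
def visible_counts(sign_on_order, push_works):
    """Map each user to the number of other users on their roster."""
    rosters = {}
    for newcomer in sign_on_order:
        # The newcomer's roster starts with everyone who is already online.
        rosters[newcomer] = set(rosters)
        if push_works:
            # Push the newcomer onto the rosters of the already signed-on users.
            for user, roster in rosters.items():
                if user != newcomer:
                    roster.add(newcomer)
    return {user: len(roster) for user, roster in rosters.items()}


order = ["xo1", "xo2", "xo3", "xo4"]
print("push broken: ", visible_counts(order, push_works=False))
print("push working:", visible_counts(order, push_works=True))
```

If this were the only failure, the number of visible laptops would grow with sign-on order, which gives a pattern to look for in the logs.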

Or perhaps it only applies in a corner-case, where a newly-registered person does appear for currently-signed on users, but an existing registration who signs back in does not. Can I get a full list of the patches that're applied on the servers which exhibit this issue?

comment:13 in reply to: ↑ 12 Changed 6 years ago by gdesmott

Replying to robot101:

> For the @online@ group, a similar failure would be if the people who just signed on weren't being correctly pushed onto the rosters of the already signed-on people. This means each person would see everyone who signed onto the server before them, but nobody who signed on after them. It's some time since I worked on it, but in my (inefficient) implementation of @online@, I definitely hooked the "newly connected user" and did some very crude production and updating of every other person's roster item for the newly connected user. It's possible that the Process One optimisations introduced a regression in this area.

I manually rebased the recent_online_and_nearby_groups patch (which should be the latest P1 patch according to Daf) to apply on ejabberd 2.0.1 (some parts of the patch were merged upstream) and we observed exactly the issue you described with the @online@ group.

comment:15 Changed 6 years ago by wad

Regarding gabble logs of the problem, see the original ticket text; it has a link to the logs.

Regarding the patches applied to the server on which we were seeing this, it was the ejabberd-2.0.0-0.1.beta1.olpc RPM prepared by daf at collabora.

Looking at the information at the link with the logs, I assert that the missing laptops are not those that connected after a laptop obtained roster information. In those tests, laptops were turned on sequentially. The three laptops for which we wrote down seen laptops show widely varying numbers, and it doesn't seem correlated at all with the order the laptops were turned on. However, this is not conclusive.

Guillaume, does your patch merely reproduce the problem? Or does it fix it?

comment:16 Changed 6 years ago by gdesmott

Wad: With my patch we observed issues that looked very similar to the problem Robert described. We are investigating this.

comment:17 Changed 6 years ago by gdesmott

I tested four versions of the patch. The first is a combination of the online, push2 and recent patches. The second is an "all in one" patch provided by P1. The third is a slightly updated version of the second. The fourth is a manually rebased version of the third that applies on ejabberd 2.0.1 (the others were for 2.0.0 beta1).

The third patch is the one currently used by the XS, and the fourth is used in my updated ejabberd package [1].

[2] contains info about the different patches. For each patch:

  • global.patch contains all the changes of the patch
  • an HTML file contains the results of my tests
  • a diff shows the differences between the patch and the previous one

According to my tests, none of these patches works properly.
When a new user is created, it is only *sometimes* added to the rosters of connected users (this seems to happen more often with the first patch than with the others). If it is not, existing users have to reconnect to get the new account into their rosters.

When an account is in your roster, presence is *sometimes* not sent, so buddies are shown as offline while they are connected, or as online after they have disconnected. That confirms what I have seen in logs from various tickets describing similar issues.

[1] http://git.collabora.co.uk/?p=user/cassidy/ejabberd-rpm;a=shortlog;h=refs/heads/XS

[2] http://people.collabora.co.uk/~cassidy/ejabberd-patches/

comment:18 Changed 6 years ago by gdesmott

I tested ejabberd 2.0.2 beta1 and current HEAD (revision 1553) with a shared roster having @all@ as members. As this is an upstream feature, no patch was applied on ejabberd.

I noticed exactly the same problems as with our @online@ shared roster. So I'm pretty sure these bugs are upstream and not related to our patches.
I reported them upstream:

https://support.process-one.net/browse/EJAB-730

https://support.process-one.net/browse/EJAB-731

https://support.process-one.net/browse/EJAB-732

comment:19 Changed 6 years ago by sascha_silbe

  • Cc sascha_silbe added