Opened 7 years ago

Last modified 6 years ago

#5313 new defect

Ejabberd falls over

Reported by: daf Owned by: Collabora
Priority: blocker Milestone: 8.2.0 (was Update.2)
Component: telepathy-other Version:
Keywords: Cc: gdesmott, jg, kimquirk, gregorio, wad, martin.langhoff
Blocked By: Blocking:
Deployments affected: Action Needed: communicate
Verified: no

Description (last modified by gregorio)

We use Ejabberd as our XMPP (Jabber) implementation for collaboration when an access point is present. Ejabberd keeps running out of memory. This needs to be fixed before use of the XMPP server, which was disabled for Ship.2, is re-enabled for Update.1.

Change History (16)

comment:1 Changed 7 years ago by robot101

  • Status changed from new to assigned

I'm working with Aleksey Schepin (ejabberd developer) to track this down. He should be able to log in and inspect the running server tomorrow (Wednesday 5th).

comment:2 Changed 7 years ago by gdesmott

  • Cc gdesmott added

comment:3 Changed 7 years ago by robot101

  • Cc jg kimquirk added

He has concluded that the problem is that our shared roster is too large: it generates too many <presence> messages internally for mod_pubsub to handle in a reasonable time, so when enough users log in in quick succession, the backlog of unhandled messages exhausts memory. He says the best thing for us to do is make smaller shared roster groups.

He suggested either a) a shared roster that automatically shows only online people, but given the scalability target of 5000 users in a large school, this would still reach the roster sizes we have now, or b) groups based on some organisational unit at the school, such as class or year, although I don't know whether such information is actually stored on the school server at this juncture, and even if it were, I don't think it would be practical to hook it up to the Jabber server on the Update.1 timescale. Given these constraints, he's going to look at writing some patches for us so that we can have the "Nearby" and "Random" groups suggested in #5311, and hopefully kill two birds with one stone.
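The scaling argument above can be quantified with a rough back-of-the-envelope sketch (illustrative only; the stanza counts are an approximation, not measurements from the server): with one shared roster containing every user, each login broadcasts presence to every other roster member, so presence traffic grows quadratically with roster size, whereas splitting the school into smaller groups keeps the total roughly linear.

```python
def presence_stanzas(n_users: int) -> int:
    # With a single shared roster of n_users, every user exchanges
    # presence with every other member, so broadcast volume grows
    # as roughly n * (n - 1), i.e. quadratically.
    return n_users * (n_users - 1)

# Hypothetical comparison: one school-wide roster of 6000 users
# vs. 200 class-sized groups of 30 users each.
full_roster = presence_stanzas(6000)          # ~36 million stanzas
class_groups = 200 * presence_stanzas(30)     # ~174 thousand stanzas
print(full_roster, class_groups, full_roster // class_groups)
```

The ratio (a couple of hundred times less traffic) is why the upstream developer recommended smaller shared roster groups rather than tuning mod_pubsub itself.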

In the meantime, he suggested trying the trunk version (#5315), which I am now doing, since the PEP functionality has improved between the patch we were running and current trunk, to see if that helps.

comment:4 Changed 7 years ago by robot101

It still fell over overnight, but Christophe Romain upstream said we could enable some new more efficient PEP code which they just (after I made the RPMs last night :D) enabled by default. I made the appropriate configuration change, and am now monitoring the situation.

comment:5 Changed 7 years ago by robot101

Nope... fell over again.

comment:6 Changed 7 years ago by robot101

I've applied a hack which changes the shared roster module so it only returns users active in the last 7 days (~1500 people) instead of all of them (~6000). This should reduce the memory spiking until we get more input from the ejabberd developers or a fix for #5311.
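The filtering hack above can be sketched as follows. This is a minimal illustration in Python, not the actual change (which patched ejabberd's shared roster module in Erlang); the JIDs and the `last_seen` mapping are hypothetical.

```python
from datetime import datetime, timedelta

def filter_active(last_seen: dict, now: datetime, days: int = 7) -> list:
    """Return only the JIDs seen within the last `days` days,
    mimicking the shared-roster hack described above."""
    cutoff = now - timedelta(days=days)
    return [jid for jid, seen in last_seen.items() if seen >= cutoff]

# Hypothetical last-activity data for two users:
now = datetime(2008, 1, 15)
seen = {
    "alice@school.example": datetime(2008, 1, 14),  # active yesterday
    "bob@school.example": datetime(2007, 12, 1),    # idle for weeks
}
print(filter_active(seen, now))  # only alice survives the 7-day cutoff
```

Cutting the returned roster from ~6000 to ~1500 entries shrinks the quadratic presence fan-out by roughly a factor of sixteen, which is why it was expected to tame the memory spikes.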

comment:7 Changed 7 years ago by daf

  • Blocked By 5934 added

comment:8 Changed 7 years ago by mbletsas

  • Blocked By 5934 removed
  • Priority changed from normal to blocker

comment:9 Changed 6 years ago by gdesmott

  • Owner changed from robot101 to Collabora
  • Status changed from assigned to new

comment:10 Changed 6 years ago by gregorio

  • Action Needed set to never set
  • Description modified (diff)

comment:11 Changed 6 years ago by gregorio

  • Cc gregorio added

Please test this and tell us what the limit is in terms of XOs which are on and associated with the server. A conservative, working number is what we need to start with.

Let me know if you have any questions or I did not correctly represent our triage discussions.


Greg S

comment:12 Changed 6 years ago by mstone

  • Action Needed changed from never set to communicate
  • Cc wad martin.langhoff added

Sjoerd asked: "what hardware should we be testing the XS software with?"

There were two replies:

"What have you currently got?"


"It would be good to use the same hardware that was used in the CNT."

Unfortunately, Wad did not indicate what hardware he used on that page. Wad?

Relatedly, what test plans should Collabora help to execute? Those already listed on the CNT page?

comment:13 Changed 6 years ago by gregorio

Hi Collabora team,

Did we make any progress testing this?

We need progress on this one soon. Birmingham is about to roll out a lot of XOs and they want to know how scalable the ejabberd server is...


Greg S

comment:14 Changed 6 years ago by gdesmott

I updated the XS ejabberd package:;a=shortlog;h=refs/heads/XS

Daf and Sjoerd are working on tools to help us for the tests.

comment:15 Changed 6 years ago by martin.langhoff

Guillaume -

where are the testing tools? I'd like to be able to test ejabberd and force the memory spike problems myself...

comment:16 Changed 6 years ago by daf

Hyperactivity is on. I recommend using my branch for now, as it has a bug fix that Guillaume's doesn't:;a=summary

Note: See TracTickets for help on using tickets.