Ticket #5313 (new defect)

Opened 7 years ago

Last modified 6 years ago

Ejabberd falls over

Reported by: daf
Owned by: Collabora
Priority: blocker
Milestone: 8.2.0 (was Update.2)
Component: telepathy-other
Version:
Keywords:
Cc: gdesmott, jg, kimquirk, gregorio, wad, martin.langhoff
Action Needed: communicate
Verified: no
Deployments affected:
Blocked By:
Blocking:

Description (last modified by gregorio)

We use Ejabberd as our XMPP (Jabber) implementation for collaboration when an access point is present. Ejabberd keeps running out of memory when running on jabber.laptop.org. This needs to be fixed before use of the XMPP server, which was disabled for Ship.2, is re-enabled for Update.1.

Change History

Changed 7 years ago by robot101

  • status changed from new to assigned

I'm working with Aleksey Schepin (ejabberd developer) to track this down. He should be able to log in and inspect the running server on jabber.laptop.org tomorrow (Wednesday 5th).

Changed 7 years ago by gdesmott

  • cc gdesmott added

Changed 7 years ago by robot101

  • cc jg, kimquirk added

He seems to have decided that the problem is that our shared roster is too large, and that this simply results in too many <presence> messages being generated internally for mod_pubsub to handle in a reasonable time, so it runs out of memory due to too many unhandled messages after a certain number of users log in in quick succession. He says the best thing for us to do is make smaller shared roster groups.

He suggested two options: a) a shared roster that would automatically show only online people, but given the scalability target of 5000 users in a large school, this would still reach the roster sizes we have at jabber.laptop.org; or b) groups based on some organisational unit at the school, such as class or year. I don't know whether such information is actually stored on the school server at this juncture, and even if it were, I don't think it would be practical to hook it up to the jabber server on the Update.1 timescale. Given these constraints, he's going to look at writing some patches for us so that we can have the "Nearby" and "Random" groups as suggested in #5311, and hopefully kill two birds with one stone.
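As a rough illustration of why smaller groups help, here is a back-of-the-envelope model in Python. It is only a sketch under a simplifying assumption that has not been measured against ejabberd: each login is taken to cause one <presence> stanza in each direction between the new user and every member of the same roster group who is already online. The class size of 30 and both function names are hypothetical.

```python
def presence_stanzas(group_sizes):
    """Total <presence> deliveries if every member of every group logs in.

    Simplified model (an assumption, not measured ejabberd behaviour):
    when a user logs in, one stanza goes from them to each group member
    already online, and one comes back from each of those members.
    """
    total = 0
    for size in group_sizes:
        # The k-th login exchanges 2 * (k - 1) stanzas; summed over a
        # whole group of `size` users this comes to size * (size - 1).
        total += size * (size - 1)
    return total


def split_into_groups(users, group_size):
    """Sizes of the groups when `users` people are split into chunks."""
    full, rest = divmod(users, group_size)
    return [group_size] * full + ([rest] if rest else [])


if __name__ == "__main__":
    users = 5000      # scalability target mentioned in this comment
    class_size = 30   # hypothetical class size, for illustration only
    one_group = presence_stanzas([users])
    by_class = presence_stanzas(split_into_groups(users, class_size))
    print(f"single shared roster: {one_group:,} stanzas")
    print(f"groups of {class_size}: {by_class:,} stanzas")
```

Under this model the stanza count grows with the square of the group size, which is why splitting one large roster into many small groups reduces the load far more than the drop in roster size alone would suggest.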

In the meantime, he suggested trying the trunk version (#5315), which I am now doing to see if it helps, given that the PEP functionality has undergone some improvements between the patch we were running and current trunk.

Changed 7 years ago by robot101

It still fell over overnight, but Christophe Romain upstream said we could enable some new more efficient PEP code which they just (after I made the RPMs last night :D) enabled by default. I made the appropriate configuration change, and am now monitoring the situation.

Changed 7 years ago by robot101

Nope... fell over again.

Changed 7 years ago by robot101

I've applied a hack to jabber.laptop.org which changes the shared roster module so it only returns users active in the last 7 days (~1500 people) instead of all of them (~6000). This should reduce the memory spiking until we get some more input from the ejabberd guys or a fix for #5311.
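The actual change lives in ejabberd's shared roster module and is not shown in this ticket. The idea of the filter can be sketched in Python as follows; the `last_seen` mapping and the function name are hypothetical and do not reflect how ejabberd stores activity data.

```python
from datetime import datetime, timedelta


def recently_active(last_seen, now, max_age=timedelta(days=7)):
    """Return the users whose last activity falls within `max_age` of `now`.

    `last_seen` maps a user name to the datetime of their last activity.
    This data structure is hypothetical, not ejabberd's own storage.
    """
    cutoff = now - max_age
    return sorted(user for user, seen in last_seen.items() if seen >= cutoff)


if __name__ == "__main__":
    now = datetime(2008, 1, 15, 12, 0)  # arbitrary example timestamp
    last_seen = {
        "alice": now - timedelta(days=1),
        "bob": now - timedelta(days=6, hours=23),
        "carol": now - timedelta(days=30),
    }
    # Only users seen within the last 7 days stay in the shared roster.
    print(recently_active(last_seen, now))
```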

Changed 7 years ago by daf

  • blockedby 5934 added

Changed 7 years ago by mbletsas

  • priority changed from normal to blocker
  • blockedby 5934 removed

Changed 6 years ago by gdesmott

  • owner changed from robot101 to Collabora
  • status changed from assigned to new

Changed 6 years ago by gregorio

  • next_action set to never set
  • description modified (diff)

Changed 6 years ago by gregorio

  • cc gregorio added

Please test this and tell us what the limit is in terms of XOs which are on and associated with the server. A conservative and working number is what we need to start with.

Let me know if you have any questions or I did not correctly represent our triage discussions.

Thanks,

Greg S

Changed 6 years ago by mstone

  • cc wad, martin.langhoff added
  • next_action changed from never set to communicate

Sjoerd asked: "what hardware should we be testing the XS software with?"

There were two replies:

"What have you currently got?"

and

"It would be good to use the same hardware that was used in the CNT."

Unfortunately, Wad did not indicate what hardware he used on that page. Wad?

Relatedly, what test plans should Collabora help to execute? Those already listed on the CNT page?

Changed 6 years ago by gregorio

Hi Collabora team,

Did we make any progress testing this?

We need progress on this one soon. Birmingham is about to roll out a lot of XOs and they want to know how scalable the ejabberd server is...

Thanks,

Greg S

Changed 6 years ago by gdesmott

I updated the XS ejabberd package: http://git.collabora.co.uk/?p=user/cassidy/ejabberd-rpm;a=shortlog;h=refs/heads/XS

Daf and Sjoerd are working on tools to help us with the tests.

Changed 6 years ago by martin.langhoff

Guillaume -

where are the testing tools? I'd like to be able to test ejabberd and force the memory spike problems myself...

Changed 6 years ago by daf

Hyperactivity is on dev.laptop.org. I recommend using my branch for now as it has a bug fix that Guillaume's doesn't:

http://dev.laptop.org/git?p=users/daf/hyperactivity;a=summary
