Opened 7 years ago

Last modified 6 years ago

#6739 new defect

Activity sharing failing often with salut.

Reported by: cjb Owned by: mstone
Priority: blocker Milestone: 8.2.0 (was Update.2)
Component: telepathy-salut Version:
Keywords: release? Cc: mstone, wad, dwmw2, mbletsas, jcardona, carrano, dsaxena, mquirk
Blocked By: Blocking:
Deployments affected: Action Needed: code
Verified: no

Description

Attached are logs from the joiner and sharer of a Write activity. The joiner clicked on an "my neighborhood"-shared Write, Write launched, but the activity toolbar thought it was "Private", and the text in the sharer's Write didn't appear on the joiner's screen. Build 699, connected to an access point with Salut.

The error in the joiner's log is a duplicate of the error seen with Read yesterday, in #6722. Namely:

1206139420.698640 DEBUG sugar.presence.activity: <proxy for /org/laptop/Sugar/Presence/Activities/1 at 842a2d4>: Join failed because: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
1206139420.701489 DEBUG root: Failed to join activity: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

So, we think we have a generic sharing failure on build 699. Could someone at Collabora help us start tracking this down?

Attachments (9)

joiner-fail.tar.bz2 (37.4 KB) - added by cjb 7 years ago.
sharer-fail.tar.bz2 (47.4 KB) - added by cjb 7 years ago.
joiner-with-6576-fail.tar.bz2 (59.6 KB) - added by cjb 7 years ago.
logs.zip (75.2 KB) - added by walter 7 years ago.
logs from chat session failure
logs.CSN74400077.2008-03-28.17-49-02.tar.bz2 (44.5 KB) - added by walter 7 years ago.
this time with debugging on
w09-join.tar.bz2 (43.8 KB) - added by cjb 7 years ago.
with salut 0.2.3-2
w09-join.tcpdump-wireshark.log (361.9 KB) - added by cjb 7 years ago.
with salut-0.2.3-2, wireshark format
logs.CSN747024DC.2008-08-04.21-47-14.tar.bz2 (31.5 KB) - added by joe 6 years ago.
From the "behaving XO".
logs.CSN748054AC.2008-08-04.21-52-10.tar.bz2 (30.8 KB) - added by joe 6 years ago.
From "misbehaving" XO.

Download all attachments as: .zip

Change History (42)

Changed 7 years ago by cjb

Changed 7 years ago by cjb

comment:1 Changed 7 years ago by gdesmott

How much reproductible is this bug? Did you try with salut 0.2.3?

Unfortunately sharing/joining problem aren't easy to debug because of #6576. Maybe you could apply the patch attached to #6576 to get a more helpful error message?

comment:2 Changed 7 years ago by gdesmott

Wasn't able to reproduce using 699 and 2 XO's connected to my AP.

comment:3 Changed 7 years ago by cjb

  • Keywords release? added

comment:4 Changed 7 years ago by cjb

Reproduced with the patch from #6576, on build 702.

I don't see anything new in the ReadActivity-2.log -- it has the same "Join failed" messages -- but maybe a different log has an error. Logfiles attached.

Changed 7 years ago by cjb

comment:5 Changed 7 years ago by cjb

morgs -- looking at the patch, it looks like there should be something else attached to the "Join failed" line, but there isn't. We see:

1206574384.541851 DEBUG sugar.presence.activity: <proxy for /org/laptop/Sugar/Pr
esence/Activities/2 at 86dd89c>: Join failed because: org.freedesktop.DBus.Error
.NoReply: Did not receive a reply. Possible causes include: the remote applicati
on did not send a reply, the message bus security policy blocked the reply, the 
reply timeout expired, or the network connection was broken.
1206574384.544743 DEBUG root: Failed to join activity: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

I checked that I have the new 0.79.2-1.olpc2 presence service RPM, with rpm -q.

comment:6 Changed 7 years ago by gdesmott

That's weird. Are you sure the patch from #6576 is really applied in this PS version?

comment:7 Changed 7 years ago by marco

From the comments on #6576 I'd guess that the patch is in Joyride but *not* in Update.1.

comment:8 Changed 7 years ago by marco

Oh sorry, missed cjb comment about having installed 0.79

comment:9 Changed 7 years ago by cjb

The presence service isn't where our error is coming from, which is why the patch doesn't help.

The presence service log actually seems to say the join succeeded. The activity log doesn't. One of the places our join failure in the activity log is coming from is sugar/activity/activity.py:_internal_joined_cb().

comment:10 Changed 7 years ago by cjb

The most suspicious thing I see in the presenceservice log, as opposed to the activity log, is:

1206574420.899037 DEBUG s-p-s.activity: ... set([<ps.buddy.Buddy cjb2>]) claimed to be here but are not
...
1206574420.910425 DEBUG s-p-s.activity: <Activity #2 (ID 67911570c0922b1ea292b7e01f3f9bac0a67853a) at 831593c>: Removing buddies: set([<ps.buddy.Buddy cjb2>])
...
1206574420.940073 DEBUG s-p-s.activity: Join of activity <Activity #2 (ID 67911570c0922b1ea292b7e01f3f9bac0a67853a) at 831593c> succeeded

cjb2 was the sharer of the activity, and the only other person in it.

comment:11 Changed 7 years ago by gdesmott

According Salut log, you joined the muc 67911570c0922b1ea292b7e01f3f9bac0a67853a but didn't detect that cjb2 was already in it. We don't have the " ---Finished joining phase!!!!---" when you join this muc. I guess that's because muc's timeout expired and we didn't detect senders.

comment:12 Changed 7 years ago by cjb

Cool. Can we increase the timeout (and do you think it'll help)?

comment:13 Changed 7 years ago by sjoerd

It won't.. I had a look at both logs and it seems that salut jsut isn't receiving the multicast packets from the other node. Could you add a tcpdump log containing the multicast traffic too? (just from eth0 should be good enough).. That way we should be able to determine on what level this goes wrong.

Apart from this, it might be nice of the PS handles this case better, as it can always happen that the join won't succeed (the other kid went out of range for example)

comment:14 Changed 7 years ago by cjb

Thanks, Sjoerd. I'll get a tcpdump log within a few hours.

Is there any way we can retry the join if the expected number of participants aren't showing up? I can believe that we lost a multicast packet(s) at a bad time, since my understanding is that multicast isn't reliable and this was a two-laptop test. I imagine that if said packets had been retried, they'd have been very likely to get through.

comment:15 Changed 7 years ago by sjoerd

The clique protocol is designed to cope with those unreliabilities.. All nodes in the group ensure that every group sends out a session packet once in a while. When other/new nodes detect this the join process is started. What the salut logs show is that both nodes are sending out these packets, but they never receive the packets of the other node. So the join process is never started (well it obviously can't).. So you should never have to retry to the join, Salut handles this correctly for you.. Hopefully tcpdump will verify this

Part of the problem here is that the PresenceServices doesn't cope correctly in cases the joining of a muc doesn't have the expected result. Which in turn confuses the activity :(

comment:16 Changed 7 years ago by morgs

I'm building a salut rpm with more debug info for this problem - see #6782.

After discussion with Sjoerd, it's likely that the cause is a network issue as we have seen before - e.g. the multicast packets the node sends out aren't received by others - and that Presence Service is handling it correctly but the activity (or sugar.presence) is failing.

tcpdumps will hopefully show what is happening.

comment:17 Changed 7 years ago by walter

attaching log files from the Chat manifestation of the same... in the files, hernan initiated the chat. walter tried to join.

Changed 7 years ago by walter

logs from chat session failure

comment:18 Changed 7 years ago by cjb

Thanks, Walter. These logs don't have debugging turned on, so they don't contain the same string as mine, but I do think you're hitting this bug.

Changed 7 years ago by walter

this time with debugging on

comment:19 follow-up: Changed 7 years ago by cjb

Here's a log with the new telepathy-salut and tcpdump running. Same "join failed" error in the Write.log.

Changed 7 years ago by cjb

with salut 0.2.3-2

Changed 7 years ago by cjb

with salut-0.2.3-2, wireshark format

comment:20 Changed 7 years ago by cjb

  • Summary changed from Sharing failures with Write (and Read). to Activity sharing failing often with salut.

The relevant line in these logs from the new salut patch seems to be:

+ DEBUG ("Didn't receive muc senders. Timeout expired, "
+ "adding myself as member anyway");

comment:21 Changed 7 years ago by cjb

I ran ping from the joiner to the sharer while attempting to join.

-- 18.85.19.50 ping statics ---
115 packets transmitted, 110 received, 4% packet loss, time 120132ms
rtt min/avg/max/mdev = 3.470/7.933/31.953/6.275 ms

The join failed.

comment:22 follow-up: Changed 7 years ago by cjb

I see that the connection timeout before the "Didn't receive muc senders. Timeout expired, " error is sixty seconds (and have verified that it does wait that long). The ping shows that it is not true that the two laptops were out of contact for sixty seconds. How do we reconcile those data?

comment:23 in reply to: ↑ 19 Changed 7 years ago by sjoerd

Replying to cjb:

Here's a log with the new telepathy-salut and tcpdump running. Same "join failed" error in the Write.log.

Looking in the logs the multicast group for the activity is 255.71.147:

(telepathy-salut:3876): DEBUG: salut_muc_manager_request_new_muc_channel: found 239.255.71.147 port 30922 for room 8a503e058f5ebe2705d3b914729ee58bc62b6434

and the local machine had ip address 18.85.19.139:

IP eth0 : 18.85.19.139

Looking at your wireshark log and filtering on destination 255.71.147 we only see packets from the local ip.. So salut is indeed not receiving multicast packets from the other machine :(

comment:24 in reply to: ↑ 22 Changed 7 years ago by sjoerd

Replying to cjb:

I see that the connection timeout before the "Didn't receive muc senders. Timeout expired, " error is sixty seconds (and have verified that it does wait that long). The ping shows that it is not true that the two laptops were out of contact for sixty seconds. How do we reconcile those data?


Ping is unicast traffic. While Clique uses multicast.. So apparently for some reason multicast isn't getting through. Your logs only had the tcpdump for the joining side, but i'm pretty sure the sharing side would show a similar picture.. Which would mean it's a kernel/firmware issue.

comment:25 Changed 7 years ago by morgs

Talking about firmware and multicast traffic not getting through, that reminds me of #5432, which had the same symptoms, which we opened 4 months ago. Unfortunately it was not resolved then but rejected.

comment:26 Changed 7 years ago by cjb

  • Cc wad dwmw2 mbletsas jcardona carrano added

Thanks for looking at this, sjoerd/morgs.

Could someone with a lower-level handle on wireless try to reproduce? The steps are:

  • install update.1-703 on two laptops
  • associate them with an access point
  • start and share a Chat activity on one laptop
  • try to join it on the other
  • log multicast network traffic at the time, if it fails, look to see whether each laptop saw the clique packets

This procedure appears to be failing most of the time on the "media lab 802.11" APs at 1cc.

comment:27 Changed 7 years ago by wad

Just did that for a number of tests (Write, Read, Measure) with varying numbers of laptops. Results (traces and logs) posted at:
http://wiki.laptop.org/go/Collab_Network_School_Wifi_Tests
(all the t0330 tests).

Overall, I didn't see the problem. Read worked good for smaller documents, and write and measure worked fine. I will continue testing with more laptops to see what gives.

comment:28 Changed 7 years ago by wad

That URL changed (these are really Simple_Wifi tests!):
http://wiki.laptop.org/go/Collab_Network_Simple_Wifi_Tests

comment:29 Changed 6 years ago by dsaxena

  • Cc dsaxena added

comment:30 Changed 6 years ago by gregorio

  • Action Needed set to never set
  • Component changed from telepathy-salut to wireless
  • Owner changed from Collabora to dwmw2

comment:31 Changed 6 years ago by gregorio

  • Owner changed from dwmw2 to joe

Hi Joe,

Can you retest this one? Please also write a test case so that we can try it out regularly.

Let me know if you have any questions or need more info.

Thanks,

Greg S

Changed 6 years ago by joe

From the "behaving XO".

Changed 6 years ago by joe

From "misbehaving" XO.

comment:32 Changed 6 years ago by joe

  • Action Needed changed from never set to code
  • Cc mquirk added
  • Owner changed from joe to mstone

Tried to reproduce this bug while running the joyride-2230. Several combinations of two XOs that had the software update from build 708 (not a clean install) behaved differently: sometimes the sharing of the Write activity worked as expected, sometimes the "joining" machine didn't automatically switch the "Activity" from "Private" to "My Neighbourhood" state, and sometimes XOs became unresponsive and required a "hard reboot" by powering down machines.

The general observation: the unstable behavior of machines might have resulted from switching machines from a Media Lab network to a mesh network. Machines that spent the whole network just being connected to the Internet were found to be stable.

Two logs are attached (one from the "properly" behaving machine - CSN747024DC and the other - from the "problematic" machine - CSN748054AC.

comment:33 Changed 6 years ago by sjoerd

  • Component changed from wireless to telepathy-salut

Unfortunately your logs don't include logs for salut or network traces. http://wiki.laptop.org/go/Telepathy_debugging explains how to get more detailled log output for salut. For a network trace, a dump using tcpdump would be quite usefull to pinpoint where in the stack the problem lies.

Note: See TracTickets for help on using tickets.