Ticket #4193 (closed defect: fixed)

Opened 7 years ago

Last modified 7 years ago

Two XOs were connected to an access point and were still running salut

Reported by: yani Owned by: dgilmore
Priority: blocker Milestone: Update.1
Component: telepathy-other Version:
Keywords: Cc: morgs, smcv, daf, robot101, kimquirk, gdesmott, cscott, jg
Action Needed: Verified: no
Deployments affected: Blocked By:
Blocking:

Description

616 q2c28 logs:messages.APsalut

messages.APsalut2

Both XO were connected to MediaLab AP, and showed running salut. The _presence._tcp entries were the same as the ones displayed in the neighbor view, and were 18.85.x, 169.254.x addrs. It was not connected to any jabber server(checked with netstat)

Also,one of the two(messages.APsalut) had an interseting additional bug: The circles were all dissapeared from the neighbour view. The mesh-start existed, but "cat mesh-start" gave "file doesnt exist". Maybe it was corrupt for some reason

Attachments

messages.APsalut Download (126.2 KB) - added by yani 7 years ago.
messages.APsalut2 Download (92.5 KB) - added by yani 7 years ago.
logs.gabble.zip Download (123.5 KB) - added by yani 7 years ago.
netlog.20071102.23.06.51.tgz Download (16.5 KB) - added by yani 7 years ago.
loudmouth-reconnect-failure.diff Download (1.5 KB) - added by robot101 7 years ago.
fix GSource leak and resulting SEGV in Loudmouth

Change History

Changed 7 years ago by yani

Changed 7 years ago by yani

  Changed 7 years ago by kimquirk

  • owner changed from dcbw to simon
  • component changed from network manager to telepathy-gabble

  Changed 7 years ago by kimquirk

  • priority changed from normal to high

  Changed 7 years ago by smcv

  • owner changed from simon to smcv

"simon" is not me. Reassigning

follow-up: ↓ 7   Changed 7 years ago by smcv

  • owner changed from smcv to morgs
  • component changed from telepathy-gabble to presence-service

Yani, can you get me the Presence Service, telepathy-salut and telepathy-gabble logs for the offending XOs? < http://wiki.laptop.org/go/OLPC_Trial-2_Software_Release_Notes#Enabling_logging>

I think this is more likely to be a PS bug than Gabble or Salut, but we can't really tell without logs.

  Changed 7 years ago by smcv

  • cc morgs added
  • owner changed from morgs to smcv

*grumbles about implicit reassignment*

  Changed 7 years ago by jg

  • milestone changed from Never Assigned to First Deployment, V1.0

Changed 7 years ago by yani

in reply to: ↑ 4   Changed 7 years ago by yani

i have upoloaded 3 folders for 3 different cases, names as gabble,gabble2,gabble3

gabble:school server + salut gabbl2:AP + salut gabbl3:school server + salut(gabble was also running simultaneously[seen with ps], but the neighview+buddy list showed salut)

they include messages preservice log gabble log sault log info:build,fw netsatus:network connections info

  Changed 7 years ago by smcv

I think you've labelled cases 1 and 2 backwards - case 1 looks more like an AP - but they're basically the same anyway.

Case 1:

* PS received an IP address change event from NM (a Media Lab 18.x address) * PS asked Gabble to connect to the server (Connect() was called and succeeded - this means the connection attempt is in progress and will proceed asynchronously) * Loudmouth (the XMPP library Gabble uses) failed to connect with IPv6, retried using IPv4, and succeeded (18.85.46.41 port 5222) * At the point when the logs were taken, Gabble had sent the XML headers for the beginning of the XMPP stream to the server, but had not yet received the equivalent headers from the server * As a result, Gabble never told PS that it had finished connecting (side note: when Telepathy connection managers signal status CONNECTED, it actually means they've done some initial setup on the new connection too, so that all of our API ready to use)

Diagnosis: wait a bit longer? I can't see how long you waited - we should patch Gabble to put timestamps in its log, I think there's a patch floating around somewhere to do this.

Case 2:

* PS received an IP address change event from NM (172.x) * Once again, it asked Gabble to try connecting * Once again, IPv6 failed but IPv4 succeeded * Once again, we never got any data back from the server, so didn't get any closer to signalling CONNECTED

Diagnosis: the same

Case 3:

* PS received a change to a 172.x address from NM * It asked Gabble to try connecting * The IPv4 address signalled by network-manager became invalid ("IP4 address now None"), so PS asked Gabble to disconnect * PS received a change to a 169.x link-local address from NM * It tried telling Gabble to connect again (probably we shouldn't even try this, because with a 169.x address and no nameserver or default gateway you won't get far...) * Unsurprisingly, Connect() failed because DNS resolution of jabber.laptop.org failed * PS received a change to 172.x from NM and told Gabble to connect again * DNS still failed, so still no connection * It appears the retry timer wasn't reached, so by the time you gave up, PS hadn't retried

Diagnosis: your network isn't stable enough for Gabble to get a connection

Situations where Gabble can't possibly connect, so it can't be considered a PS or Gabble bug that it fails to do so:

* Can't reach DNS or can't resolve server hostname (to test: "getent hosts jabber.laptop.org", replacing hostname if necessary) * Can't get IP to the server (to test: "ping jabber.laptop.org") * Can't get bidirectional TCP to the server (to test: "telnet jabber.laptop.org 5222", type "hello" and press Enter, you should get a response containing "xml-not-well-formed" and your connection should get closed by the remote host)

follow-up: ↓ 10   Changed 7 years ago by smcv

  • cc smcv added
  • component changed from presence-service to network manager

... in fact, since Presence Service isn't responsible for ensuring IP connectivity and DNS, let's hand this bug back.

in reply to: ↑ 9   Changed 7 years ago by yani

  • priority changed from high to blocker
  • component changed from network manager to presence-service

Indeed, your comments are very helpful.

However, there is nothing wrong with the IP connectivity, as far as I can tell that is. $getent hosts jabber.laptop.org #tried it several times

18.85.46.41 jabber.laptop.org <--- rare output 2001:4830:2446:ff00:201:6cff:fe07:68ec jabber.laptop.org <--- common ouput

$ping jabber.laptop.org

PING jabber.laptop.org (18.85.46.41) 56(84) bytes of data. 64 bytes from jabber.laptop.org (18.85.46.41): icmp_seq=1 ttl=63 time=67.4 ms 64 bytes from jabber.laptop.org (18.85.46.41): icmp_seq=14 ttl=63 time=17.9 ms ...

i am also uploading the logs for the above example, in case something has changed

i did not test telnet, since it is not available in the builds

so everything seems pretty fine

I also tried PRESENCE_SERVICE_DEBUG=disable-salut, but no success

this bug is very common the last days in the lab, and doesnt look like a networkmanager issue

Changed 7 years ago by yani

  Changed 7 years ago by smcv

  • status changed from new to assigned

The netlog tarball isn't helpful - you don't have debug logs enabled. I believe the procedure these days is to move /home/olpc/.xsession-debug to /home/olpc/.xsession and edit it, or something of the sort. Enabling debug logs for Gabble, Salut and Presence Service would be extremely helpful.

If the XOs have netcat instead of telnet, "nc jabber.laptop.org 5222" would be a suitable replacement for the telnet command I quoted. If they don't, they probably should...

  Changed 7 years ago by smcv

Daf has had an idea about a possible reason for failure to connect - he theorizes that glibc is caching wrong resolv.conf information in the Gabble process, and failing to refresh it when NetworkManager rewrites resolv.conf. If Gabble exits and is re-run, it might pick up the new resolv.conf.

Yani: next time you observe this situation, after doing the getent/ping/nc sequence, could you please run the command "pkill telepathy-g"? If pkill isn't shipped on the XOs, try "killall telepathy-gabble", or any other convenient way to kill the telepathy-gabble process.

Because of morgs' recent work on making Presence Service cope with connection manager crashes, after around 5 seconds, Gabble should get run again, and have another go at connecting. If Daf is right, this time it will re-read resolv.conf and actually find a DNS server.

This bug might be causing #4700, so if we can find a solution, I think it's a candidate for Update.1.

  Changed 7 years ago by jg

  • cc daf, added
  • milestone changed from Update.2 to Update.1

  Changed 7 years ago by daf

  • status changed from assigned to new
  • owner changed from smcv to robot101
  • cc robot101 added; removed

I suspect that some instances of this are due to a Gabble crash that Rob and I found today. Rob has a patch which we'll try and get into Ship.1.

  Changed 7 years ago by daf

Er, I meant Update.1.

The problem happens when we have multiple addresses for the Jabber server and connecting to the first one fails. This commonly happens when name resolution yields an IPv6 address and an IPv4 address, but connecting to the IPv6 address fails for some reason.

  Changed 7 years ago by robot101

  • status changed from new to assigned

Exciting discovery of the day... jabber.laptop.org has an IPv6 address and an IPv4 address. The IPv6 address is a cunning decoy which isn't actually reachable on port 5222. Loudmouth 1.2.x's code for falling back to other resolver results and retrying connecting is very subtly broken, leading to leaking a GSource which then fires after the connection has been successfully established, reading uninitialised memory, and then most probably crashing.

Ubuntu has a  horrific workaround which they've applied and Debian has synced in, but I think the attached patch fixes the real issue. In some connection failure paths, _lm_connection_failed_with_error was being called and kicking off a second connection attempt, but then the source ID was being overwritten, meaning the source wasn't destroyed, so the callback was still being called after connect_data had been freed. I guess at some point in the past _lm_connection_failed_with_error didn't exist or do retries, so zeroing out the source ID was a reasonable thing to do in case of failure. I've also added some asserts which should trigger if my understanding of the problem and solution are wrong, and these don't trigger on my machine in either the failure or success cases.

Will try and push the patch up to Fedora (and Debian & Ubuntu too :D).

Changed 7 years ago by robot101

fix GSource leak and resulting SEGV in Loudmouth

  Changed 7 years ago by robot101

  • cc kimquirk added

Brian Pepple has put a loudmouth update for FC7 into Koji with this patch, see [build info  http://koji.fedoraproject.org/koji/buildinfo?buildID=24163].

  Changed 7 years ago by gdesmott

  • cc gdesmott added

  Changed 7 years ago by robot101

  • cc cscott, jg added
  • owner changed from robot101 to dgilmore
  • status changed from assigned to new
  • component changed from presence-service to telepathy-other

Reassigning to Dennis so that we can pull loudmouth-1.2.3-1.fc7 into the build from FC7 updates. CC'ing Jim and Scott for approval.

  Changed 7 years ago by jg

approved.

  Changed 7 years ago by dgilmore

  • status changed from new to closed
  • resolution set to fixed

loudmouth-1.2.3-1.fc7 has been tagged for olpc2-update1

  Changed 7 years ago by robot101

  • status changed from closed to reopened
  • resolution fixed deleted

mstone also tagged it in Koji for dist-olpc2 which is where he said joyride currently pulls from, but the joyride build 268 from just now still pulled in loudmouth 0:1.2.2-3.fc7. Any ideas?

  Changed 7 years ago by dgilmore

  • status changed from reopened to closed
  • resolution set to fixed

mstone's tagging of the package did not work for some reason. I untagged it and retagged it it is now there

Note: See TracTickets for help on using tickets.