Opened 6 years ago

Closed 5 years ago

Last modified 5 years ago

#8104 closed defect (fixed)

dhclient dbus message gets lost

Reported by: dsd Owned by: cjb
Priority: normal Milestone: 8.2.2
Component: network manager Version: not specified
Keywords: joyride-2301:- 8.2.0? relnote cjbfor8.2 cjbfor9.1.0 ml8.2.3 Cc: cjb, martin.langhoff
Blocked By: Blocking:
Deployments affected: Action Needed: add to build
Verified: no

Description

We suspended about 10 systems running joyride-2301. 3 of them resumed but did not re-establish their connection to their infrastructure network.

This is strange because they did actually reassociate and get an IP address, but NetworkManager did not realise:

Aug 22 14:32:13 localhost NetworkManager: <info>  Old device 'eth0' activating, won't change.
Aug 22 14:32:14 localhost avahi-daemon[1086]: Registering new address record for fe80::217:c4ff:fe10:ef5f on eth0.*.
Aug 22 14:32:15 localhost dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 7
Aug 22 14:32:15 localhost dhclient: DHCPOFFER from 172.18.0.1
Aug 22 14:32:15 localhost dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Aug 22 14:32:15 localhost dhclient: DHCPACK from 172.18.0.1
Aug 22 14:32:15 localhost dhclient: bound to 172.18.0.189 -- renewal in 5074 seconds.
Aug 22 14:32:57 localhost NetworkManager: <info>  Device 'eth0' DHCP transaction took too long (>45s), stopping it.
Aug 22 14:32:57 localhost NetworkManager: <info>  nm_dhcp_manager_cancel_transaction_real(): enter, pid 2992
Aug 22 14:32:57 localhost NetworkManager: <info>  nm_dhcp_manager_cancel_transaction_real(): kill -TERM-ing pid 2992
Aug 22 14:32:57 localhost NetworkManager: <info>  nm_dhcp_manager_cancel_transaction_real(): waiting for 2992 to exit
Aug 22 14:32:57 localhost NetworkManager: <info>  nm_dhcp_manager_cancel_transaction_real(): i was 19 after wait pid 2992
Aug 22 14:32:57 localhost NetworkManager: <info>  eth0: canceled DHCP transaction, dhclient pid 2992

Attachments (9)

messages.txt (46.9 KB) - added by dsd 6 years ago.
/var/log/messages
dbus.txt (29.6 KB) - added by dsd 6 years ago.
dbus monitor logs
messages2.txt (27.7 KB) - added by dsd 6 years ago.
/var/log/messages
nm-dhcp-client-action.c (4.4 KB) - added by dcbw 6 years ago.
hacked-up 0.7 dhclient callout to test race hypothesis
nm-dhcp-action (9.7 KB) - added by dsd 6 years ago.
compiled nm-dhcp-client-action.c
logs.CSN748028A5.2009-03-17.08-38-08.tar.bz2 (264.2 KB) - added by mikus 6 years ago.
801. Log taken after twice clicking on AP icon in Neighborhood View -- both times AP icon pulsed for a while, then went back to "unconnected".
logs.CSN748028A5.2009-03-18.12-51-29.tar.bz2 (363.8 KB) - added by mikus 6 years ago.
connect to AP on channel 10 failed (binary permissions 0777)
logs.CSN748028A5.2009-03-18.13-43-16.tar.bz2 (564.1 KB) - added by mikus 6 years ago.
connect to AP on channel 1 failed (binary permissions 0777)
logs.CSN748028A5.2009-03-18.13-48-16.tar.bz2 (681.6 KB) - added by mikus 6 years ago.
connect to AP on channel 1 succeeded (replaced binary with previous batch script)

Download all attachments as: .zip

Change History (43)

Changed 6 years ago by dsd

/var/log/messages

comment:1 Changed 6 years ago by dsd

  • Cc cjb added

Some systems were suspended by pressing the power button, others by closing the lid. All 3 failed systems were able to reconnect after manually clicking on the AP again in neighbourhood view.

comment:2 in reply to: ↑ description Changed 6 years ago by cjb

That's interesting. Could really use a fix for this.

comment:3 Changed 6 years ago by dcbw

Please run 'dbus-monitor --system' before suspending and then reproduce the issue on resume and grab the output. It's one of a few things:

1) dhclient is failing to execute the action script
2) the action script is exacerbating a dbus race where short-lived processes last signals/messages sometimes don't get delivered successfully
3) NM is for some reason not correctly listening for the incoming message
4) The OOM killer is killing the action script

Running dbus-monitor and seeing if the DHCP messages are getting across the system bus will isolate the problem to either NM or to dhclient+script.

comment:4 follow-up: Changed 6 years ago by cjb

Interesting, I can't reproduce this at home (although I guess it was sporadic at 1cc). I'm connected to AP, I suspend/resume, it looks for mesh, scan results for eth0 comes in, it gives up on mesh and reconnects to the AP automatically.

I'll get dbus-monitor logs from the machines in the office.

comment:5 in reply to: ↑ 4 ; follow-up: Changed 6 years ago by thomaswamm

Replying to cjb:

Interesting, I can't reproduce this at home...

Much to my surprise, I also could not reproduce this. I had thought that suspend kills WiFi until a manual Connect. But from this ticket I learn that reconnection is supposed to be automatic. So I try it, grit my teeth and wait, and voila, the XO does automagically reconnect a couple minutes after a resume event (via Power button or lid opening). Surfing thru Trac is educational.

I would prefer it try the access point first, then try mesh only as a last resort. There are no other XO's in my neighbourhood, so mesh is a waste of time.

comment:6 in reply to: ↑ 5 Changed 6 years ago by cjb

Replying to thomaswamm:

I would prefer it try the access point first, then try mesh only as a last resort. There are no other XO's in my neighbourhood, so mesh is a waste of time.

This is a case of G1G1 conflicting with deployments; if there's a school server in the environment, we need to search for and use that before falling back to an AP.

comment:7 Changed 6 years ago by dsd

Just saw this again on a clean boot, so it's probably not related to suspend/resume

Changed 6 years ago by dsd

dbus monitor logs

Changed 6 years ago by dsd

/var/log/messages

comment:8 Changed 6 years ago by dsd

It happened again on the next boot. I was running and logging dbus-monitor that time. See attached logs.

comment:9 Changed 6 years ago by dcbw

dbus.txt indicates that dhclient executed the script successfully for the PREINIT state, but that dhclient never successfully executed the script for the BOUND state. Thus, the problem appears to be somewhere between dhclient and the NM dhclient-script. Possibly out-of-memory errors or that D-Bus race (that should be fixed already in Fedora 9 I'm pretty sure)...

So the issue is not with NM, but definitely with dhclient and/or dbus since for some reason the correct DHCP events do not get sent out over the bus, and thus NM obviously doesn't know about the successful BIND state of dhclient.

comment:10 Changed 6 years ago by cjb

Thanks, Dan. I don't think we'd get OOM straight after boot, so I guess we should try the race. Is there a "sleep" we can stick somewhere to verify, or something like that?

comment:11 Changed 6 years ago by dcbw

Not with the 0.6.x olpc branch NetworkManager, because the dbus-send call will exit when it's done. Could modify the 0.7 branch dhclient callout (which is in C and thus you can sleep(5) before closing the d-bus connection) to send back the same format as the 0.6.x one, or try to figure out where the race stuff ended up in dbus and test-patch dbus for that.

References (same message):
http://lists.freedesktop.org/archives/dbus/2008-March/009543.html
http://marc.info/?l=freedesktop-dbus&m=120667412413149&w=2 (better patch handling in the web ui)

Maybe try the second patch (dbus-race-hack.diff) and see if that seems to fix the issue in a test build?

Changed 6 years ago by dcbw

hacked-up 0.7 dhclient callout to test race hypothesis

comment:12 Changed 6 years ago by dcbw

Or try the nm-dhcp-client-action.c I just added, compile-tested-only. Replace the existing callout from the OLPC NM 0.6.x. It should send the data in the same format as the 0.6.x shellscript callout. It'll sleep for 3 seconds before closing the dbus connection.

comment:13 Changed 6 years ago by dcbw

gcc -g -o nm-dhcp-client.action pkg-config --libs --cflags dbus-glib-1 glib-2.0 nm-dhcp-client-action.c

comment:14 Changed 6 years ago by dcbw

thanks, trac. use backticks right before pkg-config and right after the glib-2.0 of course

comment:15 Changed 6 years ago by kimquirk

  • Action Needed changed from never set to code
  • Keywords 8.2.0? added
  • Milestone changed from Not Triaged to 8.2.1

Daniel showed this to me today on a laptop that failed to connect to the TestAP. When you click on the access point in the neighborhood view, then it will connect. So this is not a blocker. It would be great to get a fix for 8.2.1 and get it into joyride soon so it gets appropriate testing.

comment:16 Changed 6 years ago by dsd

I built the current dbus with dbus-race-hack.diff: http://dev.laptop.org/~dsd/20080828/dbus-1.2.1-2.olpc3.1.i386.rpm

comment:17 Changed 6 years ago by mstone

  • Keywords relnote added

This is important to us but definitely not a blocker -- it either needs to be fixed or carefully relnoted. We will happily accept fixes until next week, after which time it would be harder to justify fixing this.

comment:18 Changed 6 years ago by cjb

Okay, I'm working on this today.

comment:19 Changed 6 years ago by cjb

Well, it seems to be working, but at the moment I'm on a sample size of one..

comment:20 Changed 6 years ago by dsd

  • Summary changed from NetworkManager failed to reconnect after resume to dhclient dbus message gets lost

Some more data: While attempting to reproduce another bug, I booted 6 laptops 5 times. Out of the 30 boots, I saw this bug once. (this was using vanilla 8.2-757, without applying any dhclient/dbus fixes)

comment:21 Changed 6 years ago by cjb

  • Keywords cjbfor8.2 added

Just saw this again. I think we should get the dbus-race-hack into Joyride, although I still don't know for sure whether it helps.

comment:22 Changed 6 years ago by mstone

  • Action Needed changed from code to add to build

Fine by me.

comment:23 Changed 6 years ago by cscott

  • Owner changed from dcbw to cjb

comment:24 Changed 6 years ago by mstone-xmlrpc

  • Keywords cjbfor9.1.0 added
  • Milestone changed from 8.2.1 to 9.1.0

Pushing out to 9.1.0, per edmcnierney's request.

comment:25 Changed 6 years ago by dsd

Some messy scripts I'm using to reproduce this, which connect/reconnect in a loop until failure:

main script

while true; do
	echo "connecting"
	sh nmconnect.sh
	conn=0
	for i in $(seq 0 10); do
		sleep 5
		if ./nmchkconn.sh; then
			conn=1
			break
		fi
	done
	if [[ $conn == 0 ]]; then
		echo "failed to connect!"
		exit 1
	fi 
	echo "connected, now mesh"
	sh nmmesh.sh
	sleep 10
done

nmconnect.sh

dbus-send --print-reply --system --dest=org.freedesktop.NetworkManager \
	/org/freedesktop/NetworkManager \
	org.freedesktop.NetworkManager.setActiveDevice \
	objpath:"/org/freedesktop/NetworkManager/Devices/eth0" \
	string:"default"

nmchkconn.sh

out=$(dbus-send --print-reply --system --dest=org.freedesktop.NetworkManager \
	/org/freedesktop/NetworkManager/Devices/eth0 \
	org.freedesktop.NetworkManager.Devices.eth0.getProperties | \
	sed -n 7p)

state=${out:10:1}
echo "state $state"
[[ $state == "7" ]] && exit 0
exit 1

nmmesh.sh

dbus-send --print-reply --system --dest=org.freedesktop.NetworkManager \
	/org/freedesktop/NetworkManager \
	org.freedesktop.NetworkManager.setActiveDevice \
	objpath:"/org/freedesktop/NetworkManager/Devices/msh0"

Changed 6 years ago by dsd

compiled nm-dhcp-client-action.c

comment:26 Changed 6 years ago by dsd

I've attached a compiled version of nm-dhcp-client-action.c which I can confirm fix the problems. The scripts above reproduce the issue for me usually in less than 10 minutes (it helps to close sugar first), and Dan's nm-dhcp-client-action.c hack has survived the same script for more than 3 hours before I got bored. I'm going to include this in Paraguay's deployment.

Also made some noise on the dbus mailing list as it looks like this 5-year old issue still exists even in the latest code :(

comment:27 Changed 6 years ago by mikus

Ran with 801. [I believe the fix is compatible with NM 0.6] With the binary nm-dhcp-client.action inserted, my system was unable to connect to a channel 6 AP (which does not require a key). With the previous nm-dhcp-client.action script put back in place, the system connected immediately.

Changed 6 years ago by mikus

  1. Log taken after twice clicking on AP icon in Neighborhood View -- both times AP icon pulsed for a while, then went back to "unconnected".

comment:28 Changed 6 years ago by dsd

Thanks for testing.

Mar 17 13:37:18 localhost dhclient: execve (/etc/NetworkManager/callouts/nm-dhcp-client.action, ...): Permission denied

Make sure that it is readable and executable by all users and then it should work

Changed 6 years ago by mikus

connect to AP on channel 10 failed (binary permissions 0777)

Changed 6 years ago by mikus

connect to AP on channel 1 failed (binary permissions 0777)

Changed 6 years ago by mikus

connect to AP on channel 1 succeeded (replaced binary with previous batch script)

comment:29 Changed 6 years ago by mikus

bash-3.2# ls -la /etc/NetworkManager/callouts/
total 6
drwxr-xr-x 2 root root    0 2009-03-18 13:45 .
drwxr-xr-x 4 root root    0 2009-03-15 05:54 ..
-rwxr-xr-x 4 root root  500 2007-11-02 07:00 nm-avahi-autoipd.action
-rwxr-xr-x 4 root root  375 2007-11-02 07:00 nm-dhcp-client.action
-rwxrwxr-x 1 root root 4781 2009-03-17 14:12 nm-dhcp-client.action.new
bash-3.2# md5sum /etc/NetworkManager/callouts/nm-dhcp-client.action.new
13ce3072de58a1f25f668bcde406cd0d  /etc/NetworkManager/callouts/nm-dhcp-client.action.new
bash-3.2#

Whenever I want to test the binary, I rename client.action to client.action.prv, and client.action.new to client.action (and reboot)

comment:30 Changed 6 years ago by dsd

Either you downloaded the file incorrectly or have experienced corruption. It should be 9932 bytes with md5sum be0181823b6442c8eb73397f1d40cf6c. I just downloaded it from this ticket and confirmed that it is correct here.

comment:31 Changed 5 years ago by martin.langhoff

  • Cc martin.langhoff added
  • Keywords ml8.2.3 added

comment:32 Changed 5 years ago by martin.langhoff

The mentioned C implementation of nm-dhcp-client.action is now rolled into an RPM and tests OK - here

http://dev.laptop.org/~martin/public_rpms/8.2-papercuts/NetworkManager-0.6.5-0.14.svn3246.olpc3.i386.rpm

comment:33 Changed 5 years ago by martin.langhoff

  • Milestone changed from 9.1.0-cancelled to 8.2.2
  • Resolution set to fixed
  • Status changed from new to closed

comment:34 Changed 5 years ago by martin.langhoff

Daniel is tracking this for the F11 builds as #9750

Note: See TracTickets for help on using tickets.