Ticket #7785 (new defect)

Opened 6 years ago

Last modified 6 years ago

unreliable ethernet connection

Reported by: AlbertCahalan
Owned by: dcbw
Priority: high
Milestone: 9.1.0-cancelled
Component: network manager
Version: not specified
Keywords:
Cc: gregorio, mikus
Action Needed: never set
Verified: no
Deployments affected: 8.2
Blocked By:
Blocking:

Description

(joyride 2241, from August 2008)

The laptop has a tendency to:

1. pick a bogus IP address

2. choose wireless, despite having a USB ethernet device

I wouldn't have plugged in an ethernet device just for decoration. Obviously I expect to use it. If I wanted wireless, I'd have not plugged in the ethernet device. If the laptop sees an ethernet device, then there is only one reasonable assumption: the user expects to use ethernet.

Picking a bogus IP address is never useful. It just gets in the way of getting a real one. I suspect that this happens when the DHCP server is slow to respond. Once I get the bogus address, there is no reasonable way to recover. I can cut power to the laptop, forcing things to start over, but that is not reasonable.

Attachments

nm-tool.out (1.6 kB) - added by mikus 6 years ago.
logs.CSN74804910.2009-01-04.20-06-41.tar.bz2 (1.3 MB) - added by mikus 6 years ago.
olpc-log a while after XO had failed in trying to switch from an ethernet connection to a wireless connection
logs.SHC839024E2.2009-01-05.00-47-38.tar.bz2 (315.4 kB) - added by mikus 6 years ago.
olpc-log taken after unplugging and replugging ethernet cable
logs.SHC839024E2.2009-01-05.00-52-03.tar.bz2 (428.7 kB) - added by mikus 6 years ago.
olpc-log taken later, after having done ctl-alt-erase to restart Sugar

Change History

follow-up: ↓ 2   Changed 6 years ago by pgf

can you be more specific about "picking a bogus IP address"? what address do you see chosen?

in reply to: ↑ 1   Changed 6 years ago by AlbertCahalan

Replying to pgf:

can you be more specific about "picking a bogus IP address"? what address do you see chosen?

The laptop picks a 169.254.xx.xx address. Theoretically, this could be useful under some extremely rare circumstances. Normally it just means connection trouble.
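For anyone who wants to check for this condition from a script, here is a minimal sketch. It relies on Python's standard `ipaddress` module, which treats 169.254.0.0/16 as link-local; the helper name `is_zeroconf_fallback` is made up for illustration and is not part of any XO tool.

```python
import ipaddress

def is_zeroconf_fallback(addr: str) -> bool:
    """Return True if addr is an IPv4 link-local (169.254.0.0/16) address."""
    ip = ipaddress.ip_address(addr)
    # Restrict to IPv4 so that IPv6 link-local (fe80::/10) is not flagged.
    return ip.version == 4 and ip.is_link_local

print(is_zeroconf_fallback("169.254.10.20"))  # True
print(is_zeroconf_fallback("192.168.1.8"))    # False
```

The address itself still has to be read from the interface (for example with ifconfig); the sketch only classifies it.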

Problem 1: this is silent; the user is not alerted to the problem and has no GUI method to deal with it

Problem 2: having chosen the bogus address, the laptop is now satisfied -- it will not continue trying to find a working address

Put this all together, and a temporary network problem (for example, a slow DHCP server) means that the laptop silently loses connectivity until the user reboots. There is no hint as to what went wrong. All the user knows is that his web browser keeps failing. Experienced XO users will learn that the laptop is a flakey machine that just needs a reboot every now and then... kind of like Windows 95. This is terrible.

follow-up: ↓ 4   Changed 6 years ago by kimquirk

  • cc gregorio added
  • milestone changed from 8.2.0 (was Update.2) to 9.1.0

Not a blocker for 8.2. Can you explain how this affects the user?

in reply to: ↑ 3   Changed 6 years ago by AlbertCahalan

Replying to kimquirk:

Not a blocker for 8.2. Can you explain how this affects the user?

I believe I did:

Put this all together, and a temporary network problem (for example, a slow DHCP server) means that the laptop silently loses connectivity until the user reboots. There is no hint as to what went wrong. All the user knows is that his web browser keeps failing. Experienced XO users will learn that the laptop is a flakey machine that just needs a reboot every now and then... kind of like Windows 95. This is terrible.

Normal laptops (Linux, Windows, or MacOS) will recover from a network glitch. The XO needs a reboot.

follow-up: ↓ 8   Changed 6 years ago by gregorio

Hi Albert,

Sorry I still don't quite understand this one.

I agree that the XO should use the wire if it's connected. I added that as something to consider in the next release at: http://wiki.laptop.org/go/9.1.0#Network_Manager_Connections

A few more questions on this particular bug: are you saying that any interruption of signal on the wire brings the connection down? e.g. if I unplug the wire then plug it back in 3 seconds later.

When does the XO pick a "bogus IP"? I think we only go out to dhcp at startup. Once you have an IP and are on the network, is there any time when you would need to contact the dhcp server again?

Did you find any other specific cases beyond the trouble connecting to the dhcp server?

How do we reproduce this problem? How do you know that the wireless didn't connect, find a dhcp server and get the IP address that way?

How slow was the dhcp server? How do you know it was slow? Did you get an IP address and a default gateway from DHCP (anything else)?

If this is just a suggestion then there's no need to spend a lot of time on it. If there's a failure case for USB -> Ethernet connections we probably can't hold up this release for it unless it is fatal in a large fraction of cases. If this problem also affects the wireless connections and there are a significant number of cases then we would consider it for blocker status.

Thanks,

Greg S

follow-up: ↓ 7   Changed 6 years ago by pgf

without having looked at NM, i'm sure the problem is that if the laptop doesn't find or hear from a DHCP server on the ethernet within a certain (fairly short) time after coming up, then it falls back to doing link-local IP autoconfiguration (zeroconf). albert has been using the term "bogus", but the address isn't "bogus" -- it's just not an address that's useful in most cases. (a typical Windows pc will choose the exact same sort of address in this circumstance.)

if there's a trivial change to the NM configuration that would keep it from falling back to zeroconf addressing in the absence of DHCP (for ethernet only, of course), then we should make that change.

in reply to: ↑ 6   Changed 6 years ago by AlbertCahalan

Replying to pgf:

without having looked at NM, i'm sure the problem is that if the laptop doesn't find or hear from a DHCP server on the ethernet within a certain (fairly short) time after coming up, then it falls back to doing link-local IP autoconfiguration (zeroconf).

It can also happen later. Remember that a DHCP address lease has a timeout; clients try to renew the lease before it expires.

albert has been using the term "bogus", but the address isn't "bogus" -- it's just not an address that's useful in most cases. (a typical Windows pc will choose the exact same sort of address in this circumstance.)

AFAIK, a Windows PC with a zeroconf address will keep looking for DHCP. In any case, that is the useful behavior.
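As a rough sketch of the behavior I mean (not how NM or any DHCP client is actually written): the link-local address is used only as a placeholder while DHCP keeps being retried. The callable `try_dhcp`, the placeholder address, and the attempt limit are all hypothetical.

```python
def acquire_address(try_dhcp, link_local="169.254.1.1", max_attempts=5):
    """Yield the address in use after each DHCP attempt.

    try_dhcp is a hypothetical callable that returns a leased address
    string, or None if no server answered in time.
    """
    for _ in range(max_attempts):
        leased = try_dhcp()
        if leased is not None:
            yield leased      # real address obtained, stop retrying
            return
        yield link_local      # fall back for now, but keep trying

# Simulated server that only answers on the third attempt.
replies = iter([None, None, "192.168.1.8"])
print(list(acquire_address(lambda: next(replies))))
# ['169.254.1.1', '169.254.1.1', '192.168.1.8']
```

A real client would wait between attempts rather than give up after a fixed count; the limit here only keeps the example finite.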

if there's a trivial change to the NM configuration that would keep it from falling back to zeroconf addressing in the absence of DHCP (for ethernet only, of course), then we should make that change.

This would solve the problems I suppose.

Zeroconf addresses probably should be chosen by the DHCP client daemon, not the NM daemon.

in reply to: ↑ 5   Changed 6 years ago by AlbertCahalan

Replying to gregorio:

A few more questions on this particular bug: are you saying that any interruption of signal on the wire brings the connection down? e.g. if I unplug the wire then plug it back in 3 seconds later.

I think "yes" if either:

a. the DHCP server is giving out leases that are shorter than 3 seconds

b. you get unlucky, with the lease expiring while the wire is unplugged

DHCP lease times will vary by many orders of magnitude. Probably almost all lengths will be from 15 minutes to 1 month. Length is determined by local policy, often based on expectations of computer movement and sometimes based on the number of addresses remaining.

When does the XO pick a "bogus IP"? I think we only go out to dhcp at startup. Once you have an IP and are on the network, is there any time when you would need to contact the dhcp server again?

You must contact the DHCP server before your lease expires. This is why the DHCP client must run as a daemon.
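To make the timing concrete, here is a small sketch using the default timer fractions from RFC 2131 (renewal at 50% of the lease, rebinding at 87.5%). The function names are illustrative, real clients add some random fuzz to these times, and the outage check assumes the last successful renewal happened at t=0 with no renewal possible while the link is down.

```python
def renewal_schedule(lease_seconds: float) -> dict:
    """Default DHCP timers, as offsets in seconds from the start of the lease."""
    return {
        "t1_renew": 0.5 * lease_seconds,     # client starts renewing with its server
        "t2_rebind": 0.875 * lease_seconds,  # client starts broadcasting to any server
        "expiry": lease_seconds,             # client must stop using the address
    }

def expires_during_outage(lease_seconds, outage_start, outage_end) -> bool:
    """True if the lease runs out while the link is down.

    Times are seconds since the last successful renewal (assumed at t=0).
    """
    return outage_start < lease_seconds <= outage_end

print(renewal_schedule(3600))
# {'t1_renew': 1800.0, 't2_rebind': 3150.0, 'expiry': 3600}
print(expires_during_outage(3600, 1000, 1003))  # False: 3-second unplug
print(expires_during_outage(3600, 1000, 4000))  # True: outage spans the expiry
```

This is why a 3-second unplug is normally harmless, while a longer outage that spans the expiry time leaves the client without a valid lease.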

Did you find any other specific cases beyond the trouble connecting to the dhcp server? How do we reproduce this problem? How do you know that the wireless didn't connect, find a dhcp server and get the IP address that way?

I checked with the ifconfig command, as root. The USB-to-ethernet device got a bogus address. A DHCP server over wireless would not (must not...) affect other devices. Also, no reasonable DHCP server would give out such an address.

To reproduce, boot the laptop with ethernet. Ask your sysadmin to tell you the length of your DHCP leases. Cut your connection until that amount of time has passed.

How slow was the dhcp server? How do you know it was slow? Did you get an IP address and a default gateway from DHCP (anything else)?

Typically the server itself is not at fault; something happens to the physical connection. For example, the power adapter to my ethernet switch might fall out of the wall socket. This could go unnoticed for some time if the laptop is not being used for web browsing, etc.

If this is just a suggestion then there's no need to spend a lot of time on it. If there's a failure case for USB -> Ethernet connections we probably can't hold up this release for it unless it is fatal in a large fraction of cases. If this problem also affects the wireless connections and there are a significant number of cases then we would consider it for blocker status.

I have had a few wireless problems that could have been caused by this, but I did not think to check with the ifconfig command.

  Changed 6 years ago by mikus

  • cc mikus added
  • owner set to dcbw
  • component changed from not assigned to network manager
  • deployment_affected set to 8.2

I completely agree with everything Albert Cahalan said:

I was testing with the 'staging-7' build. Had XO connected via ethernet (its IP was 192.168.1.8). Happened to click in Neighborhood view on an icon for an external radio signal (access point ?). The icon I had clicked on pulsed for a while, never got white rim, then went back to static as it was before.

A little later, I tried to use Browse, but it could not resolve the website IP addresses. I investigated, and found that my ETHERNET adapter had now been assigned a (zeroconf?) IP address in the 169.254.x.x range (so had the mesh, which before only had an IPv6 address).

After I rebooted, the XO had no problem again being given the 192.168.1.8 address by the DHCP server on my ethernet. I'm guessing that my clicking on the AP(?) caused my system to drop the ethernet connection -- but when, after failing to create a wireless connection, the system went back to using the ethernet adapter, it did so with an *internally* assigned IP address in the 169.254.x.x range.

[In other words, I do not believe that the system even *tried* to re-contact my external DHCP server when it was trying to re-activate the ethernet connection. There is NO WAY for that 169.254.x.x address to have come from the server on the ethernet. Also, the question has been asked before -- I am satisfied that my DHCP server is *quick* acting.]


The system should remember which IP address the ethernet connection was using when that connection was "broken" (in this case, the "breaking" was done by the system itself !). And if the system does not "remember" the former IP address, it ought to do as much DHCP dialoging on a "reconnect" as it does on a "reboot".

Changed 6 years ago by mikus

olpc-log a while after XO had failed in trying to switch from an ethernet connection to a wireless connection

  Changed 6 years ago by mikus

Recreated the problem on another system. This time I physically disconnected the ethernet cable for five minutes, then re-connected. Afterwards, the ethernet adapter had the useless 169.254.x.x IP address.

Note: not even restarting Sugar caused the useless IP address to go away. I had to manually 'ifconfig eth0 down', then 'ifconfig eth0 up 192.168.1.7' (with additional parameters and commands) to get the XO connected to my ethernet again.

Changed 6 years ago by mikus

olpc-log taken after unplugging and replugging ethernet cable

Changed 6 years ago by mikus

olpc-log taken later, after having done ctl-alt-erase to restart Sugar

  Changed 6 years ago by mikus

Joyride-2631. I am now applying a bypass to my system, which makes my ethernet connection reliable: Manually edit /etc/udev/rules.d/70-persistent-net.rules so that the wired ethernet adapter is assigned to 'eth1', while the wifi radio is assigned to 'eth0'.


But having gotten the system to behave consistently when handling my ethernet connection, I would now want additional *manual* control over it:

  • The icon for 'Wired' should show in Frame as long as the ethernet adapter is plugged in, irrespective of whether there exists an active connection through that adapter.
  • When there is no active wired connection, the palette for the 'Wired' icon in Frame should show the entry 'Connect'. Clicking on that entry should result in the same sequence of actions as would clicking on 'Connect' in the palette of an Access_Point icon in Neighborhood View.
  • When there is an active wired connection, the palette for the 'Wired' icon in Frame should show the entry 'Disconnect'. Clicking on that entry should result in the same sequence of actions as would clicking on 'Disconnect' in the palette of the 'Wifi' icon in Frame.