Ticket #10780 (closed defect: fixed)

Opened 3 years ago

Last modified 7 months ago

Sugar Network view looses track of available networks but iwlist is fine

Reported by: greenfeld Owned by: dsd
Priority: normal Milestone: 11.3.0
Component: network manager Version: Development build as of this date
Keywords: Cc:
Action Needed: no action Verified: no
Deployments affected: Blocked By:
Blocking:

Description

On rare occasion (maybe once or twice before, including with 10.1.3 builds), I have seen Sugar's Network view not show any networks other than the Adhoc networks after waking up from an extended suspend of the laptop (but not to the point the screen blanked, although it may have dimmed). The "iwlist eth0 scan" command run manually works correctly, but does not populate the Network view of networks.

This happened once so far with 11.2.0 os14. I'm not quite certain at which level this needs to be debugged, nor how to debug it at the Network Manager level should it occur again.

Attachments

messages (266.5 kB) - added by greenfeld 3 years ago.

Change History

Changed 3 years ago by greenfeld

Possible non-minimal reproduction steps on an XO 1.5 which have worked a few times (at least the SN SHF00000 development system I have with a serial cable active in it):

This may be system specific; erikos was shown this but could not reproduce it on his laptop. Tested on 11.2.0 os14.

  1. Set libertas_debug to 0x6000 with a serial cable plugged in. Set /proc/sys/kernel/sysrq to 1 (enabled) and send a sysrq-trigger request of "9" to set the console log level so you can see the libertas messages.
  2. Open the terminal activity.
  3. With the terminal activity active, press the power button once. The laptop will go into sleep mode.
  4. Make sure the wireless card is fully asleep (serial console is useful for this), then wake the laptop up again with the power button.
  5. Quickly switch to the Network View and connect to Ad-Hoc network 1. Keep the laptop awake by moving the mouse, etc. The Network view ideally will not refresh to show nearby access points, although you may see the wireless LED periodically blink for scans. <br> {If you are at the Network view at wakeup time, a refresh automatically happens which does not seem to hit this error, so you have to be at the Terminal activity or elsewhere when falling asleep.}
  6. Run "iwlist scan" as root in the Terminal activity. You will see nearby access points listed. But the Sugar Network view of the network access points will not be updated.
  1. Disconnect from the ad-hoc network.
  2. Wait until the laptop's screen dims a bit (but does not go dark), then wake it up.
  3. With luck within a minute watching the Network view while keeping the laptop awake. You ideally will see the nearby available access points again.

Changed 3 years ago by greenfeld

May or may not be suspend/resume related; see attached /var/log/messages:

Mar 31 01:39:09 xo-3b-3d-76 NetworkManager[1507]: <info> (eth0): avahi-autoipd timed out. Mar 31 01:39:09 xo-3b-3d-76 NetworkManager[1507]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Timeout) scheduled... Mar 31 01:39:09 xo-3b-3d-76 NetworkManager[1507]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Timeout) started... Mar 31 01:39:09 xo-3b-3d-76 NetworkManager[1507]: <info> (eth0): device state change: 7 -> 9 (reason 5) Mar 31 01:39:09 xo-3b-3d-76 NetworkManager[1507]: <warn> Activation (eth0) failed for access point (Ad-hoc Network 1) Mar 31 01:39:09 xo-3b-3d-76 NetworkManager[1507]: <info> Marking connection 'Ad-hoc Network 1' invalid. Mar 31 01:39:09 xo-3b-3d-76 NetworkManager[1507]: <warn> Activation (eth0) failed. Mar 31 01:39:09 xo-3b-3d-76 NetworkManager[1507]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Timeout) complete. Mar 31 01:39:09 xo-3b-3d-76 NetworkManager[1507]: <info> (eth0): device state change: 9 -> 3 (reason 0) Mar 31 01:39:09 xo-3b-3d-76 NetworkManager[1507]: <info> (eth0): deactivating device (reason: 0). Mar 31 01:39:11 xo-3b-3d-76 kernel: [ 294.078805] PM: Syncing filesystems ... done.

The wifi card may have a bit of a race between sleeping and scanning, but I saw a case in the log of this without this problem occurring. I don't know if this works fine but just wasn't seen before because libertas_debug was set:

Mar 31 01:31:57 xo-3b-3d-76 kernel: [ 266.077173] viafb_suspend! Mar 31 01:31:57 xo-3b-3d-76 kernel: [ 266.091600] viafb 0000:00:01.0: PCI INT A disabled Mar 31 01:31:57 xo-3b-3d-76 kernel: [ 266.096723] libertas: Suspend without wake params -- powering down card. Mar 31 01:31:57 xo-3b-3d-76 kernel: [ 266.190049] libertas: SCAN_CMD failed Mar 31 01:31:57 xo-3b-3d-76 kernel: [ 266.361536] mmc1: card 0001 removed

Changed 3 years ago by greenfeld

Changed 3 years ago by dsd

If you reproduce this again, please post more logs.

Looking at the log that has been posted... I know this is going back to March, but can you remember the details of what happened there? I clearly see NetworkManager becoming unhappy as you highlight, but it does look like the system was put into suspend soon after, which caused complete reinitialisation of the network card meaning that networks would have then started appearing again.

In other words, can this bug be "cleared" with a power-button suspend after it occurs?

Nevertheless, it looks like NetworkManager gets confused when going into suspend while in the middle of connecting to an ad-hoc network. We should avoid suspend when NM is connecting, but also perhaps NM could be more resilient here.

Changed 3 years ago by dsd

Testing NetworkManager-0.8.4-1.fc14.i686 on 11.2.0 candidate 870, I can't cause bad things to happen with the specific scenario of forcing a wifi-powered suspend while connecting to adhoc. On resume, NM reports an avahi timeout as in your logs, and "deactivates" the device, but everything remains functional otherwise. The neighborhood view remains populated and working, and NM continues its periodic scanning routine.

Various NM bugfixes have been added to our build (via fedora) in the last few months so there is a chance this has been fixed already.

Next steps are to gather more logs from when this occurs, and when I have time I will run the same test on build 14, and also try the less trivial test routine that you suggest.

Changed 3 years ago by greenfeld

Just a quick weekend clarification:

This is definitely still going on, and reproduces at a frequency around that of the libertas issue.

Something I've found which (if I recall correctly) "clears" the issue is using iwconfig to set a desired AP manually. This causes the Network view to see an icon for the network, followed by Sugar immediately connecting and switching the XO to an Ad-hoc network. I then connect back to the desired network, and the view starts filling up with APs seen in the local area as NM runs scans.

I will try to get more logs if I see it later this week. I don't know if this is related to #10386 or not.

Changed 3 years ago by dsd

I can reproduce this on XO-1, 11.3.0 build 1, with:

  • Boot
  • Switch to neighborhood view
  • Over serial: dmesg -n 8 ; echo 0x1a3 > /sys/module/libertas/parameters/libertas_debug
  • Power button suspend
  • Power button resume
  • No nets appear in Sugar, but debug messages show them being detected, and iwlist can list them

Changed 3 years ago by dsd

  • next_action changed from diagnose to add to build

This is a race. libertas is announcing the scan-results-available event before recording the fact that scanning has completed. If userspace tries to retrieve scan results after that event has arrived, but before the driver has marked the scan as completed, the scan results will not be presented (on the incorrect pretense that the scan is still ongoing).

It was much easier to reproduce this with debugging enabled, as some debug messages are printed within the race window, meaning that the window is wider.

Fixed with http://dev.laptop.org/git/olpc-2.6/commit/?h=olpc-2.6.35&id=66488fe0e4b6d98a8a525cc509b293e29e9be645

This bug is already fixed in the upstream kernel, where the scan results and event handling is now handled by an upper layer.

Changed 3 years ago by dsd

  • next_action changed from add to build to test in build

test in 11.3.0 build 2

Changed 3 years ago by greenfeld

  • status changed from new to closed
  • next_action changed from test in build to no action
  • resolution set to fixed

This appears to be fixed under normal conditions as well as with the reproduction steps above provided by dsd in 11.3.0 os3.

Changed 7 months ago by Quozl

Reviewing the comments in this ticket, it would appear that #12757 was likely also a contributing cause to scan results not including access points.

Note: See TracTickets for help on using tickets.