Ticket #8592 (closed defect: worksforme)

Opened 6 years ago

Last modified 3 years ago

bug or feature? XO hogs CPU trying to find Mesh while already connected to Access Point

Reported by: thomaswamm Owned by: marco
Priority: high Milestone: 9.1.0-cancelled
Component: sugar Version: Development build as of this date
Keywords: relnote blocks-:8.2.0 cjbfor9.1.0 Cc: mstone, marco, mtd, gregorio, dsd
Action Needed: diagnose Verified: no
Deployments affected: Blocked By: #4074, #6944
Blocking:

Description

Testing beta build 8.2-760. Found a situation that hogs CPU time. There seem to be multiple variations, but I will try to simplify:

When I resume from suspend (using power button), my XO tries first to find a mesh. I am impatient to get back online, so in Neighborhood view I click on my WiFi Access Point, which then connects within a few seconds. The frame then shows both the AP icon (connected) and the Mesh icon (still searching for a connection, and pulsating for hours). Is this a bug, or a feature?

This situation correlates with increased & prolonged CPU usage by sugar-shell, shown by 'top' in Terminal. Even more, multiple repetitions of (disconnect AP, then connect AP) will increase CPU usage by sugar-shell even more, up to around 50%, which makes the XO more sluggish.

I also got into a situation (maybe by double-clicking my Access Point icon) where both AP icon and Mesh icon were both pulsating grey and trying to connect. It sat like that for 2 hours (no change), and 'top' showed 75% CPU usage by sugar-shell. Then with some trial & error clicking it went almost back to normal (only 10% CPU) and AP connected normally.

Logs and screenshots coming. This ticket might help explain #8098.

By the way, normal behaviour for my XO is to be only connected to my AP (and not trying to also connect to Mesh). And sugar-shell normally only uses a few % of CPU. 'Normal' logs also coming.

Now that I understand the behaviour, I can avoid this bug (or feature?). I don't know if it will be a nuisance for other users (G1G1 or schools). It can waste CPU cycles.

Sugar might be wasting CPU cycles while misrepresenting what Network Manager is doing (but I am usually wrong).

Attachments

logs.CSN7500144C.2008-09-20.19-35-08.tar.bz2 (192.9 kB) - added by thomaswamm 6 years ago.
olpc-log for sugar-shell hogging 75% CPU while trying to connect to AP and Mesh
top-cb-with-hog.log (17.3 kB) - added by thomaswamm 6 years ago.
$ top -cb > top-cb-with-hog.log showing 75% CPU for sugar-shell
logs.CSN7500144C.2008-09-20.21-07-24.tar.bz2 (176.3 kB) - added by thomaswamm 6 years ago.
olpc-log baseline for normal behaviour of my XO
top-cb-no-hog.log (19.2 kB) - added by thomaswamm 6 years ago.
$ top -cb > top-cb-no-hog.log for normal good behaviour
tmpGUnE6n.png (36.8 kB) - added by thomaswamm 6 years ago.
Screenshot - bug or feature: looking for School Mesh Portal while already connected to an Access Point
tmpLbZ3Yn.png (113.0 kB) - added by thomaswamm 6 years ago.
Screenshot - sugar-shell hogging 55% CPU
tmpVoTS4K.png (126.0 kB) - added by thomaswamm 6 years ago.
Screenshot - sugar-shell uses 77% CPU for 2 hours while trying to connect to A.P. and mesh
tmpga4Sr6.png (110.2 kB) - added by thomaswamm 6 years ago.
Screenshot - normal performance & behaviour (compare to when hog is seen)
top-bc_45NM_45Sugar.log (25.5 kB) - added by thomaswamm 6 years ago.
output of #top -cb while NM & Sugar together hog 90% of CPU accomplishing nothing.
strace.log.tar.bz2 (36.5 kB) - added by thomaswamm 6 years ago.
$ sudo strace -p <pid of sugar-shell> >& strace.log (run during test case)
tmp0URcyg (1.1 kB) - added by mtd 6 years ago.
patch to fix mesh and AP showing up in frame if mesh was searched for first

Change History

Changed 6 years ago by thomaswamm

olpc-log for sugar-shell hogging 75% CPU while trying to connect to AP and Mesh

Changed 6 years ago by thomaswamm

$ top -cb > top-cb-with-hog.log showing 75% CPU for sugar-shell

Changed 6 years ago by thomaswamm

olpc-log baseline for normal behaviour of my XO

Changed 6 years ago by thomaswamm

$ top -cb > top-cb-no-hog.log for normal good behaviour

  Changed 6 years ago by cjb

  • cc mstone, marco added
  • owner changed from cjb to marco
  • component changed from performance to sugar

Interesting, thanks for the report. It does indeed look like Sugar is having trouble understanding the NetworkManager state. Seems like we should have a timeout on the pulsating, at least..

Changed 6 years ago by thomaswamm

Screenshot - bug or feature: looking for School Mesh Portal while already connected to an Access Point

Changed 6 years ago by thomaswamm

Screenshot - sugar-shell hogging 55% CPU

Changed 6 years ago by thomaswamm

Screenshot - sugar-shell uses 77% CPU for 2 hours while trying to connect to A.P. and mesh

Changed 6 years ago by thomaswamm

Screenshot - normal performance & behaviour (compare to when hog is seen)

Changed 6 years ago by thomaswamm

output of #top -cb while NM & Sugar together hog 90% of CPU accomplishing nothing.

  Changed 6 years ago by thomaswamm

In another variation, after I clicked on my Access Point icon maybe too much, the frame showed persistent grey AP and Mesh icons, and 'top' showed Network Manager and sugar-shell each taking about 45% CPU, forever. XO was extremely sluggish, taking about a minute to respond to some user inputs. For example, a tap of the power button was followed about a minute later by the XO finally going into suspend. Surprisingly, cursor response was good (unlike the normal 'slows'). On resume from suspend, the CPU hogging continued! I eventually used Restart to get back to normal.

The bottom line is, I think, that I should not be childishly mean to my XO by clicking too often on certain icons in Neighborhood view.

Changed 6 years ago by thomaswamm

$ sudo strace -p <pid of sugar-shell> >& strace.log (run during test case)

  Changed 6 years ago by thomaswamm

I attached an strace of sugar-shell which ran for a minute or two while I did the following test:

  1. in Terminal, used $ top -c to find pid of sugar-shell (1222)
  2. in Terminal, run: $ sudo strace -p 1222 >& strace.log
  3. tapped power button to suspend XO (and interrupt wlan)
  4. tapped power button to resume
  5. went to Neighborhood view and clicked on my AP as soon as it showed up
  6. after AP connected, verified in frame that both AP and Mesh icons showed (AP colored, Mesh still grey)
  7. returned to Terminal to ctrl-c the strace

The strace.log is over 1 megabyte long, and incomprehensible by me. I did $ tar -cvjf strace.log.tar.bz2 strace.log

The test resulted in sugar-shell hogging about 15% of CPU. Later I did suspend/resume again, but let the XO automagically refind my AP. Then sugar-shell went down to about 3% CPU.

  Changed 6 years ago by marco

  • cc mtd added

mtd, maybe you could have a look to this one? You are our resident nm expert now :)

  Changed 6 years ago by marco

  • next_action changed from reproduce to diagnose
  • milestone changed from Not Triaged to 8.2.1

  Changed 6 years ago by mtd

Yeah, I see this all the time. My suspicion is that it's basically the same problem as #6944, manifesting itself slightly differently. I believe #6995 will have a fix for it.

  Changed 6 years ago by mtd

  • blockedby 6995 added

  Changed 6 years ago by gregorio

  • keywords relnote added

  Changed 6 years ago by mtd

relnote sucks :(

Attached is a patch that fixes the issue reported. It's better than relnote :). It's not as good as it could be, but that's #6995.

Changed 6 years ago by mtd

patch to fix mesh and AP showing up in frame if mesh was searched for first

follow-up: ↓ 11   Changed 6 years ago by thomaswamm

The symptom of AP and mesh icons both showing in frame was previously reported in #4074, so this might be a regression.

in reply to: ↑ 10   Changed 6 years ago by mtd

  • blockedby 4074, 6944 added; 6995 removed

Replying to thomaswamm:

The symptom of AP and mesh icons both showing in frame was previously reported in #4074, so this might be a regression.

I don't think #4074 was ever really fixed; consider #6944 (and also #5459, #6872).

I'll update #4074. Perhaps we should close this as a dupe.

follow-up: ↓ 13   Changed 6 years ago by gregorio

  • cc gregorio added

Hi Guys,

Is this closed in 765 or does it need to go in the release notes?

If we need a release note, please me write a short user friendly blurb (e.g. when the XO wakes from being suspended it will search for a Mesh connection (e.g. another XO) if that is not found it will try to connect to the last AP that was in use. Please wait until it tries to connect to that AP manually connecting to an AP in the Neighborhood view). Sorry for the nested "e.g."s, hopefully you get the idea.

Thanks,

Greg S

in reply to: ↑ 12   Changed 6 years ago by mtd

Replying to gregorio:

Hi Guys, Is this closed in 765[?]

It's not closed in 765.

...does it need to go in the release notes?

Yes. But as the patch suggester and general person-without-responsibility, can we just fix it? The patch addresses one of the most common symptoms of the problem (as the full problem is certainly beyond the scope of 8.2.0).

If we need a release note, please me write a short user friendly blurb (e.g. when the XO wakes from being suspended it will search for a Mesh connection (e.g. another XO) if that is not found it will try to connect to the last AP that was in use. Please wait until it tries to connect to that AP manually connecting to an AP in the Neighborhood view). Sorry for the nested "e.g."s, hopefully you get the idea.

I get the idea, but who gets to write the blurb?

follow-up: ↓ 15   Changed 6 years ago by gregorio

Hi mtd,

Anyone who knows what this means to end users can write the blurb. I can't tell from the thread what to say. If someone who understand this can get it started I will edit and make it more user friendly before pasting it in the release notes.

Thanks,

Greg S

in reply to: ↑ 14   Changed 6 years ago by mtd

Replying to gregorio:

Anyone who knows what this means to end users can write the blurb.

I think you did a really good job; here's my stab at it (sorry, I did not have enough time to make it shorter, etc.)

Please wait for the network to reconnect when waking up: when the XO wakes from being suspended it will search for a Mesh connection (e.g. another XO) before trying to reconnect to the last AP that was in use. Please wait until it tries to connect to that AP before trying to manually connect to an AP in the Neighborhood view. If you do not and click on the AP icon before the mesh icons stop pulsing, one mesh icon will continue to pulse forever (or until you click on a mesh icon, wait for it and the other mesh icons to stop pulsing, and click on the desired AP again).

Thanks, Greg S

Martin

  Changed 6 years ago by mtd

  • keywords blocks?:8.2.0 r? added

I think it's worth formally asking for a patch review and blocker triage, as this problem will affect all G1G1 people that use an AP and are too impatient to wait 60s after their XO wakes up / starts up. I think that percentage will be high enough to warrant the additional risk of testing this patch, based on the number of bugs this behaviour has elicited (five, by my count, not counting thomaswann's other one).

r? please, and blocks?:8.2.0

  Changed 6 years ago by gregorio

Hi Guys,

Sorry we can't hold the release for this. Thanks for the write up for the release notes.

Definitely want to fix it but it has to be 8.2.1.

Thanks,

Greg S

  Changed 6 years ago by mtd

  • keywords blocks-:8.2.0 added; blocks?:8.2.0 removed

  Changed 6 years ago by thomaswamm

  • blocking 8098 removed

#8098 is unrelated to this bug.

  Changed 5 years ago by marco

  • keywords r? removed

The mesh code on master will have to be rewritten :/ We currently don't support mesh at all.

It's unclear if we will make a 8.2.1 at all at this point, so I'm cancelling review. Please readd r? if/when we decide to make a 8.2.1. Sorry that it turned out this way, but the API changes in NM made a rewrite unavoidable :(

  Changed 5 years ago by mstone-xmlrpc

  • keywords cjbfor9.1.0 added
  • milestone changed from 8.2.1 to 9.1.0

Pushing out to 9.1.0, per edmcnierney's request.

  Changed 3 years ago by dsd

  • cc dsd added
  • status changed from new to closed
  • resolution set to worksforme

Assuming fixed - we've since had rewrites and fixes in this area, and things seem to be working correctly from my perspective. If not, this should be filed on http://bugs.sugarlabs.org

Note: See TracTickets for help on using tickets.