Ticket #11600 (closed defect: fixed)

Opened 2 years ago

Last modified 2 years ago

Hang w. OS26

Reported by: wad Owned by:
Priority: blocker Milestone: 1.75-software
Component: kernel Version: Development build as of this date
Keywords: XO-1.75, suspend Cc: martin.langhoff
Action Needed: no action Verified: no
Deployments affected: Blocked By:
Blocking:

Description

In upgrading laptops to use OS26, I had two laptops (ramp SKU203 and ramp SKU204) hang after the first boot to sugar. Both laptops were running Q4D02 and os25. They were fs-updated to os26.zd4, their TS tag changed to SHIP, and booted. Both were given the name "test" and a color was selected. My next step was to select a wifi AP --- I believe I did this on both laptops --- and then return them to the sugar home screen.

The next time I looked at them, the mouse was unresponsive. There was also no response to the keyboard. As aggressive suspend/resume is enabled on these laptops, I believe they were suspended and hung on resume when woken up. The power LED is on.

Connecting up a serial console, neither laptop responded to a BREAK BREAK <CR>.

After rebooting the laptops, I cannot reproduce the problem. Reinstalling OS26 on them doesn't reproduce it either. I have been installing OS26 on a number of other laptops, and haven't seen a repeat.

Change History

  Changed 2 years ago by Quozl

  • priority changed from normal to high
  • next_action changed from reproduce to diagnose
  • milestone changed from Not Triaged to 1.75-software

Attempted to reproduce. Installed os26 on B1 B1 B4 C1 SKU200 SKU201 SKU202 C2 SKU203 SKU204, following the sequence above. Then the units were allowed to suspend and were woken with the touchpad. One unit, the C1 SKU201, reproduced the symptom on the third touchpad wakeup, with the power LED on.

  Changed 2 years ago by martin.langhoff

  • component changed from not assigned to kernel

  Changed 2 years ago by martin.langhoff

  • cc martin.langhoff added
  • priority changed from high to blocker

This is present on OS27, and OS27 with a newer kernel (4a6b24ff528b7767a574f5ec8d2d09a5f7da0fbd, same as OS28). My steps to repro seem to be: boot, open a Sugar Terminal session, let it idle.

  • Seems to affect units when aggressive suspend kicks in, driven by powerd
  • When frozen, power LED is on, solid
  • Units were associated to an AP, using WPA2
  • echo 0 > /sys/power/pm_async changes the behaviour slightly: camera light is on when the unit is frozen
  • Unclear whether it happens on the way down, or on the way up
  • Attempts to repro with WOL wakeups (using ping) didn't seem to trigger it

Sam has more notes on this, I believe.

  Changed 2 years ago by martin.langhoff

Playing with OS28,

  • Two sessions of playing around with Record ended up with a frozen unit. One froze when closing Record (after 3 successful video captures), one froze when stopping recording. In both cases, the camera light is on.
  • Two sessions with siv120d, ov7670 and mmp-camera blacklisted, using activities that don't involve the camera (Speak, Browse, Terminal), did not see freezes. Machine was allowed to idle into suspend.

Unit was associated to an AP w WPA2, no other XOs on the network.

  Changed 2 years ago by greenfeld

Replying to martin.langhoff:

* echo 0 > /sys/power/pm_async changes the behaviour slightly: camera light is on when the unit is frozen

This could just be changing where the hang occurs in the resume cycle with #11644, which flashes the camera LED briefly during a successful resume cycle.

Hangs at the end of the recording cycle in Record have been previously seen but they only hung Record, not the rest of the OS. These were somewhat rare though.

I have set up a B1 with a simple xv pipeline to try various ways of repeatedly turning the camera off and on, in an attempt to see if there is a minimal way to reproduce this.

  Changed 2 years ago by martin.langhoff

Sam - I agree, that's why I found it interesting.

In any case, I've blacklisted siv120d and haven't seen a hang yet.

follow-up: ↓ 10   Changed 2 years ago by wad

Hate to burst a bubble here, but the camera hasn't been in use in the 11600 crashes I've seen.

follow-up: ↓ 9   Changed 2 years ago by greenfeld

I agree it doesn't seem necessary to use the camera to cause a hang, but it might be an aggravating factor.

  • Using a script to run a 5 second on/off dortc loop after launching an X Window for "gst-launch v4l2src ! xvimagesink &" in the foreground hung both times I ran it, within the first hour after starting the test. The hang occurred with the power LED on but the camera LED off. I did not enable serial or the watchdog, or blacklist any modules, while doing this. The configuration should have been like a freshly imaged os28.
  • Running the camera for a few hours with the above command without a dortc loop did not cause a hang, or visibly cause the XO to suspend.
  • Running the gst-launch command for 5 seconds and then killing it off for 5 seconds did not cause a hang when I tested it for an hour, but powerd didn't see this as enough activity and put the XO to sleep. Given we watch camera activity now this might be a bug on its own, but I used olpc-nosleep to work around it.
  • Running the same dortc loop without the camera running hung once so far, but took a few hours to reach that point. The dortc-loop test hung shortly after printing for a few cycles that the rtc device was already in use. I am now using olpc-nosleep to verify that powerd doesn't kick in and conflict with the script, but what else uses the rtc device?

All scripts were run from the Terminal activity in Sugar as the root user on a B1 HS unit. The XO was then left untouched since switching views in Sugar risked putting sugar-session into a CPU loop until the gst-launch window was closed (which probably should be a ticket on its own).
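The test scripts themselves are not attached to this ticket. As a rough illustration only, a timed suspend/resume loop of this kind might look like the sketch below. It assumes util-linux `rtcwake` is installed and stands in for the dortc script, which is not shown here; the function names, the `DRY_RUN` switch, and the reading of "5 second on/off" as 5 seconds asleep and 5 seconds awake are all hypothetical.

```shell
#!/bin/sh
# Hypothetical stand-in for the "dortc loop" described above.
# Assumes util-linux rtcwake; the real dortc script is not shown in this ticket.
# Set DRY_RUN=1 to print the suspend command instead of running it.

suspend_cycle() {
    # $1 = seconds to stay suspended, $2 = seconds to stay awake afterwards
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "rtcwake -m mem -s $1"
    else
        # Suspend to RAM and let the RTC alarm wake the machine.
        rtcwake -m mem -s "$1" || return 1
        sleep "$2"
    fi
}

run_loop() {
    # $1 = number of cycles, $2 = seconds asleep, $3 = seconds awake
    i=0
    while [ "$i" -lt "$1" ]; do
        suspend_cycle "$2" "$3" || { echo "cycle $i failed" >&2; return 1; }
        i=$((i + 1))
    done
    echo "completed $i cycles"
}
```

Run as root with something like `run_loop 100 5 5`. If the RTC device is already in use (for example by powerd), rtcwake would be expected to fail and the loop stops at that cycle rather than continuing.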

in reply to: ↑ 8   Changed 2 years ago by Quozl

Replying to greenfeld:

I agree it doesn't seem necessary to use the camera to cause a hang, but it might be an aggravating factor.

Yes, it seems to correlate, without being a cause. Much like how the serial port was identified as a correlate of hangs in earlier testing.

in reply to: ↑ 7   Changed 2 years ago by erikos

Replying to wad:

Hate to burst a bubble here, but the camera hasn't been in use in the 11600 crashes I've seen.

I can confirm that. My 1.75 hung here with os27 with Paint open and the machine left idle for a moment. No activity which accesses media input devices was involved.

  Changed 2 years ago by martin.langhoff

No bubble to burst -- nobody's claiming that _using_ the camera triggers the hangs.

The evidence I have in hand right now seems to incriminate the camera driver. I think that the issue is that both siv120d and ov7670 are loaded.

Please help me explore this by blacklisting siv120d with: echo blacklist siv120d > /etc/modprobe.d/suspicious_siv120d.conf ; reboot

If this leads to a noticeably more stable user experience, the short-term fix is clear (blacklist); the right fix is less clear to me.
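For anyone testing this, it may help to confirm after the reboot that the module really stayed unloaded. A minimal sketch, assuming the standard /proc/modules layout (module name first on each line); the helper names and the generated file name are hypothetical:

```shell
#!/bin/sh
# Hypothetical helpers for the blacklist experiment above.

blacklist_module() {
    # $1 = module name, $2 = config directory (defaults to /etc/modprobe.d)
    # Writes a one-line modprobe config; takes effect at the next boot.
    echo "blacklist $1" > "${2:-/etc/modprobe.d}/blacklist_$1.conf"
}

module_loaded() {
    # $1 = module name as the kernel lists it (underscores, not hyphens)
    # $2 = listing file (defaults to /proc/modules)
    grep -q "^$1 " "${2:-/proc/modules}"
}
```

After rebooting, `module_loaded siv120d && echo "still loaded"` gives a quick check. Note that a blacklist entry only stops automatic loading by alias; something that requests the module by name can still load it.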

  Changed 2 years ago by martin.langhoff

The S/R + WOL crash, which may also be a player in the cases discussed here, is documented and tracked at #11658.

  Changed 2 years ago by martin.langhoff

  • next_action changed from diagnose to test in build

This seems to clearly be #11658 -- if we see no S/R crashes in OS30, time to close.

  Changed 2 years ago by greenfeld

  • status changed from new to closed
  • next_action changed from test in build to no action
  • resolution set to fixed

I personally have not encountered any hangs due to WOL while testing using an XO-1.75 on various networks. These networks ranged from quiet ones to those which woke up the XO almost as soon as it went to sleep.

However, there may still be a data race, reported by some users, which may lead to WOL hangs, and this may still need to be resolved.

Tested with 11.3.1 os30.
