Opened 2 years ago

Closed 23 months ago

#12453 closed defect (fixed)

[CL4] System randomly hangs up at big 01.

Reported by: tomyin
Owned by: cjb
Priority: blocker
Milestone: 13.1.0
Component: not assigned
Version: not specified
Keywords:
Cc: pgf, wad, wmb@…, dsd
Blocked By:
Blocking:
Deployments affected:
Action Needed: no action
Verified: no

Description

OS: 31022o4
OFW: Q7B10
EC: 0.3.07
The system randomly hangs and displays a big 01.
Procedure:

  1. Update to q7b11 and reboot.
  2. At the ok prompt, enter “update-nn-flash”, then reboot.
  3. Move the cursor to Paint using the touchpad.
  4. Open Paint via the touchscreen ==> the system hangs, then a big 01 appears.

No logs could be captured.

Attachments (1)

big01.jpg (592.5 KB) - added by tomyin 2 years ago.
big01


Change History (13)

Changed 2 years ago by tomyin

big01

comment:1 Changed 2 years ago by tomyin

1.1 Go to Sugar.
1.2 Operate the machine normally.
1.3 It randomly hangs and displays a big 01.

2.1 Go to Sugar and leave the machine idle until it enters suspend mode.
2.2 Wake the machine via the touchpad after the panel turns off. ==> the system hangs, then 01 appears

comment:2 Changed 2 years ago by dsd

  • Action Needed changed from never set to diagnose
  • Cc pgf wad wmb@… added
  • Milestone changed from Not Triaged to 13.1.0
  • Owner set to cjb

I've seen this a handful of times as well, and have seen it mentioned on IRC. Can't see an existing ticket for it though. Some more reports/logs are in #12458.

comment:3 Changed 2 years ago by dsd

Walter in #12471 has managed to reproduce it quite easily in Browse, even with power management disabled.

comment:4 Changed 2 years ago by dsd

#12433 suggests that various mmc errors are printed at the time of crash. #12486 suggests that opening the sugar frame by moving the mouse to a hot corner may be a likely way to trigger the issue.

comment:5 Changed 2 years ago by wmb@…

The reproduction recipe in #12486 stopped failing after gonzalo opened up the machine and connected a serial port. He said that it is the first time that he has removed the battery in a long time, but that he has updated the EC code since the last battery removal.

comment:6 Changed 2 years ago by walter

Got this twice tonight (once while in Measure, and once while trying to access the Journal from the Home View). Nothing of interest in the logs.

comment:7 Changed 2 years ago by dsd

  • Cc dsd added
  • Priority changed from normal to blocker

This will block XO-4 production if it affects runin. Even if it doesn't, it should still be treated with importance.

comment:8 Changed 2 years ago by wmb@…

The problem has been tracked down to corruption of CForth's interrupt stack, specifically the saved PC value. Moving the interrupt stack from SRAM to TCM works around the problem. The root cause of the corruption is as yet undetermined. It could be hardware, or a bug in the CForth interrupt handling code, or a bad setting for the suspend/resume parameters relating to SRAM, or a driver bug that causes writing to some SRAM locations that are not owned by the driver.
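
To make the failure mode in the comment above concrete, here is a toy Python model of an interrupt stack whose saved PC slot is overwritten before the interrupt returns. It is only an illustrative sketch, not CForth's code: the addresses, the `InterruptStack` class, and the `pc_is_plausible` check (word alignment plus a code-range test) are all assumptions made for this example. It does not model SRAM versus TCM; it only shows how a stray write to the saved PC could be detected before it is restored.

```python
"""Toy model of an interrupt stack whose saved PC gets overwritten."""

# All addresses below are hypothetical, chosen only for illustration.
CODE_BASE = 0x0000_0000
CODE_SIZE = 0x0002_0000      # pretend the firmware code occupies 128 KiB
STACK_TOP = 0xD100_0400      # pretend top of the interrupt stack
WORD = 4                     # assumed word size and PC alignment, in bytes


def pc_is_plausible(pc):
    """Assumed sanity check: PC is word-aligned and inside the code region."""
    return pc % WORD == 0 and CODE_BASE <= pc < CODE_BASE + CODE_SIZE


class InterruptStack:
    """Descending stack stored as a dict of address -> 32-bit word."""

    def __init__(self, top=STACK_TOP):
        self.mem = {}
        self.sp = top

    def push(self, value):
        self.sp -= WORD
        self.mem[self.sp] = value & 0xFFFF_FFFF

    def pop(self):
        value = self.mem[self.sp]
        self.sp += WORD
        return value


def take_interrupt(stack, pc):
    """Save the interrupted PC and return the address of the saved slot."""
    stack.push(pc)
    return stack.sp


def return_from_interrupt(stack):
    """Restore the saved PC, refusing to resume at an implausible address."""
    pc = stack.pop()
    if not pc_is_plausible(pc):
        raise RuntimeError(f"saved PC corrupted: {pc:#010x}")
    return pc


def demo():
    """Run one clean interrupt and one with a simulated stray write."""
    results = []

    # Clean case: the saved PC comes back unchanged.
    stack = InterruptStack()
    take_interrupt(stack, 0x1234)
    results.append(return_from_interrupt(stack))

    # Corrupted case: something else writes to the saved-PC slot.
    stack = InterruptStack()
    slot = take_interrupt(stack, 0x1234)
    stack.mem[slot] = 0xDEADBEEF  # simulated stray write
    try:
        return_from_interrupt(stack)
    except RuntimeError as exc:
        results.append(str(exc))
    return results


if __name__ == "__main__":
    for item in demo():
        print(item)
```

Without a check like this, the processor would simply resume at whatever address the slot holds, which is consistent with a hang that leaves nothing in the logs.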

comment:9 Changed 2 years ago by wmb@…

The workaround is encoded in CForth git commit aeea08d.

comment:10 Changed 2 years ago by dsd

...and released in XO-4 firmware Q7B12.

comment:11 Changed 23 months ago by dsd

  • Action Needed changed from diagnose to test in build

This workaround can be tested in 13.1.0 build 27.

comment:12 Changed 23 months ago by greenfeld

  • Action Needed changed from test in build to no action
  • Resolution set to fixed
  • Status changed from new to closed

I have not seen any 01 failure reboots over the past few days with Q7B12 (os25/26) & Q7B14 (os27).
