Ticket #12660 (closed defect: fixed)

Opened 16 months ago

Last modified 15 months ago

XO-1.5 time reset problem with year

Reported by: dsd
Owned by: Quozl
Priority: normal
Milestone: 13.2.0
Component: ofw - open firmware
Version: not specified
Keywords:
Cc:
Action Needed: test in build
Verified: no
Deployments affected:
Blocked By:
Blocking:

Description

Testing an XO-1.5 with q3c13:

  1. Shut down, remove all power sources
  2. Remove RTC battery for about a minute, then reconnect again
  3. Boot. Firmware reports year 6500, and boots into the activation initramfs on the basis that the lease expired back in the year 2013 :)

Expected behaviour: the clock is reset to a sane past value and the system boots into the normal runos/runrd environment.

Change History

Changed 16 months ago by Quozl

  • next_action changed from never set to design

The cause of the problem is that the battery dead bit is not reliable; the year 6500 comes from the reset value of the century register, 0xff.

Regarding the battery dead bit:

  • the chipset provides a clock that behaves like a DS1385,
  • the battery dead bit on the DS1385 and the chipset is not latching,
  • in other designs, an external non-rechargeable battery is attached,
  • our XO-1.5 provides both clock battery and main bus power to the clock power input, so at the time we sample the battery dead bit the clock power input is good,
  • therefore the battery dead bit is not useful for detecting loss of clock data.

An XO-1.5 with no clock battery was tested over 200 power cycles, with power-off times varying from 1 to 64 seconds and the clock registers captured at each power up.

Several behaviours were found to correlate to power loss:

  • the date and time reset to 01-01-01 01:01:01 and continue to count (after 5 seconds and before 22 seconds without power),
  • the date and time reset to 00-00-00 00:00:00 and may continue to count; the divider configuration clears, the interrupt configuration may clear, and the 24/12 hour format register bit clears (after 22 seconds without power),
  • the day of week register may be set to 0x10 (after 26 seconds without power).

It should be possible to detect power loss based on these behaviours.
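As a rough illustration of that idea, the sketch below tags a register snapshot with the power-loss signatures listed above. It is not the code from the firmware; the function name, the tag names, and the dictionary keys are hypothetical. Because the clock may keep counting after the reset, the sketch assumes that only the year, month, day of month and hours fields are compared, and leaves minutes and seconds out.

```python
# Hypothetical sketch, not the firmware's implementation.
# `regs` maps assumed register names to the raw byte read at power up.

# Fields compared against the reset patterns. Minutes and seconds are left
# out on the assumption that the clock may have counted since the reset.
DATE_FIELDS = ("year", "month", "day_of_month", "hours")


def power_loss_signatures(regs):
    """Return a list of tags for power-loss signatures found in regs."""
    tags = []
    date = [regs[name] for name in DATE_FIELDS]
    if all(value == 0x01 for value in date):
        tags.append("reset-to-ones")    # 01-01-01 01:xx:xx pattern
    if all(value == 0x00 for value in date):
        tags.append("reset-to-zeros")   # 00-00-00 00:xx:xx pattern
    if regs["day_of_week"] == 0x10:
        tags.append("dow-0x10")         # day of week register set to 0x10
    return tags
```

A genuine clock setting whose year, month, day and hour registers are all 01 would match the first pattern too, so a check of this kind trades a small false-positive window for detection.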

Changed 16 months ago by Quozl

Tested XO-1:

  • the chipset provides a clock that behaves like a DS1385,
  • the battery dead bit on the chipset is latching,
  • Open Firmware says at boot "The time has not been set since the real-time clock battery was replaced".

Therefore this is not a problem with XO-1.

Changed 16 months ago by Quozl

  • next_action changed from design to package

The test was extended to several thousand cycles and the results correlated.

Fixed as much as possible in svn 3649. This should prevent the year 6500 symptom and several others, but not all possible symptoms. A brick-tested build is q3c13jb.rom.

The XO-1.5 cannot detect loss of RTC data using the RTC itself. RTC anti-rollback should be deployed in situations where RTC data is critical.

Changed 16 months ago by dsd

Would it be sensible/possible to add an "insurance" mechanism: if the RTC data fails the obvious tests (hour between 0 and 23, month between 1 and 12, etc), reset the clock?

Changed 16 months ago by Quozl

A risk with any "insurance" mechanism is false positives. For instance, the day of week register is not set by Open Firmware, but is set by Linux, so if we checked for a valid day of week we might reset the clock on laptops that have not yet had their time set by Linux.

The tests you mention are in svn 3649 already.
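For readers who want a concrete picture of such range tests, here is a minimal sketch. It is not the svn 3649 code; the names are hypothetical, and it assumes the registers hold BCD values in 24-hour mode. The day of week register is deliberately left out, for the reason given above.

```python
# Hypothetical sketch of range tests on RTC registers, not the firmware code.
# Assumes BCD encoding and 24-hour mode.

def bcd_to_int(value):
    """Decode one BCD byte, or return None if either nibble is above 9."""
    high, low = value >> 4, value & 0x0F
    if high > 9 or low > 9:
        return None
    return high * 10 + low


# Inclusive limits per field. Day of week is not checked.
LIMITS = {
    "seconds": (0, 59),
    "minutes": (0, 59),
    "hours": (0, 23),
    "day_of_month": (1, 31),
    "month": (1, 12),
    "year": (0, 99),
}


def failed_fields(regs):
    """Return the names of fields that are not valid BCD or are out of range."""
    bad = []
    for name, (low, high) in LIMITS.items():
        number = bcd_to_int(regs[name])
        if number is None or not low <= number <= high:
            bad.append(name)
    return bad
```

An empty result only means the registers look like a date and time; it says nothing about whether that date and time are the ones that were last set.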

The remaining undetectable corruption is possibly due to the shadow register copy triggering when voltage is nearly gone. It causes values from the least significant registers to be copied verbatim into the most significant registers. When the bit pattern is indistinguishable from a valid date and time, the code cannot detect the corruption. To illustrate, here are the summary results of my test "run-d": http://dev.laptop.org/~quozl/z/1UWwNt.txt

The columns are:

  • duration of power down, in seconds,
  • the time_t of the test in my raw logs,
  • the time taken to boot, in seconds,
  • the first 16 register values in the RTC, after the amnesia check is done, (seconds, alarm seconds, minutes, alarm minutes, hours, alarm hours, day of week, day of month, month, year, register a, b, c, d, first two SRAM locations),
  • a series of analysis tags; reinit means the amnesia test triggered and the date and time were reset, bad-dt means the date and time are not valid (they were set to a specific value unrelated to the time_t), and bad-sram means the bit pattern in the first two SRAM locations had rotted.
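When reading the register column, it can help to attach names to the 16 values in the order given above. The sketch below does only that; the function name and the key names are hypothetical, and it takes the values as a list of integers because the exact text layout of the log is not assumed here.

```python
# Hypothetical helper: label the first 16 RTC register values in the order
# described above. Parsing of the log file itself is not attempted.

REGISTER_NAMES = (
    "seconds", "alarm_seconds", "minutes", "alarm_minutes",
    "hours", "alarm_hours", "day_of_week", "day_of_month",
    "month", "year", "register_a", "register_b",
    "register_c", "register_d", "sram_0", "sram_1",
)


def label_registers(values):
    """Map a sequence of 16 register bytes to a name -> value dictionary."""
    if len(values) != len(REGISTER_NAMES):
        raise ValueError("expected %d register values" % len(REGISTER_NAMES))
    return dict(zip(REGISTER_NAMES, values))
```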

Observe how the duration of the power down affects the corruption. This test was without a battery, but the same sort of thing is expected to occur with a discharged battery, over a longer time period.

Changed 16 months ago by Quozl

  • next_action changed from package to add to build

Fixed in Q3C14.

Changed 16 months ago by dsd

  • next_action changed from add to build to test in build

Test in 13.2.0 build 5

Changed 15 months ago by dsd

  • status changed from new to closed
  • resolution set to fixed

Works in 13.2.0 build 8 on XO-1.5 with the shipped firmware: repeating the original test, the system boots with the date reset to 01/01/2013.
