Opened 3 years ago

Closed 3 years ago

#11054 closed defect (fixed)

EC SCI mask not written correctly when going into suspend

Reported by: dsd Owned by: pgf
Priority: normal Milestone: Future Release
Component: embedded controller Version: not specified
Keywords: Cc: rsmith
Blocked By: Blocking:
Deployments affected: Action Needed: never set
Verified: no

Description

In the XO power management code that has gone upstream, we listen for all EC events when the system is running, but we only listen for EC events related to devices that have wakeups requested on them when going into suspend.

For example, during regular system operation we write a SCI mask of EC_SCI_SRC_ALL (0x1ff), but when going into suspend, if the only device that has wakeups requested is the keyboard/mouse, we write a SCI mask of EC_SCI_SRC_GAME (0x01) only.

I am finding that when the suspend SCI mask is written late during the suspend process, it causes events to be ignored that shouldn't be. It is as if the mask is being programmed incorrectly, or is being programmed as all-zeroes. The system simply doesn't wake up when the events occur. However, if I run another EC command (a dummy one) immediately after, the mask is programmed correctly. Does this highlight a strange timing condition?

Note that under this new infrastructure, the EC mask is programmed *really* late when we go into suspend.

Here is how you can reproduce this. Sorry that it is a bit long winded right now.

Once booted, request that keyboard/mouse wakeups are enabled, and suspend:

 echo -n enabled > /sys/devices/platform/i8042/power/wakeup                      
 echo mem > /sys/power/state

Try to wake the system with keyboard/mouse - it won't wake up, but it should have done.

Now apply this patch: http://dev.laptop.org/~dsd/20110709/0001-run-a-dummy-command-after-setting-wakeup-mask.patch

Repeat the test, and you'll find that keyboard/mouse wakeups work fine.

Attachments (1)

wait_for_ec_data.patch (1.1 KB) - added by pgf 3 years ago.

Download all attachments as: .zip

Change History (12)

comment:1 Changed 3 years ago by dsd

reproduced with Q3B13 EC Firmware Ver:2.2.60 on XO-1.5

comment:2 Changed 3 years ago by dsd

Seems to be timing-related.
Instead of the dummy command, inserting mdelay(5) appears to fix the problem 100% of the time.
mdelay(1) appears to fix the problem about 25% of the time.

Changed 3 years ago by pgf

comment:3 Changed 3 years ago by pgf

it seems that when transferring argument bytes after a command opcode, we've never waited for the last byte to be transferred. the attached patch causes the kernel to wait for the EC to accept all bytes before continuing.

comment:4 Changed 3 years ago by pgf

by the way, the EC serial output was practically enough to debug this (with access to the source, of course ;-). for the record:

...
483520488:MAIN_ON

483520978:-22,
483520995:-2f,
483521015:PwrUp Done
483522098:-40,
483522104:-47,
483522110:-4c,
483522125:-4e,
483522155:-50,
483522169:-53,
483522228:Battery Poll
483529942:-ff,
SCIMask=1ff
483530714:-00,
483764181:-f3,
SCIMask=00
483764839:S068IBF
483765388:S068IBF
483765932:S068IBF
483766480:S068IBF
483767027:S068IBF
483767571:S068IBF
483768121:S068IBF
483768670:S068IBF
483769218:S068IBF
...

the EC was continuously timing out waiting for the last data byte, until the system resumed and that command was aborted.

comment:5 Changed 3 years ago by dsd

Could this historic relic be of relevance? found in olpc_fixup_sleep() in olpc-pm.c

	/* 
	 * Cmd 0x32 tells the EC that we're going into suspend; this was
	 * added to work around hardware races related to SCI events.  This
	 * should cause the EC to inhibit further SCIs while MAIN_ON is
	 * transitioning low.
	 *
	 * There's also some sort of EC race whereby the EC gets its
	 * IBF/OBF flags confused and all future communication (after
	 * resuming) fails if we suspend too soon after updating
	 * the EC SCI mask.  Having this command after updating the
	 * SCI mask allows the EC enough time to finish doing what it's
	 * doing.
	 */
	return olpc_ec_cmd(EC_SET_SCI_INHIBIT, NULL, 0, NULL, 0);

comment:6 Changed 3 years ago by pgf

interesting observation. it's possible that they're related, though the current symptom clears up on resume. but that could be for some other reason.

comment:7 Changed 3 years ago by dsd

  • Milestone changed from Not Triaged to Future Release

comment:8 follow-up: Changed 3 years ago by dsd

As requested by Richard, here is a bzImage that reproduces the issue: http://dev.laptop.org/~dsd/20110726/bzImage

You will also need to use a hacked initramfs that doesn't use /ofw. Here is mine: http://dev.laptop.org/~dsd/20110221/irfs.img

and my recipe for booting all this off USB: http://wiki.laptop.org/go/Kernel#Boot_your_development_kernel_from_USB

Let me know if you need any help getting it going.

Once running, the test case is:

 echo -n enabled > /sys/devices/platform/i8042/power/wakeup                      
 echo mem > /sys/power/state

Now press a key. This should wake up the system, but (before Paul's patch) doesn't.

comment:9 in reply to: ↑ 8 Changed 3 years ago by rsmith

Replying to dsd:

As requested by Richard, here is a bzImage that reproduces the issue: http://dev.laptop.org/~dsd/20110726/bzImage

I did further research into this issue because the previous explanation did not match what I see in the posted EC log. The EC log shows 'SCIMask=00'. This means that the EC indeed received 2 bytes. The S068IBF message that occur after that line occurs because the ibf flag is stuck. This is something that has always happened on suspend.

The core problem is not a missing 2nd data byte but the fact that its zero rather than 01. After debugging further what I believe is happening is that we don't allow enough time to pass between writing the last byte and when the bus power turns off for suspend. Suspend drops the power to the LPC bus. So the EC reads the value but its zero rather than 01 and the LPC bridge goes into a wacky mode until power is restored on a resume.

This is is supported by the fact that you can also make things work by adding a mdelay() call after the call to olpc_ec_cmd().

Paul's patch is exactly the right fix. With the patch we now wait for the ibf flag to clear which means that we don't turn off the power to the bus anymore while the EC is in the process of reading the byte. ibf is cleared after the byte has been read. This not only fixes the 2nd byte zero problem but also fixes the infinite ibf flag that used to occur when going into suspend.

comment:10 Changed 3 years ago by pgf

sorry you had to spend more time on this -- my previous explanation was probably ambiguous. i should have explained the fix more fully. (i.e., the power-down timing issue, and that i had verified that the patch cured the wait-for-IBF spew.)

comment:11 Changed 3 years ago by dsd

  • Resolution set to fixed
  • Status changed from new to closed

thanks, this is applied upstream

Note: See TracTickets for help on using tickets.