Ticket #9779 (closed defect: fixed)

Opened 5 years ago

Last modified 4 years ago

Wakeup key is repeated

Reported by: cjb Owned by: rsmith
Priority: blocker Milestone: 10.1.0
Component: kernel Version: not specified
Keywords: Cc: dsaxena, pgf, sascha_silbe, rgs, crodas
Action Needed: no action Verified: no
Deployments affected: Blocked By:
Blocking:

Description

To reproduce:

Let OHM suspend you on the tty, type "ls" quickly to wakeup

Instead of getting ls, we get "lllllll[repeated]". We had this on XO1, too. Richard says he thinks we fixed it with a kernel patch.

Attachments

i8042.c.patch (0.6 kB) - added by pgf 5 years ago.
preliminary patch for i8042 resume()
annotated-log.txt (25.6 kB) - added by pgf 5 years ago.

Change History

  Changed 5 years ago by cjb

Also reproducible by resuming using one of the game keys in X; it will repeat.

  Changed 5 years ago by cjb

#2401 is the old XO1 bug, fixed with an EC change.

  Changed 5 years ago by cjb

Other modern bugs that reference the same effect:

#9777 #9105 #9690

  Changed 5 years ago by cjb

  • cc dsaxena added; rsmith removed
  • owner changed from dsaxena to rsmith

Richard's gonna take a look at the EC keycodes.

  Changed 5 years ago by cjb

  • cc pgf added

Richard says the keyboard controller is being reset on resume, and shouldn't be.

  Changed 5 years ago by Quozl

  • milestone changed from 1.5-software-final to 1.5-software-beta

retriage.

  Changed 5 years ago by dsaxena

Spent some time looking at this and AFAICT, while there has been some refactoring, the suspend/resume paths in the i8042.c, atkbd.c, and serio.c drivers have not changed in behavior. This is difficult to reproduce with the given test of typing 'ls' "really quickly" but I've been able to reproduce via the method in #9777 by running pointing browse to a long webpage and then waking up via the up or down arrow key. As per #9777, the scroll browser will just continously scroll until I press another key. This is not always reproduce-able either but it is more consistent then the 'ls' method as it only requires one rapid keypress.

The only clue I've noticed so far is that when we see this issue, the keyboard device is redetected,the old device goes away, and a new input device is created:

[  142.102501] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input7

One difference between XO-1 and XO-1.5 I noticed from the 802 build is that we're using the uinput driver as part of olpc-kbdshim and I'm wondering about is the possibility of something in that code path causing the reconnect/reset.

  Changed 5 years ago by pgf

the redetection message happens, sometimes, but not all the time.

new info: there is a very high correlation between enabling console printk messages and the problem going away. with no console dmesg output, i can reproduce it almost every time using the 'type a character at a VT screen' method. with full console printk, i almost can't make it happen.

  Changed 5 years ago by pgf

here's the EC log from a failure case:

523860:-F3,
523951:MO L
<suspend, then type 'a' on the keyboard>
Rk0C,U2:K1E,Keyboard wake
535859:SCI:01
535862:MO H
535879:-10,
535894:-1D,
RkF0,Rk0C,535959:-B3,

I60,D70,
I60,D52,
I60,D43,U2:K9E,VF2,RkFA,RkAB,Rk41,U2:KFA,U2:KAB,U2:K41,VED,RkFA,U2:KFA,V00,RkFA,U2:KFA,VF3,RkFA,U2:KFA,V00,RkFA,U2:KFA,VF4,RkFA,U2:KFA,VF3,RkFA,U2:KFA,V00,RkFA,U2:KFA,VED,RkFA,U2:KFA,V00,RkFA,U2:KFA,
ID4,VF2,RpFA,
M1
M2Rp00,
M1
M2U4:
PFA,U4:P00,
ID4,VF6,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V0A,RpFA,
M1
M2U4:PFA,
ID4,VE8,RpFA,
M1
M2U4:PFA,
ID4,V00,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V14,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V3C,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V28,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V14,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V14,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V3C,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V28,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V14,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V14,RpFA,
M1
M2U4:PFA,
ID4,VF2,RpFA,
M1
M2Rp00,
M1
M2U4:PFA,U4:P00,
ID4,VE8,RpFA,
M1
M2U4:PFA,
ID4,V00,RpFA,
M1
M2U4:PFA,
ID4,VE8,RpFA,
M1
M2U4:PFA,
ID4,V00,RpFA,
M1
M2U4:PFA,
ID4,VE8,RpFA,
M1
M2U4:PFA,
ID4,V00,RpFA,
M1
M2U4:PFA,
ID4,VE8,RpFA,
M1
M2U4:PFA,
ID4,V00,RpFA,
M1
M2U4:PFA,
ID4,VE9,RpFA,
M1
M2Rp02,
M1
M2Rp47,
M1
M2Rp16,
M1
M2U4:PFA,U4:P02,U4:P47,U4:P16,
ID4,VE6,RpFA,
M1
M2U4:PFA,
ID4,VE8,RpFA,
M1
M2U4:PFA,
ID4,V00,RpFA,
M1
M2U4:PFA,
ID4,VE8,RpFA,
M1
M2U4:PFA,
ID4,V00,RpFA,
M1
M2U4:PFA,
ID4,VE8,RpFA,
M1
M2U4:PFA,
ID4,V00,RpFA,
M1
M2U4:PFA,
ID4,VE8,RpFA,
M1
M2U4:PFA,
ID4,V00,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V14,RpFA,
M1
M2U4:PFA,
ID4,VF6,RpFA,
M1
M2U4:PFA,
ID4,VFF,RpFA,
M1
M2U4:PFA,RpAA,
M1
M2Rp00,
M1
M2U4:PAA,U4:P00,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,VC8,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V64,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V50,RpFA,
M1
M2U4:PFA,
ID4,VF2,RpFA,
M1
M2Rp00,
M1
M2U4:PFA,U4:P00,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,VC8,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,VC8,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V50,RpFA,
M1
M2U4:PFA,
ID4,VF2,RpFA,
M1
M2Rp00,
M1
M2U4:PFA,U4:P00,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,VC8,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V64,RpFA,
M1
M2U4:PFA,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V50,RpFA,
M1
M2U4:PFA,
ID4,VF2,RpFA,
M1
M2Rp00,
M1
M2U4:PFA,U4:P00,
ID4,VFF,RpFA,
M1
M2U4:PFA,RpAA,
M1
M2Rp00,
M1
M2U4:PAA,U4:P00,
ID4,VF3,RpFA,
M1
M2U4:PFA,
ID4,V64,RpFA,
M1
M2U4:PFA,
ID4,VE8,RpFA,
M1
M2U4:PFA,
ID4,V03,RpFA,
M1
M2U4:PFA,
ID4,VE6,RpFA,
M1
M2U4:PFA,
ID4,VF4,RpFA,
M1
M2U4:PFA,<typed 'b' here to stop the repeats>Rk20,U2:
K30,RkF0,Rk20,U2:KB0,

  Changed 5 years ago by rsmith

I don't believe this is an EC problem.

Enviroment: XO 1-5, B2, OS 51, My latest test EC code and OFW, echo mem > /sys/power/state via serial port, press 'a' key on XO keyboard

Here is the EC log of the fail case with my comments.

235112:MO L		// Main on goes low S3 
Rk0C,Keyboard wake	// Press 'a' key (Make 0x1e, break 0x9e)
			// Matrix address = 0x0c
238096:SCI:01     	// Wakeup
238099:MO H       	// Come out of S3
K1E,              	// Send make code
238114:-10,       	// Post codes
238130:-1D,       	// Post codes
RkF0,Rk0C,238196:-B3,	// Receive break Matrix address

I60,D70,		// KBC Write CMD byte
			// 0x70 = Xlate, Mouse Dis, KBD Dis
I60,D52,		// 0x52 = Xlate, Mouse En, KBD Dis, Mouse int 
I60,D43,K9E,		// 0x43 = Xlate, Mouse En, KBD En, Mouse int, KBD int
			// Send 0x9e break code.

VF2,RkFA,RkAB,Rk41,KFA,	// Cmd Keyboard ID response 0xab41
KAB,                   	// Send 0xAB
K41,                   	// Send 0x41
VED,RkFA,KFA,          	// Cmd write LED
V00,RkFA,KFA,          	// LED = 0x00
VF3,RkFA,KFA,          	// Cmd Repeat rate
V00,RkFA,KFA,          	// rate = 0x00
VF4,RkFA,KFA,          	// Cmd KBD enable
VF3,RkFA,KFA,          	// Repate rate = 0x00
V00,RkFA,KFA,          
VED,RkFA,KFA,          	// Set LED's = 0x00
V00,RkFA,KFA,          

[The rest of the commands are sent to the touchpad]

Here is the pass case with fewer comments because as you can see they are virtually identical

411334:MO L  
Rk0C,K1E,		// Rx make send Make code   
Keyboard wake
419437:SCI:01
419440:MO H  
419470:-1D,  
RkF0,Rk0C,419539:-A3,	// Rx break 
419542:-B3,          

I60,D70,
I60,D52,
I60,D43,K9E,		// Send make 
VF2,RkFA,RkAB,Rk41,KFA,
KAB,                   
K41,                   
VED,RkFA,KFA,          
V00,RkFA,KFA,          
VF3,RkFA,KFA,          
V00,RkFA,KFA,          
VF4,RkFA,KFA,          
VF3,RkFA,KFA,          
V00,RkFA,KFA,          
VED,RkFA,KFA,          
V00,RkFA,KFA,

In both cases the break code is sent at the same point in the stream. One can argue that sending up the break code in the middle of the commands that Linux is sending is marginal but nothing seems invalid about the above streams.

None of the commands that Linux sends above are necessary. The keyboard is already fully operational at the time the wakeup key is pressed.

Linux however has slightly different view:

Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.492921] libertas_sdio mmc2:0001:1: firmware: requesting sd8686_helper.bin
Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.700114] usb 1-2: reset high speed USB device using ehci_hcd and address 2

  // Cmd byte writes start here:

Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.853402] drivers/input/serio/i8042.c: 60 -> i8042 (command) [169057]
Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.854558] drivers/input/serio/i8042.c: 70 -> i8042 (parameter) [169057]
Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.855456] drivers/input/serio/i8042.c: 60 -> i8042 (command) [169057]
Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.856456] drivers/input/serio/i8042.c: 52 -> i8042 (parameter) [169057]
Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.858184] drivers/input/serio/i8042.c: 60 -> i8042 (command) [169057]
Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.859391] drivers/input/serio/i8042.c: 43 -> i8042 (parameter) [169057]

  // Heres the make code
Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.859409] drivers/input/serio/i8042.c: 1e <- i8042 (interrupt, 0, 0) [169057]
Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.859841] PM: Finishing wakeup.
Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.859848] Restarting tasks ... 
Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.859975] hub 1-0:1.0: over-current change on port 1

  // Start of the ID command.
Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.861120] drivers/input/serio/i8042.c: f2 -> i8042 (kbd-data) [169058]

  // Here's the break code.
Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.861240] drivers/input/serio/i8042.c: 9e <- i8042 (interrupt, 0, 1) [169058]

Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.871708] drivers/input/serio/i8042.c: fa <- i8042 (interrupt, 0, 1) [169059]
Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.872798] drivers/input/serio/i8042.c: ab <- i8042 (interrupt, 0, 1) [169059]
Dec  5 11:40:07 xo-73-23-26 kernel: [ 1693.873682] drivers/input/serio/i8042.c: 41 <- i8042 (interrupt, 0, 1) [169059]

Its not clear to me if Linux is expecting the next byte it gets after sending the 0xf2 command to be a 0x9e break code because the success case in Linux has exactly the same log.

  Changed 5 years ago by pgf

i don't understand the ifdef'ed logic in i8042_suspend()/i8042_resume().

it looks to me like, if we're a recent OLPC board, that we'll skip the calls to i8042_controller_reset() and i8042_controller_check(), but we'll still run almost all of i8042_resume(), from i8042_controller_selftest() onward. perhaps some of this is correct, but it seems like we shouldn't need to reset the controller.

follow-up: ↓ 19   Changed 5 years ago by wmb@…

I'm pretty sure that Paul's finding above is correct.

The following block at line 955 of drivers/input/serio/i8042.c should be moved to line 997, so it will be just before the call to i8042_interrupt()

#ifdef CONFIG_OLPC

}

#endif

The commands that will thus be skipped are unnecessary, time consuming, and actively harmful in our case.

  Changed 5 years ago by wmb@…

Just to be explicit, I believe that the code that needs to be skipped per above absolutely will cause the observed false-repeat symptom, depending on an uncontrollable race condition between the OS issuing the commands and the keyboard sending the break event from the other direction.

The i8042_command() routine does not attempt to distinguish between valid command response bytes and interspersed events from the keyboard. It is in fact possible at the protocol level to make the distinction (the value sets for the two cases are disjoint), but would add some complexity to the command response parsing logic (it would have to do a range check and redirect data events to a queue). According to my reading of i8042_command(), it does not make the distinction, instead just counting N incoming bytes and stuffing them in the param[] array. That being the case, i8042_command() *will* eat keyboard events if they happen during a window of vulnerability.

  Changed 5 years ago by wmb@…

1) The underscores above are an artifact of wiki formatting meeting identifier names prefixed with double underscore, not an attempt at emphasis

2) Even if one were to enhance the command routine to distinguish data from command responses, skipping the commands for resume on OLPC would still be the right thing, as they are redundant and time consuming in our case, since the EC and its internal 8042 remain powered and retain state.

  Changed 5 years ago by wmb@…

I think keyboard/atkbd.c/atkbd_reconnect() is the current culprit, now that pgf has taken card of the i8042_command() level stuff.

  Changed 5 years ago by Quozl

Observed a symptom today on os54 ... is it related?

* suspend due to idle, * wake due to keyboard, * F1, F2, F3 and F4 keys did not respond in Sugar, * Ctrl-Alt-F1 worked fine, * pressed enter to start shell, * typing alphabetic keys ... they displayed in uppercase, as if a shift key was down, or a capslock engaged, * pressing the left shift key did not clear the uppercase effect, * pressing the right shift key cleared the effect.

It was as if the system thought the right shift key was still down. This would be a missing key release event.

  Changed 5 years ago by Quozl

(the lack of meaningful layout in my last reply was caused by mixing wiki styles)

  Changed 5 years ago by cjb

Yes; you probably woke up using the right shift key in particular, which would explain everything.

in reply to: ↑ 12   Changed 5 years ago by pgf

Replying to wmb@firmworks.com:

I'm pretty sure that Paul's finding above is correct. The following block at line 955 of drivers/input/serio/i8042.c should be moved to line 997, so it will be just before the call to i8042_interrupt() #ifdef CONFIG_OLPC } #endif The commands that will thus be skipped are unnecessary, time consuming, and actively harmful in our case.

my somewhat cleaner version of this is attached as i8042.c.patch (it does the same thing, but without the distributed ifdef.)

with this patch applied, i find that it becomes marginally harder to reproduce the symptom. (from maybe 95% reproduceable to 70%. still pretty bad.)

i'll attach an annotated (by mitch) dmesg log describing what we're hitting.

Changed 5 years ago by pgf

preliminary patch for i8042 resume()

Changed 5 years ago by pgf

  Changed 5 years ago by pgf

a little more background, from irc:

05-12-2009 18:28:24 [Mitch_Bradley] pgf: the d4 commands are mouse
   parameters

05-12-2009 18:29:58 [smithbone] yeah. write next command to aux device

05-12-2009 18:37:01 [Mitch_Bradley] ...  Did you see my
   annotations?

05-12-2009 18:37:21 [Mitch_Bradley] starting at line 105

05-12-2009 18:37:56 [Mitch_Bradley] actually line 100 is the smoking gun

05-12-2009 18:38:31 [Mitch_Bradley] but all those keyboard commands from
   89 .. 124 really shouldn't be happening for us

05-12-2009 18:38:51 > pgf: ah.  i hadn't seen them.   looking now.

05-12-2009 18:39:11 [Mitch_Bradley] also we should do the same thing
   for the mouse - suppress the re-setup

05-12-2009 18:53:38 > pgf: currently i'm suspicious of serio_resume() forcing a port reconnect.

05-12-2009 18:53:52 > pgf: i wish i had a better picture of the layers
   involved, and who calls who.

05-12-2009 19:05:07 [Mitch_Bradley] pgf: I think
   keyboard/atkbd.c/atkbd_reconnect() is the current culprit

05-12-2009 19:32:35 > pgf: Mitch_Bradley: likely.  i'm wondering why we
   don't see this on XO-1.

05-12-2009 21:24:35 [Mitch_Bradley] pgf: possibly doesn't happen on XO-1
   because of small timing differences that cause the break event not
   to coincide with the command stuff

05-12-2009 21:29:21 [smithbone] pgf: because on XO resume is slower?

  Changed 5 years ago by dsd

OK. I'm not able to reproduce this but I haven't tried *too* hard.

but anyway, I've taken Paul's patch and also tweaked the serio code (which was *always* reconnecting stuff during resume). Now, according to those i8042 dbg() messages that you guys have been looking at, there is no I/O along those paths during suspend, and during resume there is only:

[  305.730174] drivers/input/serio/i8042.c: e0 <- i8042 (interrupt, 0, 0) [30245]
[  305.730971] drivers/input/serio/i8042.c: 4b <- i8042 (interrupt, 0, 1) [30245]
[  305.735667] drivers/input/serio/i8042.c: e0 <- i8042 (interrupt, 0, 1) [30245]
[  305.737258] drivers/input/serio/i8042.c: cb <- i8042 (interrupt, 0, 1) [30245]

Which i presume corresponds to the key I pressed to wake the system up.

So I'll push this out in the next few mins and you guys (who seem to have more luck reproducing this) can try it out...

  Changed 5 years ago by dsd

  • next_action changed from diagnose to test in build

pushed as 395c555e6f927

  Changed 5 years ago by cjb

test in os55

  Changed 5 years ago by Quozl

  • status changed from new to closed
  • resolution set to fixed

Tested on os55 on two XO-1.5's, ten resumes using keyboard, symptom does not occur. Closing.

follow-up: ↓ 30   Changed 5 years ago by sascha_silbe

  • status changed from closed to reopened
  • next_action changed from test in build to design
  • resolution deleted

Reopening as 395c555e6f927 causes keyboard and mouse to not work at all after resume on XO-1.

  Changed 5 years ago by sascha_silbe

  • cc sascha_silbe added

  Changed 5 years ago by cjb

  • milestone changed from 1.5-software-beta to 1.0-software-update

Changing milestone to 1.0-software-update.

  Changed 5 years ago by rgs

  • cc rgs added

  Changed 5 years ago by crodas

  • cc crodas added

in reply to: ↑ 25   Changed 5 years ago by pgf

  • next_action changed from design to add to build

Replying to sascha_silbe:

Reopening as 395c555e6f927 causes keyboard and mouse to not work at all after resume on XO-1.

i've committed changes which revert the effect of that commit on XO-1. it should now only change behavior on XO-1.5.

(i'm not sure how XO-1 bugs like this will get closed. perhaps you, dear reader, when you next run a new f11-on-xo1 release will be inspired to do so.)

  Changed 5 years ago by sascha_silbe

  • status changed from reopened to closed
  • resolution set to fixed

Fix confirmed (after reverting 14f2fe76fe0b16685b409ab1d2471b9f4c1dc54d due to #9967), thanks!

  Changed 4 years ago by Quozl

  • next_action changed from add to build to no action
  • milestone changed from 1.0-software-update to 10.1.0

Moved milestone; this ticket was not a problem on an XO-1 stable release, the regression was during XO-1 development alone.

Note: See TracTickets for help on using tickets.