Ticket #7612 (closed defect: fixed)

Opened 6 years ago

Last modified 6 years ago

XRender cursor rendering broken with XShmPutImage

Reported by: bert Owned by: cscott
Priority: blocker Milestone: 8.2.0 (was Update.2)
Component: x window system Version: Development build as of this date
Keywords: 8.2.0:? joyride-2208:- blocks:8.2.0 joyride-2363:+ 8.2-759:+ Cc: bernie, JordanCrouse, etoys, dsd, jg, gregorio
Action Needed: test in release Verified: no
Deployments affected: Blocked By:
Blocking:

Description

Joyride-2201: The default large cursor in Etoys leaves gribblies on the screen when drawing at the same location. This can most easily seen by running Etoys and wiggling the mouse pointer over the moving red car in its start screen.

Apparently, the cursor rendering does not update the patch it grabbed from the frame buffer if that frame buffer location is written to. This should not happen, and particularly not if we were using the hardware cursor.

The problem does not occur in jhbuild (Xephyr) or regular Linux, so I am pretty certain this is not an Etoys bug. Also, the stable build (708) is fine. Setting to blocker since this is a regression.

Attachments

sqcursortest.c (24.5 kB) - added by ohshima 6 years ago.
a small program that exhibits the problem.

Change History

  Changed 6 years ago by bert

  • summary changed from XRender cursor rendering broken to XRender cursor rendering broken with XShmPutImage

Etoys drawing to the screen uses XShmPutImage(). Switching the drawing code to XPutImage() fixes the gribblies (remove the -xshm option in /usr/bin/etoys).

Also, in case this is relevant - the code creating the cursor is in display_ioSetCursorARGB(): http://squeakvm.org/cgi-bin/viewcvs.cgi/branches/olpc/platforms/unix/vm-display-X11/sqUnixX11.c?view=auto while

  Changed 6 years ago by dsd

  • keywords 8.2.0:? added

follow-up: ↓ 4   Changed 6 years ago by dsd

  • cc dsd added

Please retest using joyride-2207 or newer. We have a new xserver, and some visual glitches I was seeing in gstreamer's ximagesink are now gone.

in reply to: ↑ 3   Changed 6 years ago by bert

  • keywords joyride-2208:- added

Replying to dsd:

Please retest using joyride-2207 or newer. We have a new xserver, and some visual glitches I was seeing in gstreamer's ximagesink are now gone.

Does not work in 2208.

  Changed 6 years ago by dsd

Bernie says: watch out for a patch where ajax dropped the old software cursor code in favor of a DAMAGE based implementation, but I dunno how the _driver_ might have anything to do with damage tracking

  Changed 6 years ago by cjb

  • keywords blocks?:8.2.0 added

follow-up: ↓ 8   Changed 6 years ago by gregorio

  • cc jg added

Hi Jim,

Is this only going to affect eToys?

Bernie,

Is this cosmetic or does it make eToys unusable?

I definitely want this fixed but I don't think I would hold the release for it unless the answer to both of those questions is: "yes".

Let me know who can work on this and if they have something else they need to drop, give us that info too so we can make a call.

Thanks,

Greg S

in reply to: ↑ 7   Changed 6 years ago by ohshima

Replying to gregorio:

Hi Jim, Is this cosmetic or does it make eToys unusable?

It does make Etoys unusable.

I definitely want this fixed but I don't think I would hold the release for it unless the answer to both of those questions is: "yes".

Or either one is yes? And I think this is a blocker for a good reason.

  Changed 6 years ago by gregorio

  • keywords blocks:8.2.0 added; blocks?:8.2.0 removed

  Changed 6 years ago by gregorio

  • cc gregorio added

  Changed 6 years ago by dsd

Jordan Crouse (xf86-video-geode upstream) says this is very likely to be a bug in the xf86-video-geode driver. He's not in a position to work on this right now though.

  Changed 6 years ago by dsd

16:54 < CosmicPenguin> dsd_: one thing that could help would be a very simple 
                       test case to expose the problem
16:55 < CosmicPenguin> for obvious reasons, debugging a X server while running 
                       eToys is not optimal

any volunteers? :)

  Changed 6 years ago by ohshima

I'm certainly willing to help to debug, but not familiar enough with the driver code itself. Setting the cursor form is done by display_ioSetCursorWithMask() in:

http://squeakvm.org/cgi-bin/viewcvs.cgi/branches/olpc/platforms/unix/vm-display-X11/sqUnixX11.c?rev=1901&view=auto

So you could take out -O2 flags from a few Makefiles in the VM build directory, run a gdb session in X or in console, set a breakpoint at perhaps ioSetCursorARGB() in sqUnixMain.c or display_ioSetCursorWithMask(), set arg /usr/share/etoys/etoys.image and run and step the function.

This may not be what you are looking for (if so, sorry), but this would be my suggestion.

If the schedule is real tight and we don't have enough time to fix it, one thing we could do is to stop using the cursor with alpha for the release.

  Changed 6 years ago by JordanCrouse

What I was talking about was something more like this:

http://dev.laptop.org/~jcrouse/cursortest.c

You have to understand the importance of unit tests in this regard - Sugar (and indeed, most of the UI), uses compositing extensively. That means that the composite hooks in the driver get called literally dozens of times a second - finding the right function in that storm is very difficult. With a test case like the one I showed, the number of operations that need to be examined are reduced to 1 - which makes it rather easy to trace and debug. Using the program above, somebody familiar with how the cursor is changing can probably adapt the test case to reflect what is happening in display_ioSetCursorWithMask() and we can see if we can't shake loose a few bugs.

  Changed 6 years ago by ohshima

I started from your cursortest.c and trying to put the essence of the problem. It is harder than I thought, though. One thing I found out that it only happens that XShm is used. And of course the problem occurs only when some animation is going. I modified your cursortest.c to reproduce the problem, but haven't made the timing and stuff right yet...

I think understand the importance of unit test but still don't see what is the real difference between your cursortest.c and squeak BTW; yours doesn't really automatically test the appearance on the screen. And the problem only occurs in the case when something is animating, so number of operations to be examined is not going to be 1. And, the cursor doesn't have to be changed. It only happens once at the start up (that is why I suggested to use the debugger there), and subsequent partial screen updates, that I think is not related to compositing, is causing the problem.

  Changed 6 years ago by gregorio

Hi Ohsima,

Great work investigating this one. You may want to share what you have learned about how to debug/test X windows issues with the full devel list later.

For now I'm holding this one in blocker status but the engineers are trying to cut me off and create a release candidate.

Thanks a lot to you and Jordan for sticking with this one! Let me know if I should call for more help or there's anything else I can do.

Thanks,

Greg S

PS you may want to e-mail me greg at laptop dot org for faster reply if needed as I'm falling behind on bug comments.

  Changed 6 years ago by ohshima

Saying that it is not related to compositing was wrong, sorry about that. The problem only occurs when the program is running under the windows manager, so at there, it does involve composition.

Changed 6 years ago by ohshima

a small program that exhibits the problem.

  Changed 6 years ago by ohshima

Ok... on joyride 2298 and Q2E12, attached sqcursortest.c shows the problem. Compile it with: gcc -g -o sqcursortest sqcursortest.c -lXrender -lXext -lX11

and run it from the Terminal. If you just leave the cursor in the area, and wait the rectangle go under the cursor, you should see the problem.

  Changed 6 years ago by JordanCrouse

Excellent - thanks. If anybody gets a chance to look at this before me, you should look for a race between the upload function and the composite function in the X driver. Since the area goes black, we are probably dealing with a race in our old friend _GetPixelFromRGBA.

follow-up: ↓ 21   Changed 6 years ago by dsd

I can confirm that sqcursortest reproduces the problem, thanks! It also reproduces the problem just fine outside of sugar.

The area around the cursor goes white, not black.

in reply to: ↑ 20   Changed 6 years ago by ohshima

Replying to dsd:

It also reproduces the problem just fine outside of sugar.

Yes, I had some strange time to reproduce the problem, but with the latest, it seems to do it. The animation part is not necessary if you have a way to look into the right place.

The logic in .c is just a stripped down version of Squeak VM. If you can fix/workaround by fixing sqcursortest.c, please let us know. If the fix/workaround is feasible to merge back to the Squeak VM, we can do it, too.

  Changed 6 years ago by bert

  • blocking 8008 added

(In #8008) Yes we know, this is caused by #7612.

A temporary work-around is to disable usage of X shared memory by removing the -xshm option from /usr/bin/etoys (causing a performance degradation but makes the artefacts go away).

follow-up: ↓ 24   Changed 6 years ago by bernie

Seems like a dupe of #7830. We also stomped into it.

in reply to: ↑ 23   Changed 6 years ago by bert

Replying to bernie:

Seems like a dupe of #7830. We also stomped into it.

Technically, an older report cannot be a "duplicate" of a newer one, it's the other way around. Also, #7830 deals with general rendering artifacts, whereas this bug is about cursor rendering.

So no, I do not think this is a "dupe" of #7830 (but then, I know little about X internals).

  Changed 6 years ago by JordanCrouse

It sure looks like the same bug to me.

follow-up: ↓ 27   Changed 6 years ago by JordanCrouse

This seems to be the source of the issue - the damage function is "recovering" the pixmap by copying it from offscreen memory. Then, when it tries to composite the mouse on a subsequent step, the src is already wrong.

<-- Unaccelerated copy from exaShmPutImage -->
EXA fbCopyArea 33 66 129 130 -> 33 66                                           
EXA - DAMAGE REGION         
<-- Copy from damaged pixmap -->                                                    
[EXA] - copy 0,0 [2c9a00] ->160,195 [61940] 43x59                               
[EXA] - src = ffffffff dst = 55005500                                           
EXA - DAMAGE DONE
<-- Copy src pixmap for OVER composite
    operation -->                                                     
[EXA] - copy 160,195 [61940] ->0,0 [2c9a00] 43x59                               
[EXA] - src = ffffffff dst = ffffffff
<-- Composite operation -->                                           
[EXA]: SRC 2cb020,128,4,27,43                                                  
[EXA]: DST 0,2048,2,1024,768      

This definitely is related to XShm - the non-shared memory version in EXA goes down a different path. In the above output, DAMAGE REGION and DAMAGE DONE wrap the function DamageDamageRegion() which is only called in the exa handler for shared memory PutImage. At this point, its about a 50/50 chance that the bug still lives in the driver; we can't discount that EXA is doing the wrong thing, but up until now, every time I have blamed something on EXA, I have been proved wrong. I'm going to investigate more, but until you hear otherwise, go ahead and turn off shared memory to resolve this blocker. Yeah, it will be a little bit slower, but correct is better then fast any day of the week.

in reply to: ↑ 26   Changed 6 years ago by bert

Replying to JordanCrouse:

until you hear otherwise, go ahead and turn off shared memory to resolve this blocker.

Okay - what's a good way to test if we are running on a system with the bug? I'd like to limit the work-around to the necessary cases. Maybe test if we run on XO hardware? How do I do that?

follow-up: ↓ 30   Changed 6 years ago by cscott

I'm going to disable XShm in the 8.2 stream; I'll leave it enabled in the joyride stream in hopes of our actually fixing the root cause.

  Changed 6 years ago by bert

Workaround tested working in 8.2-754.

in reply to: ↑ 28   Changed 6 years ago by marco

Replying to cscott:

I'm going to disable XShm in the 8.2 stream; I'll leave it enabled in the joyride stream in hopes of our actually fixing the root cause.

Sugar uses XShm for the journal previews.

  Changed 6 years ago by dsd

I noticed that the fedora devel branch includes some new X patches, some related to EXA/Damage. So I built that package for OLPC: http://koji.fedoraproject.org/koji/taskinfo?taskID=778564

It solves the XSHm issue (sqcursortest now works fine) but text rendering is messed up.

  Changed 6 years ago by JordanCrouse

Unfortunate - do you have the changelog or the GIT commit that package was build from? I would like to see exactly what changed. Are you using the old -geode driver or the new one?

Also, if you (or somebody) could make a quick pycairo demo of the text rendering, I'll make sure its not driver related (probably not, if it wasn't there before).

  Changed 6 years ago by dsd

It was built from xorg-server-1.4.99.906.tar.bz2 with some patches applied (xserver-1.4.99-exa-master-upgrade.patch being the one that broke text and fixed xshm). I am now rebuilding with 825b3fe11d1b applied too, which looks like it should fix the issue

  Changed 6 years ago by JordanCrouse

dsd very kindly found me some python code that exposes the problem. The good news is that it seems to be fixed in the RandR 1.2 branch of the driver, so I must have done something right. The bad news is I'm not sure exactly what that was. Also, the RandR 1.2 branch has broken icon drawing, so I need to address that. Once I do, probably the best course of action would be to backport the EXA code to the old driver en-masse.

  Changed 6 years ago by dsd

  • next_action changed from diagnose to add to build

Fixed in xorg-x11-server-1.4.99.906-2.olpc3.2, coming soon to a joyride near you

  Changed 6 years ago by dsd

  • keywords joyride-2363:+ added

  Changed 6 years ago by dsd

  • owner changed from bernie to ApprovalForUpdate

Please include xorg-x11-server-1.4.99.906-2.olpc3.2 in 8.2 stream

  Changed 6 years ago by dsd

  • blocking 8008 removed

(In #8008) Let's track this in #7612

  Changed 6 years ago by dsd

  • owner changed from ApprovalForUpdate to cscott

|TestCase|

Run eToys. A car will move around the front screen. Move the mouse over the moving car, make sure that there are no visual glitches.

Michael approved this, scott can you tag please?

  Changed 6 years ago by cscott

This seems to have caused the regression in #8289. Even so, this will probably make it into the next stable.

Can you be more specific about the packages/versions involved, so I can write a proper change log, and tag these packages into dist-olpc3-testing? Thanks. (It seems xorg-x11-server-common should be involved, at least.)

  Changed 6 years ago by dsd

xorg-x11-server-Xorg-1.4.99.906-2.olpc3.2.i386.rpm and xorg-x11-server-common-1.4.99.906-2.olpc3.2.i386.rpm

Changelog:

  • Hide initial ugly X mouse cursor
  • Fix eToys rendering problems

  Changed 6 years ago by cscott

  • next_action changed from add to build to add to release

  Changed 6 years ago by cscott

  • next_action changed from add to release to test in release

Added to stable repo; should be in 758 and following.

  Changed 6 years ago by dsd

  • keywords 8.2-759:+ added
  • status changed from new to closed
  • resolution set to fixed

Confirmed working in build 759

Note: See TracTickets for help on using tickets.