Opened 7 years ago

Closed 7 years ago

#1837 closed defect (fixed)

Fonts rendering is broken

Reported by: marco Owned by: bernie
Priority: high Milestone: 8.2.0 (was Update.2)
Component: x window system Version:
Keywords: Cc: ajax, bernie, aleph
Blocked By: Blocking:
Deployments affected: Action Needed:
Verified: no

Description

On image 466 you can easily reproduce by opening the web browser and going on http://www.gnome.org. You will see corruption both in the toolbars and in the web page when you try to scroll.

Attachments (2)

text.py (1.1 KB) - added by marco 7 years ago.
Testcase
rendertest1837.c (3.2 KB) - added by bernie 7 years ago.

Download all attachments as: .zip

Change History (46)

comment:1 in reply to: ↑ description Changed 7 years ago by bernie

  • Owner changed from jg to bernie
  • Status changed from new to assigned

Also happens in these conditions:

  • Fedora Development as of today, on an iBook (so it's not OLPC specific)
  • Ubuntu Gutsy as of today, on an iBook (so it's not Fedora specific)
  • on r200 (excludes xf86-video-amd)
  • with or without EXA
  • both with gtk 2.11.4 and 2.10.13
  • both with cairo 1.4.4 and cairo 1.4.8

The only constant is X 1.3, so, yes, I'd blame it.

Oh, and btw: I can't reproduce it on my development laptop
which is running git's tip of X and other libraries. So
it's probably already fixed upstream.

comment:2 Changed 7 years ago by bernie

NoAccel appears to fix it on LX.

comment:4 Changed 7 years ago by JordanCrouse

Does no accel fix the problem on Ubuntu too?

comment:5 Changed 7 years ago by bernie

Yes, NoAccel fixes the problem (Fedora 8 on ppc, with an r200 card)

comment:7 Changed 7 years ago by bernie

Setting the default gtk theme fixes the bug.

Changed 7 years ago by marco

Testcase

comment:8 Changed 7 years ago by marco

This is not gtk theme specific, it's very easy to reproduce it with default gtk theme and gtk-demo.

Benzea wrote a testcase which does more or less what we do with gtk buttons in our theme. I rewrote it in python and attached it here. It renders fine on my F7 but breaks on the XO.

If I use lines instead of arcs it works fine (set ARC to True to try it out).

comment:9 Changed 7 years ago by JordanCrouse

I ran the test application on my test setup - it does indeed break (on mine, it draws N chars in the same color as the background). Debugging stubs in the LX driver showed that the same first N chars were being drawn with the same source color as the background. So the driver is very correctly rendering the text. This leads me to believe that at least the problem in text.py is *not* driver related.

One thing of interest is that the same source PicturePtr is being used for all of the operations - even when the background is a different color then the foreground. That might be interesting - if something is trying to render the same colors together, it might be getting confused. This is further bolstered by the fact that rectangle drawing in the test app doesn't cause this problem - this is because the rectange is drawn with a different primitive (not composite).

So the problem could be in several places - either in the X server where the compositing is being dispatched, or in GTK when the operations are being drawn. The two most unlikely culprits are the driver and Cairo.

comment:11 Changed 7 years ago by marco

Cairo is doing one XRenderCompositeText8 with all the glyphs in the text. The source is 1x1, the dest is the size of the drawing area. I could not find any difference between F7 and the XO at this level.

comment:12 Changed 7 years ago by marco

As a further data point, F7 on a geode LX board works fine. The only relevant software difference that I can think of are xorg.conf and the driver.

comment:13 Changed 7 years ago by JordanCrouse

And the X server - the vesa driver uses only software rendering, which goes down a different path. Believe me when I tell you - from my instrumentation on text.py, the driver is doing exactly what it is asked to do:

Draw u8 mask with source picture 1x1 (color 0x0000FF) size 400x150 <--- background
Draw u8 mask with source picture 1x1 (color 0x0000FF) size 11x14 <--- H
Draw u8 mask with source picture 1x1 (color 0x0000FF) size 10x11 <--- E
Draw u8 mask with source picture 1x1 (color 0x000000) size 2x15 <--- L
...

And so on. And we are drawing exactly that on the screen. This is further bolstered by the fact that if you draw a rectangle then everything works out OK - thats because a rectangle drawn by another primative, not a u8 mask, so there is less chance of variables being left over and other such programming mistakes. I'll run more tests today, and verify that we are correctly getting the color from the source pixmap, but I have to say, I'm not optimistic we'll find any glaring errors.

comment:14 Changed 7 years ago by jg

  • Cc ajax added

comment:15 Changed 7 years ago by bernie

In all my tests, I was using 16bpp. Both text.py and font-problem.c
work fine in 24bpp on the r200/ppc.

comment:16 Changed 7 years ago by bernie

Can't reproduce it on chris ball's laptop, running rawhide with intel_drv in 16bpp

This may mean that we're seeing two different bugs, one of which is r200 specific
and probably driver related.

comment:17 Changed 7 years ago by bernie

I can't reproduce it on the GX too (with build 482)

comment:18 Changed 7 years ago by jg

  • Cc bernie added
  • Owner changed from bernie to JordanCrouse
  • Priority changed from high to blocker
  • Status changed from assigned to new

We need to get to the bottom of this one....

Bernie, what's going on?

Adam, can you help?

comment:19 Changed 7 years ago by cjb

Another good way to reproduce on our images is to open the developer console -- the color of the tab header text will flash between white and black.

comment:20 Changed 7 years ago by bernie

  • Owner changed from JordanCrouse to dcbw

It was that trivial... Dan is committing it for me.

From a3088c6ba17d136131f76993d789305256fde2bf Mon Sep 17 00:00:00 2001
From: Bernardo Innocenti <bernie@codewiz.org>
Date: Wed, 11 Jul 2007 08:55:17 -0400
Subject: [PATCH] Fix potential CPU/GP race in lx_get_source_color()
Organization: One Laptop Per Child

---
 src/amd_lx_exa.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/src/amd_lx_exa.c b/src/amd_lx_exa.c
index 3f079cd..69c8fa6 100644
--- a/src/amd_lx_exa.c
+++ b/src/amd_lx_exa.c
@@ -262,6 +262,9 @@ static unsigned int lx_get_source_color(PicturePtr pSrc, int x, int y, int dstFo
   
   bits += (y * stride) + (x * (bpp >> 3));
 
+  /* GP may still be working on this surface */
+  gp_wait_until_idle();
+
   /* Read the source value */
 
   switch(bpp) {
-- 
1.5.2.2

comment:21 Changed 7 years ago by JordanCrouse

We have two fixes - Bernie's (above) and a little bit more global version from Adam. Bernie's is going into the build immediately to get everybody working again, and then Adam's (which touches the X server) will be along later. We'll merge some version of Adam's fix into the driver, but we'll probably have to play #ifdef tricks so 1.3 users of the driver outside of OLPC won't see the same breakage.

comment:22 follow-up: Changed 7 years ago by cmeadors

this is still in build 499

comment:23 in reply to: ↑ 22 Changed 7 years ago by tomeu

Replying to cmeadors:

this is still in build 499

running yum check-updates and yum update installs the new amd driver that has this problem fixed.

comment:24 Changed 7 years ago by bernie

  • Resolution set to fixed
  • Status changed from new to closed

Appears to be fixed in 502. Thankyou.

comment:25 Changed 7 years ago by JordanCrouse

  • Resolution fixed deleted
  • Status changed from closed to reopened

Not fixed - just bandaided. Keep it open until we're happy with the results.

comment:26 Changed 7 years ago by bernie

  • Resolution set to fixed
  • Status changed from reopened to closed

Waiting for the GPU before accessing memory with the CPU is the
correct and clean fix for a real bug, not a band aid.

Non reading the memory at all is just an optimization. And the
cairobench results published by cjb prove that avoiding that
stall is not going to buy us anything.

My idea is that there are probably other places in the driver
or in EXA causing GPU stalls. I'm going to run some tests
with oprofile to see if I can find the hot spots.

We should track these things in a different bug and close this one.
Jordan, feel free to reopen the bug again if you disagree.

comment:27 follow-up: Changed 7 years ago by cjb

Non reading the memory at all is just an optimization. And the cairobench results published by cjb prove that avoiding that stall is not going to buy us anything.

Just a note to say that I don't think my cairo results prove anything. We don't know if we're hitting the stall, so we don't know how bad it is when we hit it.

Jordan, feel free to reopen the bug again if you disagree.

When someone is known to disagree, it is much more polite to complete the conversation before changing the bug state, rather than switching the bug state each time you make a comment.

I'd love to see the oprofile results against a use case that we know triggered the bug, ie. using the developer console in sugar.

comment:28 in reply to: ↑ 27 ; follow-up: Changed 7 years ago by bernie

Replying to cjb:

Just a note to say that I don't think my cairo results prove anything. We don't know if we're hitting the stall, so we don't know how bad it is when we hit it.

I had a look at what cairobench does, and it never tries to fill
a non rectangular area with a solid color, which is what would
have triggered our codepath.

I still have my testcase that does exactly this, although it wasn't
meant to be benchmark. We could limit the number of iterations
and use time on it.

Jordan, feel free to reopen the bug again if you disagree.

When someone is known to disagree, it is much more polite to complete the conversation before changing the bug state, rather than switching the bug state each time you make a comment.

Sorry, but as I said, I'm not closing the bug to end the discussion.
I just think it doesn't belong here because it's unrelated to this
bug. It's a performance problem triggered by an EXA misbehavior.

I'd love to see the oprofile results against a use case that we know triggered the bug, ie. using the developer console in sugar.

Or the atteched XRender program...

Changed 7 years ago by bernie

comment:29 in reply to: ↑ 28 Changed 7 years ago by cjb

Replying to bernie:

I'd love to see the oprofile results against a use case that we know triggered the bug, ie. using the developer console in sugar.

Or the atteched XRender program...

Awesome, even better. That'll be extremely useful for isolating the profile. Thanks!

comment:30 follow-up: Changed 7 years ago by JordanCrouse

The reason why this fix hasn't been checked in yet is that it introduces a unilateral delay for all composite operations with a mask, which turns out to be used for pretty much any rendering other then a simple rectangle (nearly all the cairo 1.4 test suite use masks at some point). We are guarnteed a stall every time a new operation is started.

Adams patch make a determination of where the 1x1 source color lies, and if it is in system memory, we can avoid the delay. Presumably, over time the EXA migration will be fixed so that such simple pixmaps are ensured to live in system memory, making life easier for all of this.

I kept this bug open because it is a reminder to us that our current solution is sub-optimal, and that we need to ensure that Adams code gets into the driver (and X server), and further more that I need to backport the code so that it works on a 1.3 server as well. But I'm not going to get into a "revert war" with you over this - we'll let Jim decide what he wants to do.


comment:31 Changed 7 years ago by jg

Let's leave it open to remind ourselves to deal with this sooner rather than later; the suboptimal patch is being carried in the RPM, rather than checked in.

comment:32 Changed 7 years ago by cjb

  • Resolution fixed deleted
  • Status changed from closed to reopened

Reopening, then.

comment:33 in reply to: ↑ 30 Changed 7 years ago by bernie

Replying to JordanCrouse:

The reason why this fix hasn't been checked in yet is that it introduces a unilateral delay for all composite operations with a mask, which turns out to be used for pretty much any rendering other then a simple rectangle (nearly all the cairo 1.4 test suite use masks at some point). We are guarnteed a stall every time a new operation is started.

I understand, but the current driver does the wrong thing faster. Along this line, you could make the driver even faster by deleting the remaining calls to gp_wait_until_idle() !

Adams patch make a determination of where the 1x1 source color lies, and if it is in system memory, we can avoid the delay.

Do you agree that the 1x1 pixmap will still be needlessly DMA'd to the framebuffer and then accessed in system memory? And if so, we'd still need to sync with the GPU before freeing the pixmap, right? But then we have the same stall we were trying to avoid... How do you fix that?

Anyway, I suggest profiling before taking action. It's easy to overlook bottlenecks with this architecture.

and further more that I need to backport the code so that it works on a 1.3 server as well.

How do you plan to fix it for 1.3 as in my proposal? If not, how?

comment:34 Changed 7 years ago by jg

  • Milestone changed from Trial-2 to Trial-3

comment:35 Changed 7 years ago by jg

Got a good fix in xserver head... According to Jordan...

comment:36 follow-up: Changed 7 years ago by jg

  • Priority changed from blocker to high

comment:37 in reply to: ↑ 36 Changed 7 years ago by bernie

Replying to jg:

A temporary patch was already in the amd driver we've been shipping since 502.

A better fix is already in f.d.o's git:

http://gitweb.freedesktop.org/?p=xorg/driver/xf86-video-amd.git;a=commitdiff;h=a6192811e21c9f8d17b409018f945adc2eea3594

Supposedly, it depends on this xserver enhancement by ajax:

http://gitweb.freedesktop.org/?p=xorg/xserver.git;a=commit;h=486fd4145aed93093d1f1655de40c0a8582bb8b1

but, on first glance, it seems to me that the old xserver code would
work anyway because it implicitly synchronizes.

Therefore, the extra synchronization in the amd driver is never needed
(both on 1.3 and 1.4), and can be ditched for good. Jordan, what do you
think?

comment:38 Changed 7 years ago by bernie

This fix for exaGetPixmapFirstPixel() is required along with the new fix:

http://gitweb.freedesktop.org/?p=xorg/xserver.git;a=commit;h=3c448b0eb67337b56641e09a6d168aad6745e3ef

Note that we are still needlessly uploading and then downloading the picture,
and with the current amd_drv I think we'll still synchronize with the GPU for
the download.

To fix it properly, I think we need a more radical change in the EXA heuristics
for uploading pixmaps to the framebuffer.

comment:39 Changed 7 years ago by aleph

  • Cc aleph added

comment:40 Changed 7 years ago by jg

  • Milestone changed from Trial-3 to First Deployment, V1.0

comment:41 Changed 7 years ago by dcbw

  • Owner changed from dcbw to bernie
  • Status changed from reopened to new

Over to bernie, if/when all the pieces are together in packages we can dump them into builds.

comment:42 Changed 7 years ago by bernie

The exaGetPixmapFirstPixel() fix is already in.

We have one extra gp_wait_until_idle() to drop, but it's
safer to do it right after Trial3, just in case.

comment:43 Changed 7 years ago by bernie

I think this is now safe to close. Anyone opposed?

comment:44 Changed 7 years ago by bernie

  • Resolution set to fixed
  • Status changed from new to closed

Thus, closed.

Note: See TracTickets for help on using tickets.