Opened 10 years ago

Closed 10 years ago

Last modified 20 months ago

#905 closed defect (fixed)

The libertas driver causes a 33% slowdown.

Reported by: cjb
Owned by: marcelo
Priority: high
Milestone:
Component: kernel
Version:
Keywords: relnote, performance
Cc: cjb, JordanCrouse
Blocked By:
Blocking:
Deployments affected:
Action Needed:
Verified: yes


As below:

bash-3.1# python
Pystone(1.1) time for 50000 passes = 38.92
This machine benchmarks at 1284.69 pystones/second

bash-3.1# mv /lib/firmware/usb8388.bin /lib/firmware/usb8388.bin.bak
bash-3.1# rmmod usb8xxx
bash-3.1# python
Pystone(1.1) time for 50000 passes = 28.73
This machine benchmarks at 1740.34 pystones/second

bash-3.1# mv /lib/firmware/usb8388.bin.bak /lib/firmware/usb8388.bin
bash-3.1# modprobe usb8xxx
bash-3.1# python
Pystone(1.1) time for 50000 passes = 39.01
This machine benchmarks at 1281.72 pystones/second

The slowdown is consistent across all processes, not just python.
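
For reference, the size of the slowdown can be worked out directly from the pystone figures above. The sketch below only does that arithmetic; the variable and function names are illustrative, not part of pystone. Note that the same measurements read as roughly 35% more run time or roughly 26% less throughput, depending on which number is used as the baseline.

```python
# Figures copied from the first two pystone runs above.
loaded_secs, unloaded_secs = 38.92, 28.73        # usb8xxx loaded / removed
loaded_rate, unloaded_rate = 1284.69, 1740.34    # pystones/second

def percent_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0

# Extra run time caused by loading the driver, relative to the fast case.
extra_time_pct = percent_change(loaded_secs, unloaded_secs)

# Throughput lost when the driver is loaded, relative to the fast case.
lost_rate_pct = -percent_change(loaded_rate, unloaded_rate)

print(f"run time with driver loaded:   +{extra_time_pct:.1f}%")
print(f"throughput with driver loaded: -{lost_rate_pct:.1f}%")
```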

Change History (9)

comment:1 Changed 10 years ago by wmb@…

The pystone test illustrates the problem, but since it is complicated, it is of limited value for fault isolation. The following test illustrates the problem and is much simpler:

bash-3.1# wget
bash-3.1# unzip
bash-3.1# time ./forth spin.fth

real 0m9.234s
user 0m9.210s
sys 0m0.030s
bash-3.1# mv /lib/firmware/usb8388.bin /lib/firmware/usb8388.bin.bak
bash-3.1# rmmod usb8xxx
bash-3.1# time ./forth spin.fth

real 0m6.954s
user 0m6.930s
sys 0m0.020s


spin.fth contains:

code spin

cx pop
begin loopne

d# 500,000,000 spin

It essentially consists of 500,000,000 iterations of a 1-instruction "decrement cx and branch to self" loop, i.e. the LOOPNE instruction.

I have measured the runtime of this same loop under Open Firmware with interrupts turned off, using a stopwatch. The result was the same - 6.9 seconds. So the "fast" time under Linux really means "essentially no overhead".
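
As a rough cross-check, the "user" times above can be turned into a per-iteration cost. This is only arithmetic on the numbers already reported; the names are illustrative, and no CPU clock rate is assumed, so the result is in nanoseconds rather than cycles.

```python
ITERATIONS = 500_000_000          # loop count passed to spin

user_with_driver = 9.210          # seconds, usb8xxx loaded
user_without_driver = 6.930       # seconds, usb8xxx removed

# Average time per LOOPNE iteration in each configuration.
ns_per_iter_with = user_with_driver / ITERATIONS * 1e9
ns_per_iter_without = user_without_driver / ITERATIONS * 1e9

# Extra run time relative to the no-overhead case.
slowdown_pct = (user_with_driver / user_without_driver - 1.0) * 100.0

print(f"with driver:    {ns_per_iter_with:.2f} ns/iteration")
print(f"without driver: {ns_per_iter_without:.2f} ns/iteration")
print(f"slowdown:       {slowdown_pct:.1f}%")
```

The ratio comes out at about 33%, which matches the figure in the ticket summary.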

This loop does not use memory at all. The instruction just sits there decrementing the register. It might not even hit the icache, depending on how the Geode's pipeline is implemented.

Note also that the "user" and "real" times are virtually identical. The difference of roughly 2.3 seconds (9.210 s versus 6.930 s) appears in the "user" category, so whatever is causing the slowdown is being charged against userland. How does "time" account for time spent in interrupt handlers? Is that charged to the running process, or to the system? If interrupt time is charged to the system, it would imply that the slowdown is caused by some deep hardware stall rather than by time stolen by interrupt handlers.
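
For anyone who wants to reproduce the real/user/sys split without the Forth binary, here is a minimal sketch using Python's standard `os.times()`. The `spin` and `measure` helpers are hypothetical names, and a Python busy loop is only a stand-in for the LOOPNE loop (it does touch memory). It shows how the three categories are reported for a pure user-space loop; it does not answer how the kernel attributes interrupt-handler time.

```python
import os
import time

def spin(n):
    """Busy loop that stays in user space (no system calls)."""
    while n:
        n -= 1

def measure(func, *args):
    """Return (real, user, sys) seconds consumed while running func(*args)."""
    cpu_before, wall_before = os.times(), time.perf_counter()
    func(*args)
    cpu_after, wall_after = os.times(), time.perf_counter()
    return (wall_after - wall_before,
            cpu_after.user - cpu_before.user,
            cpu_after.system - cpu_before.system)

real, user, system = measure(spin, 2_000_000)
print(f"real {real:.3f}s  user {user:.3f}s  sys {system:.3f}s")
```

Running this with and without the usb8xxx module loaded should show whether the extra time lands in "user", as it does for spin.fth.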

comment:2 Changed 10 years ago by jg

  • Keywords cjb JordanCrouse added

comment:3 Changed 10 years ago by jg

  • Cc cjb JordanCrouse added
  • Keywords cjb JordanCrouse removed

comment:4 Changed 10 years ago by cjb

  • Keywords relnote performance added

comment:5 Changed 10 years ago by wmb@…

Here is a summary of the problem details:

a) The GX chip has a bug whereby DMA to cached memory intermittently causes instruction execution errors - the branch target cache sometimes yields incorrect results when a cache snoop happens at just the wrong time.

b) The standard workaround for that bug is to use some special diagnostic features to force a 4-cycle CPU stall on a DMA-induced cache snoop.

c) Under most system loads, the performance degradation of that workaround is between 1% and 4% - usually an acceptable amount.

d) USB network interfaces require repetitive polling at the USB transaction level in order to accept incoming packets with low latency. This is just the way that USB works. The polling is done automatically by the USB host interface hardware, without CPU intervention. However, that polling results in repetitive DMA accesses to a descriptor, at short intervals (a few microseconds or less).

e) This repetitive DMA to a single location is a worst case for the branch cache interaction, resulting in >30% slowdown of other code executing on the CPU.

f) Turning off the workaround (b) risks random application and kernel crashes, so that option is not attractive.

g) We have verified that, if the USB descriptor is in uncached memory, the slowdown does not occur.
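
To get a feel for the scale involved in (d) and (e), the spin.fth timings from comment 1 can be converted into an average CPU time lost per DMA access. This is a back-of-envelope sketch only. It assumes that all of the extra run time is due to snoop-induced stalls and that the descriptor accesses are evenly spaced; the polling intervals tried below are hypothetical values, not measurements.

```python
user_with_driver = 9.210       # seconds, from comment 1
user_without_driver = 6.930    # seconds, from comment 1

# Fraction of the slow run that was lost compared with the fast run.
lost_fraction = (user_with_driver - user_without_driver) / user_with_driver

def implied_stall_ns(poll_interval_us):
    """Average CPU time lost per DMA access, in nanoseconds, assuming
    one access every `poll_interval_us` microseconds (hypothetical)."""
    return lost_fraction * poll_interval_us * 1000.0

print(f"fraction of run time lost: {lost_fraction:.1%}")
for interval_us in (0.5, 1.0, 2.0, 4.0):   # assumed polling intervals
    stall = implied_stall_ns(interval_us)
    print(f"poll every {interval_us} us -> about {stall:.0f} ns lost per access")
```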

comment:6 follow-up: Changed 10 years ago by jg

  • Priority changed from blocker to high
  • Resolution set to fixed
  • Status changed from new to closed

We have turned off the cache snoop in build Q2B81. Let us know if you see any instability. If there is none, the engineering effort needed to finish debugging the other fix is better spent on LX bringup.

This, by the way, was true for any USB network adaptor, not just wireless.

comment:7 in reply to: ↑ 6 Changed 10 years ago by hai

Replying to jg:

We have turned off the cache snoop in build Q2B81.

Sorry, this is a bit unclear. Do we now use option f) of comment 5 (turning off the workaround described in b) despite worries that the machine will become unstable? Or do we use option g)?

comment:8 Changed 10 years ago by jg

We have now disabled the workaround. The instability has not been observed on Linux, and we plan to use the Geode LX in production, which has the problem entirely fixed.

comment:9 Changed 20 months ago by Quozl

  • Milestone BTest-3 deleted

