Opened 8 years ago

Closed 8 years ago

Last modified 8 years ago

#905 closed defect (fixed)

The libertas driver causes a 33% slowdown.

Reported by: cjb
Owned by: marcelo
Priority: high
Milestone: BTest-3
Component: kernel
Version:
Keywords: relnote, performance
Cc: cjb, JordanCrouse
Blocked By:
Blocking:
Deployments affected:
Action Needed:
Verified: yes

Description

As below:

bash-3.1# python pystone.py
Pystone(1.1) time for 50000 passes = 38.92
This machine benchmarks at 1284.69 pystones/second

bash-3.1# mv /lib/firmware/usb8388.bin /lib/firmware/usb8388.bin.bak
bash-3.1# rmmod usb8xxx
bash-3.1# python pystone.py
Pystone(1.1) time for 50000 passes = 28.73
This machine benchmarks at 1740.34 pystones/second

bash-3.1# mv /lib/firmware/usb8388.bin.bak /lib/firmware/usb8388.bin
bash-3.1# modprobe usb8xxx
bash-3.1# python pystone.py
Pystone(1.1) time for 50000 passes = 39.01
This machine benchmarks at 1281.72 pystones/second

The slowdown is consistent across all processes, not just python.
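
For reference, the size of the slowdown can be worked out from the pystone figures above. The following is a minimal sketch; the variable and function names are illustrative and do not come from pystone.py. The percentage depends on whether it is measured as extra run time or as lost throughput.

```python
# Pystone results copied from the transcript above.
with_driver = {"seconds": 38.92, "pystones_per_s": 1284.69}
without_driver = {"seconds": 28.73, "pystones_per_s": 1740.34}


def percent_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


# Extra run time with the driver loaded, relative to the run without it.
time_increase = percent_change(with_driver["seconds"],
                               without_driver["seconds"])

# Throughput lost with the driver loaded, relative to the run without it.
throughput_drop = -percent_change(with_driver["pystones_per_s"],
                                  without_driver["pystones_per_s"])

print(f"run time increase: {time_increase:.1f}%")
print(f"throughput drop:   {throughput_drop:.1f}%")
```

Both figures describe the same measurement. The run time grows by roughly a third, while the pystones/second rate falls by roughly a quarter.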

Change History (8)

comment:1 Changed 8 years ago by wmb@…

The pystone test illustrates the problem, but since it is complicated, it is of limited value for fault isolation. The following test illustrates the problem and is much simpler:

bash-3.1# wget http://dev.laptop.org/~wmb/spin.zip
bash-3.1# unzip spin.zip
bash-3.1# time ./forth spin.fth

real 0m9.234s
user 0m9.210s
sys 0m0.030s
bash-3.1# mv /lib/firmware/usb8388.bin /lib/firmware/usb8388.bin.bak
bash-3.1# rmmod usb8xxx
bash-3.1# time ./forth spin.fth

real 0m6.954s
user 0m6.930s
sys 0m0.020s
bash-3.1#

Notes:

spin.fth contains:

code spin

cx pop
begin loopne

c;
d# 500,000,000 spin

It essentially consists of 500,000,000 iterations of a 1-instruction "decrement cx and branch to self" loop, i.e. the LOOPNE instruction.

I have measured the runtime of this same loop under Open Firmware with interrupts turned off, using a stopwatch. The result was the same - 6.9 seconds. So the "fast" time under Linux really means "essentially no overhead".

This loop does not use memory at all. The instruction just sits there decrementing the register. It might not even hit the icache, depending on how the Geode's pipeline is implemented.
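
The same arithmetic can be applied to the spin test. This is a rough sketch using only the "real" times from the transcript above; the names are illustrative.

```python
# Figures copied from the `time ./forth spin.fth` runs above.
ITERATIONS = 500_000_000
real_with_driver = 9.234      # seconds, usb8xxx loaded
real_without_driver = 6.954   # seconds, usb8xxx removed

# Extra run time relative to the run without the driver.
slowdown_pct = ((real_with_driver - real_without_driver)
                / real_without_driver * 100.0)

# Average wall-clock cost of one loop iteration, in nanoseconds.
ns_per_iter_fast = real_without_driver / ITERATIONS * 1e9
ns_per_iter_slow = real_with_driver / ITERATIONS * 1e9

print(f"slowdown: {slowdown_pct:.1f}%")
print(f"per iteration: {ns_per_iter_fast:.2f} ns without the driver, "
      f"{ns_per_iter_slow:.2f} ns with it")
```

Measured this way, the spin test slows down by just under 33%, which matches the figure in the ticket title.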

Note also that the "user" and "real" times are virtually identical. The 2.4 second difference appears in the "user" category, so whatever it is that is causing the slowdown is getting charged against userland. How does "time" account for time spent in interrupt handlers? Is that charged to the running process, or to the system? If interrupt time is charged to the system, it would imply that the slowdown is caused by some deep hardware stall instead of by time stolen by interrupt handlers.
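
To look at the same real/user/sys split from inside a process, something like the following could be used. It is only a sketch: `spin` is a hypothetical Python stand-in for the Forth word, the iteration count is arbitrary, and the `resource` module is available on Unix-like systems only. It does not settle how the kernel attributes interrupt time; it just reports the categories that "time" prints.

```python
import resource
import time


def spin(iterations):
    """Busy loop standing in for the Forth `spin` word (hypothetical analogue)."""
    n = iterations
    while n:
        n -= 1


def measure(iterations=2_000_000):
    """Return wall-clock, user and system time consumed by one spin run."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    start = time.perf_counter()
    spin(iterations)
    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "real": real,
        "user": after.ru_utime - before.ru_utime,  # CPU time in user mode
        "sys": after.ru_stime - before.ru_stime,   # CPU time in kernel mode
    }


timings = measure()
for name in ("real", "user", "sys"):
    print(f"{name:5s} {timings[name]:.3f}s")
```

Running this with and without the driver loaded should show whether the extra time lands in "user", as it does in the transcript above.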

comment:2 Changed 8 years ago by jg

  • Keywords cjb JordanCrouse added

comment:3 Changed 8 years ago by jg

  • Cc cjb JordanCrouse added
  • Keywords cjb JordanCrouse removed

comment:4 Changed 8 years ago by cjb

  • Keywords relnote performance added

comment:5 Changed 8 years ago by wmb@…

Here is a summary of the problem details:

a) The GX chip has a bug whereby DMA to cached memory intermittently causes instruction execution errors - the branch target cache sometimes yields incorrect results when a cache snoop happens at just the wrong time.

b) The standard workaround for that bug is to use some special diagnostic features to force a 4-cycle CPU stall on a DMA-induced cache snoop.

c) Under most system loads, the performance degradation of that workaround is between 1% and 4% - usually an acceptable amount.

d) USB network interfaces require repetitive polling at the USB transaction level in order to accept incoming packets with low latency. This is just the way that USB works. The polling is done automatically by the USB host interface hardware, without CPU intervention. However, that polling results in repetitive DMA accesses to a descriptor, at short intervals (a few microseconds or less).

e) This repetitive DMA to a single location is a worst case for the branch cache interaction, resulting in >30% slowdown of other code executing on the CPU.

f) Turning off the workaround (b) risks random application and kernel crashes, so that option is not attractive.

g) We have verified that, if the USB descriptor is in uncached memory, the slowdown does not occur.
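
As a back-of-envelope check on (b) and (e), the spin timings from comment 1 can be turned into an implied stall rate. This is a deliberately simplistic sketch: it assumes every snoop costs exactly the 4-cycle stall from (b) and nothing else, and the clock frequency is an assumed value that is not stated anywhere in this ticket. The function and constant names are made up for illustration.

```python
def implied_snoop_rate(slow_s, fast_s, clock_hz, stall_cycles=4):
    """Estimate the stalled fraction of run time and the snoops per second.

    Assumes the whole difference between the slow and the fast run is
    spent in fixed-length stalls of `stall_cycles` CPU cycles each.
    """
    stalled_fraction = (slow_s - fast_s) / slow_s
    stall_cycles_per_s = stalled_fraction * clock_hz
    snoops_per_s = stall_cycles_per_s / stall_cycles
    return stalled_fraction, snoops_per_s


# ASSUMPTION: CPU clock in Hz. Not given in the ticket; replace with the
# real value for the machine under test.
ASSUMED_CLOCK_HZ = 366e6

frac, rate = implied_snoop_rate(9.234, 6.954, ASSUMED_CLOCK_HZ)
print(f"fraction of run time stalled: {frac:.1%}")
print(f"implied snoops per second:    {rate:.3g}")
print(f"implied snoop interval:       {1e9 / rate:.1f} ns")
```

The model is only good for orders of magnitude. It does not capture the branch target cache interaction described in (e), so the implied interval should not be read as a measured polling rate.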

comment:6 follow-up: Changed 8 years ago by jg

  • Priority changed from blocker to high
  • Resolution set to fixed
  • Status changed from new to closed

We have turned off the cache snoop in build Q2B81. Let us know if you see any instability. But if not, the additional engineering work to finish debugging the other fix can better go into LX bringup.

This, by the way, was true for any USB network adaptor, not just wireless.

comment:7 in reply to: ↑ 6 Changed 8 years ago by hai

Replying to jg:

We have turned off the cache snoop in build Q2B81.

Sorry, this is a bit unclear. Do we now use option b) of comment #5 despite worries the machine will become unstable? Or do we use option g)?

comment:8 Changed 8 years ago by jg

We have now disabled the workaround. The instability has not been observed on Linux, and we plan to use the Geode LX in production, which has the problem entirely fixed.
