Ticket #10632 (closed defect: wontfix)

Opened 3 years ago

Last modified 3 years ago

USB failure when updating SD using OFW

Reported by: wad
Owned by: wmb@…
Priority: normal
Milestone: 1.75-firmware
Component: ofw - open firmware
Version: 1.75-A2
Keywords: XO-1.75
Cc: gary, wad, martin.langhoff
Action Needed: no action
Verified: yes
Deployments affected:
Blocked By:
Blocking:

Description

On many, if not all, 1.75 A2 motherboards, using fs-update to update the internal or external SD card or eMMC fails with a "Short read of zdata file" error message from OFW.

The failure usually occurs either around block 50 or around block 1200.

This was tested running OFW version Q4A10i. On motherboards that have shown this failure, running a wear-levelling test from Linux on the USB device doesn't find any errors.

I spent some time looking at the power rails (Vmain, +3.3V, +5V_USB, Vin). There doesn't seem to be any abnormal activity when OFW fails. The voltage spikes due to the Vmain power supply switching transient are slightly larger when running OFW fs-update than when running the Linux tests, but successful efforts to reduce the spikes didn't have any effect on the frequency of crashing.

Increasing the voltage of Vmain to 1.35V from 1.25V didn't have any effect on the crash frequency.

Change History

Changed 3 years ago by wad

On 1/11/11, Mitch discovered that:

a) The USB errors are "Babble", i.e. the host controller detects data on the USB bus after the end of transaction. That is a fatal error according to the USB spec, requiring that the endpoint be halted.

b) The error only happens if the length of an individual USB transfer exceeds 4K. If I restrict transfers to 4K each, I can do long operations, including a complete fs-update, without error. If the transfer length exceeds 4K by even one sector (4K + 512 bytes), I quickly see babble conditions.

c) I tried it with the Ellisys USB Explorer, which did not notice a babble condition on the external bus at the error point. I'm not sure that's conclusive, but it suggests that the babble might be somehow related to conditions at the Hub/SOC.

d) I don't know the max transfer length for Linux; Lennart speculates that it might be 4K, per the filesystem block size and page size, but that is not confirmed. If Linux does indeed restrict to 4K, that would explain why Linux hasn't seen the problem.
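The 4K restriction described in (b) amounts to splitting each large read into pieces of at most 8 sectors. The following is a minimal Python sketch of that arithmetic only, not the OFW change itself. It assumes 512-byte sectors (as implied by "4K + 512 bytes" above); the function name split_transfer and the constant names are hypothetical.

```python
SECTOR_BYTES = 512         # assumed sector size, per "4K + 512 bytes" in the ticket
MAX_TRANSFER_BYTES = 4096  # largest transfer reported to work without babble


def split_transfer(start_block, block_count, max_bytes=MAX_TRANSFER_BYTES):
    """Split one read into (start, count) pieces no larger than max_bytes.

    Hypothetical helper for illustration; OFW does not expose this function.
    """
    max_blocks = max_bytes // SECTOR_BYTES
    if max_blocks < 1:
        raise ValueError("max_bytes must hold at least one sector")
    pieces = []
    block = start_block
    remaining = block_count
    while remaining > 0:
        count = min(remaining, max_blocks)
        pieces.append((block, count))
        block += count
        remaining -= count
    return pieces


# A 9-block (4.5K) read becomes one 8-block piece plus one 1-block piece.
print(split_transfer(0, 9))  # [(0, 8), (8, 1)]
```

Every piece stays at or below 4096 bytes, which is the size at which long operations were reported to complete without error.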

Here are two versions of a test script:

ok text-off \ Speed things up by turning off screen output

ok select u:0 \ Open USB disk in raw mode

ok 1000 0 do (cr i . load-base 0 8 read-blocks 8 <> abort" x" loop

That script re-reads the same 8 blocks (4K) starting at block 0. You can replace the 8 with larger numbers (2 places) to increase the read length. For me, it works at b (decimal 11) and fails at c (decimal 12) on one USB stick, while working at 8 and failing at 9 on a different stick.

ok text-off \ Speed things up by turning off screen output

ok select u:0 \ Open USB disk in raw mode

ok 1000 0 do (cr i . load-base i 8 * 8 read-blocks 8 <> abort" x" loop

The above script moves across the "disk" as it reads, instead of always reading the same blocks. It works at 8 and fails at 9 on the first stick above. Aha! It fails at 8 on the second stick. A third USB stick failed at a (decimal 10) blocks/transfer on the first test and failed at 9 on the second.
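The block counts in the two test scripts are hexadecimal (the ticket glosses b as decimal 11), so it can help to see them as transfer sizes in bytes. A small Python sketch, assuming 512-byte blocks as in "8 blocks (4K)"; the helper name transfer_bytes is hypothetical.

```python
SECTOR_BYTES = 512  # assumed block size, consistent with "8 blocks (4K)"


def transfer_bytes(hex_blocks):
    """Convert a hex block count, as typed at the ok prompt, to bytes."""
    return int(hex_blocks, 16) * SECTOR_BYTES


# Block counts mentioned in the test results above.
for count in ("8", "9", "a", "b", "c"):
    print(f"{count} blocks = {int(count, 16)} decimal = {transfer_bytes(count)} bytes")
```

By this conversion, 8 blocks is exactly 4096 bytes and 9 blocks is 4608 bytes, i.e. 4K plus one sector.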

Changed 3 years ago by martin.langhoff

  • cc martin.langhoff added

Changed 3 years ago by Quozl

  • milestone changed from Not Triaged to 1.75-firmware

Fix milestone.

Changed 3 years ago by wmb@…

Increasing the poll-delay in dev/usb2/hcd/ehci/qhtd.fth from 100 us to 300 us appears to make the problem go away.

Changed 3 years ago by wad

Verified that increasing the poll-delay from 100 us to 300 us fixed USB errors from OFW on 1.75 A3 #37. This was a motherboard which failed with the value of poll-delay used in Q4A13H.

Changed 3 years ago by edmcnierney

Ed is updating this ticket to check Gary's notification account.

Changed 3 years ago by Quozl

  • status changed from new to closed
  • next_action changed from never set to no action
  • resolution set to wontfix

This hasn't been seen on B1, and it has been reasonably reliable on A3. Closing.
