Ticket #340 (closed defect: fixed)

Opened 8 years ago

Last modified 7 years ago

Marvell Wireless Upgrades: Hubs on other USB ports cause failures!

Reported by: mfoster
Owned by: rchokshi
Priority: high
Milestone: Trial-2
Component: wireless
Version:
Keywords:
Cc: alanc@…, rchokshi@…, marcelo
Action Needed:
Verified: no
Deployments affected:
Blocked By:
Blocking:

Description

Hi, Folks,

After many hours at the USB protocol analyzer, we can see some device combinations that routinely induce Boot2 upgrade or downgrade failures on the Marvell wireless module. All of these tests were performed booting from NAND.

Specifically, if a USB2 hub is present on one of the system's USB ports, with a USB1 device attached to it, the wireless upgrade will fail with nearly 100% certainty. It appears that the Split transactions which occur in this configuration confuse the Marvell module.

USB Protocol data for this failure is at:

 http://www.talix.com/Wireless_Upgrade_Failure_Hub-Keyboard.ufo

If a USB2 hub is present on one of the system's USB ports with a USB2 device attached to it, there is a very high probability that the wireless upgrade will fail.

 http://www.talix.com/Wireless_Upgrade_Failure_Hub-Disk.ufo

If a USB2 hub is present on one of the system's USB ports with no USB devices attached, the wireless upgrade will succeed.

 http://www.talix.com/Wireless_Upgrade_Success_Hub-Only.ufo

Note that it is the character of the USB traffic which appears to determine the success of firmware upgrades. IMPORTANT NOTE: we believe that the failures seen during programming attempts correlate well with failures loading the runtime RAM-based code in the normal driver.

Attachments

Wireless Mod.jpg (0.9 MB) - added by mfoster 8 years ago.

Change History

  Changed 8 years ago by mfoster

Hi, Gang,

New information. First, we asked some USB experts from Quanta to take a look at this problem. They brought along a high-end LeCroy USB protocol analyzer. We used a slightly different technique to induce wireless failure. In this case, we booted with only the wireless and an empty USB2 hub, then plugged in a keyboard. The insertion of the keyboard into the hub consistently triggered wireless failures. We could see this in two ways. First, if we attempted to program the Boot2 ROM, the programming would fail. Second, if we instead attempted to load the wireless driver, the driver would get into an infinite loop where it would attempt to load the firmware, fail, reset the wireless module, and repeat that cycle forever.

The Quanta engineers analyzed this failure, and found the following comparison between good and failed transactions:

 http://www.talix.com/Marvell_Failure.gif

Their summary is that it looks as though the chip may have gotten confused about the size of its FIFO and accepted a too-long data burst. In the good transaction, we see that the 88W8388 immediately returns a NYET condition. In the bad transaction, it sends an ACK after 5.3 seconds!

We'll leave further analysis of these failures to Marvell, because the next comment contains our workaround solution.

  Changed 8 years ago by mfoster

After a comment by Alan about the wireless module's reset this morning, we began testing whether manually resetting the wireless module would help to get the chip out of the error conditions we've been seeing. Simply connecting the reset pin to ground in fact worked perfectly. Every time we induced the failure conditions, resetting the module got everything back to proper functionality.

Skip forward about 6 hours, and we have a simple hardware fix that we will ECO into the B1 systems. If you look at the EC chip on the B1 boards (this is also on the pre-B boards), there is a test point, T188, connected to the EC's GPIOEE pin. Connecting a wire from that point to the wireless module's Reset pin lets us control the wireless reset in software. We then asked Ray to change the EC code to ensure that the GPIOEE pin is left in a tri-state/input condition, allowing the wireless module to be reset normally. As the third component, David Woodhouse changed the Libertas driver to activate the wireless reset line whenever the existing code attempts to reset the module. At the moment, the reset pulse is a very long 500 ms (>10X longer than necessary) to ensure that when it happens, it'll be visible. While the wireless module is held in reset, the wireless activity light will remain off.
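
As a rough illustration of the reset sequence described above (ground the reset pin, hold it for 500 ms, then return the GPIO to its tri-state/input condition), here is a minimal Python sketch. It is not the EC or Libertas code: the `ResetLine` class and `pulse_reset` function are hypothetical stand-ins, and only the 500 ms hold time and the idle tri-state condition come from this comment.

```python
import time
from typing import Callable, List


class ResetLine:
    """Hypothetical stand-in for the EC GPIO wired to the module's Reset pin."""

    def __init__(self) -> None:
        # Tri-state/input is the idle condition: the module runs normally.
        self.state = "input"
        self.history: List[str] = ["input"]

    def _set(self, state: str) -> None:
        self.state = state
        self.history.append(state)

    def drive_low(self) -> None:
        """Pull the reset pin to ground, holding the module in reset."""
        self._set("low")

    def release(self) -> None:
        """Return the pin to tri-state/input so the module can run."""
        self._set("input")


def pulse_reset(
    line: ResetLine,
    hold_s: float = 0.5,  # the deliberately long 500 ms pulse
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Hold the module in reset for hold_s seconds, then release it."""
    line.drive_low()
    try:
        sleep(hold_s)
    finally:
        # Always release, so an error cannot leave the module stuck in reset.
        line.release()
```

Passing a fake `sleep` (for example a list's `append` method) makes the sequence checkable without actually waiting half a second.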

With these changes, the wireless module works perfectly in our testing thus far. We still will need real fixes from Marvell, but this change appears to have dramatically improved the stability of wireless in the OLPC system.

  Changed 8 years ago by mfoster

  • milestone changed from BTest-1 to BTest-2

  Changed 8 years ago by jg

If you can get me eco directions, we can test on boards here, which would make me happy. :-).

  Changed 8 years ago by mfoster

Hi, Folks!

I've attached a JPEG image which shows the two points to connect to implement the ECO.

Cheers! MarkF

follow-up: ↓ 7   Changed 8 years ago by jg

Alan, Ronak,

Any progress on the underlying problem in the Boot2 code?

in reply to: ↑ 6   Changed 8 years ago by marcelo

Replying to jg:

Alan, Ronak, Any progress on the underlying problem in the Boot2 code?

Ronak claims that the USB 2.0 hub problems have been fixed, but they recommend a driver workaround for an initial boot2 "download firmware" failure:

"When the driver sends out the first boot2 command, it somehow gets lost on the USB bus. This could be a boot2 command to download firmware or a command to download a new boot2 code. This happens only when a USB 2.0 hub is attached on one of the USB ports (it doesn't really matter which port the USB hub is attached to). To circumvent this problem, we had to re-design the driver to send this first boot2 command more than once and wait for an ACK. At this point, we are not sure why this happens but given the new driver patches that we will send you and the new boot2 code (v3106), you should not see this problem anymore."

Mark (or someone else capable) needs to test boot2 v3106 with a USB analyzer and:

1) Confirm that this is not happening anymore:

Their summary is that it looks as though the chip may have gotten confused about the size of its FIFO and accepted a too-long data burst. In the good transaction, we see that the 88W8388 immediately returns a NYET condition. In the bad transaction, it sends an ACK after 5.3 seconds!

2) Reproduce and analyze the initial boot2 "download firmware" command failure.
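
For reference while testing, the driver workaround quoted above (resend the first boot2 command until it is ACKed) might look roughly like the sketch below. It is written in Python for readability and is not the actual Libertas patch: `send_cmd`, `wait_for_ack`, the attempt limit, and the timeout are all hypothetical, and `LossyBus` is only a fake bus that drops the first few commands.

```python
from typing import Callable


def send_first_boot2_command(
    send_cmd: Callable[[], None],
    wait_for_ack: Callable[[float], bool],
    max_attempts: int = 5,       # assumed limit, not from the ticket
    ack_timeout_s: float = 0.1,  # assumed timeout, not from the ticket
) -> int:
    """Resend the first boot2 command until it is ACKed.

    Returns the number of attempts used; raises TimeoutError if no
    attempt is acknowledged.
    """
    for attempt in range(1, max_attempts + 1):
        send_cmd()
        if wait_for_ack(ack_timeout_s):
            return attempt
    raise TimeoutError(f"no ACK after {max_attempts} attempts")


class LossyBus:
    """Fake bus that silently loses the first `drop` commands."""

    def __init__(self, drop: int) -> None:
        self.drop = drop
        self.sent = 0

    def send(self) -> None:
        self.sent += 1

    def wait_for_ack(self, timeout_s: float) -> bool:
        # An ACK only arrives once a command has made it past the drops.
        return self.sent > self.drop
```

With `LossyBus(drop=1)`, the first command is lost and the second is acknowledged, which mirrors the "first command gets lost" behaviour Ronak describes.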

  Changed 8 years ago by rchokshi

Hi Marcelo, Sorry for the delay in responding to you on this thread. Unfortunately, I don't have a USB trace to prove my point that the first boot2 command gets lost over the bus. I will need some time, which I probably don't have for the next few days - sorry about that again. Nevertheless, I think anybody with a USB analyzer should be able to capture this. It has been pretty consistent in our setup at least.

That said, I wouldn't waste people's time and effort on this problem at the moment. From the user's perspective, it is seamless and has no large effect.

By the way, consider this to be a response to your email with subject line "Wireless Boot2 testing" also.

I am not too sure about your comment on the chip getting confused about the size of the FIFO. Not sure where you got that explanation from.

Thanks Ronak

follow-up: ↓ 10   Changed 8 years ago by jg

  • cc marcelo added
  • milestone changed from BTest-2 to BTest-3

What's the state of this problem? No reports on a blocker in a couple months is not very friendly.

When last heard from, we believed this was different from the general upgrade bug caused by driving the flash at the wrong speed, since fixed (bug #318).

in reply to: ↑ 9   Changed 8 years ago by rchokshi

Marcelo, We owe you a USB analyzer trace to prove my point above. We have an A-test board which can be used for this - we are able to disable the on-board 8388 module and use an external 8388 reference board. Can you please point me to the build that can be used on this A-test board and which has the latest libertas driver?

  Changed 8 years ago by jg

  • priority changed from blocker to high
  • component changed from distro to wireless

What's the current state on this?

  Changed 8 years ago by rchokshi

As far as I understand, this issue has been fixed in the latest boot2 code v3107 that is being used starting from B-test2 machines.

  Changed 8 years ago by jg

IIRC, 3107 wasn't available in time for B2, but should be in B3.

I'll leave it open for the moment, unless someone verifies it is fixed in 3107.

follow-up: ↓ 15   Changed 7 years ago by jg

  • owner changed from mbletsas to rchokshi
  • verified unset

Is this fixed in 3107?

in reply to: ↑ 14   Changed 7 years ago by rchokshi

Replying to jg:

Is this fixed in 3107?

Yes. This was actually fixed in 3106. The change from 3106 --> 3107 was to resolve an EEPROM data corruption issue.

  Changed 7 years ago by jg

  • status changed from new to closed
  • resolution set to fixed