Ticket #9415 (new defect)

Opened 5 years ago

Last modified 3 years ago

improve error handling in sdhci driver

Reported by: wad
Owned by: dsaxena
Priority: high
Milestone: Opportunity
Component: kernel
Version: 1.5-A2
Keywords: XO-1.5 SD
Cc:
Action Needed: code
Verified: no
Deployments affected:
Blocked By:
Blocking:

Description

On an XO-1.5 A2 prototype (#14), running Linux kernel 2.6.30_xo1.5-20090717.0115.1.olpc.ba8f22b and firmware Q2E05, I encountered write errors when transferring large files to the SD card.

I was running the NAND test script (described at: http://wiki.laptop.org/go/NAND_Testing ) and obtained four errors over several thousand cycles of testing (where each cycle is 94 MB of reading and 20 MB of writing).

The errors were not, however, bit errors. Examination of the two test files for one of the errors (one file is generated randomly, then copied to make the second) shows that a large region (around a megabyte) is wrong.
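The comparison described above can be reproduced with a short script. The following is a minimal sketch, not the tool actually used for this ticket: the function name `diff_regions` and the 4096-byte block size are assumptions chosen for illustration. It reads both files block by block and reports the byte ranges that differ, merging adjacent differing blocks so that one large corrupted region shows up as a single range rather than as many scattered entries.

```python
def diff_regions(path_a, path_b, block_size=4096):
    """Return a list of (start, end) byte ranges where the two files differ.

    Ranges are half-open and have block_size granularity; adjacent
    differing blocks are merged into one range.
    """
    regions = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break  # both files exhausted
            step = max(len(a), len(b))
            if a != b:
                if regions and regions[-1][1] == offset:
                    # Extend the previous range instead of starting a new one.
                    regions[-1] = (regions[-1][0], offset + step)
                else:
                    regions.append((offset, offset + step))
            offset += step
    return regions


if __name__ == "__main__":
    import sys
    for start, end in diff_regions(sys.argv[1], sys.argv[2]):
        print(f"bytes {start}..{end - 1} differ ({end - start} bytes)")
```

Run against the attached hotfileA and hotfileB, a script like this would be expected to print one range of roughly a megabyte rather than isolated single-block differences, if the description above holds.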

The attached dmesg and /var/log/messages logs show identical information about an error encountered by the kernel at the time of the error.

Attachments

dmesg.log (28.5 kB) - added by wad 5 years ago.
dmesg dump from a laptop that had four different SD card write errors. Irrelevant sections were deleted from the logs.
messages (63.7 kB) - added by wad 5 years ago.
/var/log/messages from a laptop that had four SD write failures over time.
hotfileA (10.0 MB) - added by wad 5 years ago.
One random data test file
hotfileB (10.0 MB) - added by wad 5 years ago.
Second random data test file, which should be an exact copy of the first one, but isn't

Change History

Changed 5 years ago by wad

  • priority changed from blocker to high

The hardware SD card problems appear to be solved by adding 33 ohm damping resistors to the SD_CLK, SD_CMD, and SD_DATA lines. I took the four worst cases (laptops that couldn't complete a single 10 MB write without errors), ECO'd them, and have left them reading/writing all day without a single error. The two original test machines, running for over 24 hours, didn't show any errors either.

Physically, this ECO requires removing all the solder from CON2 pins 9, 1, 2, 7, and 8. The pins are then carefully pried up with a very sharp X-Acto blade while heating the pad. A 33 ohm resistor (SMD-0402 or 0603) is then soldered to the pad, and a wire run to the lifted pin. On the top side of the motherboard, R130 (underneath and slightly north of the VX855) needs to be replaced with a 33 ohm SMD-0402 resistor. Attempts at removing the SD socket to simplify the ECO are discouraged, as the socket (particularly the side tabs) is almost guaranteed to be damaged.

I contend that there should be better error handling in the driver. These errors were detected by the driver, and subsequent writes to the device (from the same application) did work. The device itself was undamaged (although if the writes occurred during filesystem updates, the filesystem was corrupted).
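This ticket does not say what form the improved handling should take. One possible reading, given that the errors were detected and later writes succeeded, is a bounded retry of the failed transfer before the error is allowed to reach the filesystem. The sketch below only illustrates that policy in user-space Python; it is not sdhci driver code, and `TransferError`, `write_with_retry`, the `write_block` callback, and the retry limit of 3 are all hypothetical names and values.

```python
class TransferError(Exception):
    """Hypothetical stand-in for a transfer error reported by the controller."""


def write_with_retry(write_block, block_no, data, max_retries=3):
    """Call write_block(block_no, data), retrying when it raises TransferError.

    Returns the number of attempts used. If the initial attempt and all
    max_retries retries fail, the last TransferError is re-raised so the
    caller still sees the failure instead of silently losing data.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            write_block(block_no, data)
            return attempts
        except TransferError:
            if attempts > max_retries:
                raise
```

The point of the bound is that a transient error, such as the signal-integrity problem described above, gets a chance to clear, while a persistent fault is still reported to the caller.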

Leaving this ticket open to reflect the needed code improvement. An interested tester can contact wad in order to obtain an unmodified A2 laptop which generates lots of the errors.

Changed 3 years ago by dsd

  • next_action changed from diagnose to code
  • component changed from not assigned to kernel
  • summary changed from SD write errors from Linux on XO 1.5 to improve error handling in sdhci driver
  • milestone changed from Not Triaged to Opportunity