Opened 4 years ago

Closed 4 years ago

#12701 closed defect (fixed)

13.2.0-9 crash in cache_reap kworker

Reported by: Quozl
Owned by: dsd
Priority: normal
Milestone: 13.2.0
Component: kernel
Version: Development build as of this date
Keywords:
Cc: wad
Blocked By:
Blocking:
Deployments affected:
Action Needed: diagnose
Verified: no

Description

About three hours into a runin pass, the display hung, and some interesting kernel messages appeared on serial port:

http://dev.laptop.org/~quozl/z/1UmgpO.txt

XO-4 B1 SKU292

Some previous similar footprints are in #12688.

Change History (11)

comment:1 Changed 4 years ago by Quozl

  • Action Needed changed from reproduce to diagnose

XO-4 B1 SKU292 did crash within half an hour; log at http://dev.laptop.org/~quozl/z/1UmuPd.txt

comment:2 Changed 4 years ago by dsd

  • Cc wad added
  • Owner set to dsd
  • Summary changed from 13.2.0-9 crash in runin to 13.2.0-9 crash in cache_reap kworker

Let's characterise this bug according to this particular failure on 13.2.0 build 9:

Unable to handle kernel paging request at virtual address XYZ
PC is at free_block
LR is at drain_array
Process kworker (running the cache_reap task)

often seen a few seconds after resume.

According to those criteria we have two occurrences shown here and one more at 12688#comment:41.
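For sorting saved serial logs against these criteria, a rough sketch follows. The function name is made up for illustration, and the patterns assume the log contains the oops text in the form quoted above; it only checks that all four markers appear somewhere in the file, not that they belong to the same oops.

```shell
# Hypothetical helper: return 0 if LOGFILE contains every marker from the
# failure signature described above, non-zero otherwise.
# Usage: matches_cache_reap_signature LOGFILE
matches_cache_reap_signature() {
    log="$1"
    grep -q 'Unable to handle kernel paging request at virtual address' "$log" &&
    grep -q 'PC is at free_block' "$log" &&
    grep -q 'LR is at drain_array' "$log" &&
    grep -q 'Process kworker' "$log"
}
```

For example, `matches_cache_reap_signature 1UmgpO.txt && echo "same footprint"` after downloading one of the linked logs.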

This seems like random memory corruption which first shows its face when a periodic cache reaping task is executed.

comment:3 Changed 4 years ago by wad

I would point out that we aren't seeing errors from memtest, nor any unexplained crashes of running programs, so this isn't "random" memory corruption.

comment:4 Changed 4 years ago by dsd

Here is a debug kernel to try: http://dev.laptop.org/~dsd/20130613/kernel-3.5.7_xo4-20130613.1744.olpc.bf1c060.armv7hl.rpm

It has some memory management debugging enabled.

comment:5 Changed 4 years ago by dsd

James, as you have already reproduced Slab corruption messages with this kernel, maybe you could experiment and see if removing the wireless and/or bluetooth modules makes the slab corruption go away.

comment:6 Changed 4 years ago by Quozl

Ok.

comment:7 Changed 4 years ago by Quozl

Looks like it is mwifiex.

On the XO-4 B1 SKU292 the mwifiex_sdio module was blacklisted, verified not present using lsmod, and the slab corruption did not occur.

On the XO-4 C2 SKU306 the btmrvl_sdio module was blacklisted, verified not present using lsmod, and the slab corruption continued to occur. Serial console log at http://dev.laptop.org/~quozl/z/1UnKV5.txt

On the XO-4 C1 SKU296, after blacklisting mwifiex_sdio and btmrvl_sdio, the rate of an occasional hang during boot increased from two in ten to nine in ten. Still gathering data.
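The blacklist-and-verify step used in these tests could be scripted roughly as below. This is a sketch, not the procedure actually used on the test units: both function names are hypothetical, and the configuration file path is left to the caller. The `blacklist` keyword is standard modprobe.d syntax; it stops automatic loading by alias but does not block an explicit modprobe.

```shell
# Hypothetical helper: write one "blacklist MODULE" line per module.
# Usage: blacklist_modules CONF_FILE MODULE...
blacklist_modules() {
    conf="$1"
    shift
    for mod in "$@"; do
        printf 'blacklist %s\n' "$mod"
    done > "$conf"
}

# Hypothetical helper: return 0 if MODULE is not listed in the first
# column of lsmod-style output read from stdin, 1 if it is listed.
# Usage: lsmod | module_absent MODULE
module_absent() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit found ? 1 : 0 }'
}
```

As root, something like `blacklist_modules /etc/modprobe.d/runin-debug.conf mwifiex_sdio` (file name chosen here only as an example), then after a reboot `lsmod | module_absent mwifiex_sdio && echo "not loaded"`.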

comment:8 Changed 4 years ago by Quozl

Looks like it is mwifiex suspend and resume.

Running with mwifiex, not btmrvl, with the same aggressive suspend and resume timing, and disabling the continuous interface up, down, and wireless scan:

touch /runin/no-wireless

did not change the symptom; slab corruption on fifth resume.

Switching from aggressive timing to standard timing:

ok change-tag TS RUNIN

reduced how often the symptom occurred over time, but still resulted in slab corruption on the fifth resume.

comment:9 Changed 4 years ago by Quozl

Disabling all tests except runin-gtk and runin-sus did not change the symptom; slab corruption after a small number of resumes. http://dev.laptop.org/~quozl/z/1UnNDR.txt

touch /runin/no-{accelerometer,battery,bluetooth,camera,\
cpu_temp,fscheck,light-sensor,memory,sdwrite,sound,wlan} \
/runin/extreme

comment:10 Changed 4 years ago by dsd

I managed to reproduce slab corruption with simple suspend/resume as you suggest. Pushed a fix to arm-3.5 62d14599eae3f.
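A bare suspend/resume loop of this kind might look like the sketch below. It is not the script that was actually used: the function name is hypothetical, the default suspend command (`rtcwake -m mem -s 10`) is an assumption that may not suit the XO-4, and the check relies on the debug kernel printing a message containing "Slab corruption" to the kernel log.

```shell
# Hypothetical helper: suspend repeatedly and stop at the first
# "Slab corruption" message found in the kernel log.
# Usage: repro_slab_corruption CYCLES [SUSPEND_CMD] [LOG_CMD]
# Returns 0 if corruption was seen, 1 if not, 2 if the suspend command failed.
repro_slab_corruption() {
    cycles="$1"
    suspend_cmd="${2:-rtcwake -m mem -s 10}"  # assumption: rtcwake works here
    log_cmd="${3:-dmesg}"
    i=1
    while [ "$i" -le "$cycles" ]; do
        # Deliberately unquoted so the command string splits into words.
        $suspend_cmd || return 2
        if $log_cmd | grep -q 'Slab corruption'; then
            echo "slab corruption after resume $i"
            return 0
        fi
        i=$((i + 1))
    done
    echo "no slab corruption in $cycles cycles"
    return 1
}
```

Run as root, for example `repro_slab_corruption 20`. The two optional arguments let the suspend and log commands be swapped for whatever the test machine provides.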

comment:11 Changed 4 years ago by dsd

  • Resolution set to fixed
  • Status changed from new to closed

We haven't seen any more crashes in this area, and I sent the patch upstream.
