Ticket #11658 (closed defect: fixed)

Opened 16 months ago

Last modified 15 months ago

OS28: Crashes in S/R on a busy network - related to WOL

Reported by: martin.langhoff
Owned by: dilinger
Priority: blocker
Milestone: 11.3.1
Component: kernel
Version: not specified
Keywords:
Cc: jnettlet, dilinger
Action Needed: no action
Verified: no
Deployments affected:
Blocked By:
Blocking:

Description

Dilinger dixit:

I can run a machine for a week w/ aggressive s/r on a quiet network. If I start pinging the machine from another machine, or pinging broadcast, it crashes within 30 mins.

Stock os28. no changes other than to enable wol w/ broadcast (I don't recall if I backed out that change; I can test again to verify, though)

Attachments

wol_hang-1.gz (0.6 MB) - added by pgf 16 months ago.
epitaph dump. serial disabled. wake-on-wlan loop.
wol_hang-2 (0.7 MB) - added by pgf 16 months ago.
same as wol_hang-1, smaller 1MB log buffer
wol_hang-3 (1.0 MB) - added by pgf 16 months ago.
yet another
wol_hang-4.noesc (1.0 MB) - added by pgf 16 months ago.
as above, pm_async disabled
wol_hang-5.noesc (0.6 MB) - added by pgf 16 months ago.
wol_hang-8.noesc (1.0 MB) - added by pgf 16 months ago.
cache flushed during printk, and more output logging
wol_hang-9.noesc (1.0 MB) - added by pgf 16 months ago.
same config as wol_hang-8
wol_hang-12.noesc (1.0 MB) - added by pgf 16 months ago.
%p --> %pF for workqueue handler prints, and dilinger's techteam object patch

Change History

Changed 16 months ago by pgf

epitaph dump. serial disabled. wake-on-wlan loop.

Changed 16 months ago by pgf

wol_hang-1.gz was obtained with a repetitive wake-on-wlan loop. another machine was set up to ping the laptop every 4 seconds. the laptop was running:

#!/bin/sh

stop powerd

ethtool -s eth0 wol umb

zcat /runin/sdkit-arm/forth.gz > /tmp/forth
chmod +x /tmp/forth

sed -i 's/^wdt-long /wdt-short /' /runin/sdkit-arm/watchdog.fth

trap "/runin/sdkit-arm/watchdog stop" 0

/runin/sdkit-arm/watchdog start

while sleep 3
do
	/runin/sdkit-arm/watchdog ping
	rtcwake -s11 -mmem
done
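
The peer side of this test is not shown in the ticket. As a rough sketch, assuming a stock `ping` that supports the `-i` interval option, the other machine could run something like the following. The `ping_laptop` helper name and the address in the usage comment are placeholders, not taken from the report:

```shell
#!/bin/sh
# Peer-side sketch (hypothetical helper, not from the ticket):
# send one echo request every INTERVAL seconds so that each packet
# can act as a wake-up event for the suspended laptop.
ping_laptop() {
    target=$1            # laptop address; the ticket does not give one
    interval=${2:-4}     # default matches the 4-second interval described above
    ping -i "$interval" "$target"
}

# Example (192.0.2.10 is a documentation-only address; substitute the laptop's IP):
# ping_laptop 192.0.2.10
```

With the laptop looping through `rtcwake -s11 -mmem`, a 4-second ping interval means several packets should arrive during each suspend window.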

Changed 16 months ago by pgf

forgot to say:

 kernel commandline was: initcall_debug debug no_console_suspend log_buf_len=4M

i'll reduce the buffer length.

Changed 16 months ago by pgf

same as wol_hang-1, smaller 1MB log buffer

Changed 16 months ago by pgf

yet another

Changed 16 months ago by pgf

as above, pm_async disabled

Changed 16 months ago by pgf

Changed 16 months ago by pgf

one more note about all these hangs: the machine has siv120d blacklisted.

Changed 16 months ago by pgf

wol_hang-8.noesc is from a kernel now patched with 1220526, for more output. in addition, there's now a flush_cache_all() at the end of printk, to eliminate the junk we're getting at the log rollover point:

diff --git a/kernel/printk.c b/kernel/printk.c
index a032d5e..0eab4e1 100644
--- a/kernel/printk.c 
+++ b/kernel/printk.c 
@@ -16,6 +16,8 @@ 
  *     01Mar01 Andrew Morton
  */

+#include <asm/cacheflush.h> 
+ 
 #include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/tty.h>
@@ -832,6 +834,8 @@ static inline void printk_delay(void) 
        }
 }

+extern bool olpc_xo_1_75_is_suspending; 
+ 
 asmlinkage int vprintk(const char *fmt, va_list args)
 {
        int printed_len = 0;
@@ -969,6 +973,9 @@ asmlinkage int vprintk(const char *fmt, va_list args) 
 out_restore_irqs:
        raw_local_irq_restore(flags);

+       if (unlikely(olpc_xo_1_75_is_suspending)) 
+           flush_cache_all(); 
+ 
        preempt_enable();
        return printed_len;
 }

Changed 16 months ago by pgf

cache flushed during printk, and more output logging

Changed 16 months ago by pgf

  • cc dilinger added

Changed 16 months ago by pgf

same config as wol_hang-8

Changed 16 months ago by pgf

i'm not going to attach -10 and -11, but what's interesting is that all three of -9, -10, and -11 end right at

<3>[  788.931408] process_one_work: worker thread bf04e940 start

and bf04e940 is the address of if_sdio_host_to_card_worker().
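
For anyone repeating this lookup: a raw handler address from the log can be mapped back to a symbol with a kallsyms-style table. The sketch below is only an illustration, not necessarily how the address was resolved here. `resolve_addr` is a hypothetical helper, and it assumes the table has the usual `address type name` layout with fixed-width lowercase hex addresses (reading real addresses from /proc/kallsyms may require root):

```shell
#!/bin/sh
# resolve_addr ADDR [SYMFILE]
# Hypothetical helper: print the symbol whose start address is the
# highest one that is still <= ADDR. SYMFILE defaults to /proc/kallsyms.
resolve_addr() {
    LC_ALL=C awk -v target="$1" '
        BEGIN { target = tolower(target) }
        # Fixed-width lowercase hex sorts correctly as plain strings.
        # Appending "" forces awk to compare as strings, not numbers.
        length($1) == length(target) &&
        ($1 "") <= (target "") && ($1 "") >= (best "") {
            best = $1
            name = $3
        }
        END {
            if (name != "") { print name } else { exit 1 }
        }
    ' "${2:-/proc/kallsyms}"
}

# Example:
# resolve_addr bf04e940
```

Called as `resolve_addr bf04e940` on the affected machine, this should name the same handler identified above, provided the module is loaded at the same address. The %pF change in wol_hang-12 makes the kernel print the symbol directly, so the manual lookup is only needed for the earlier logs.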

Changed 16 months ago by pgf

%p --> %pF for workqueue handler prints, and dilinger's techteam object patch

Changed 16 months ago by martin.langhoff

  • milestone changed from Not Triaged to 11.3.1

Changed 16 months ago by pgf

this hang is very clearly related to the wireless and/or sdio drivers. the logs we've collected point to the hang occurring somewhere in sdhci when we try to send a command to the card after having suspended sdhci. as a test, if we comment out the disable of the clock in sdhci_pxa_suspend(), we get through the previous hang point, only to fall over with

mmc1: Timeout waiting for hardware interrupt.

Changed 15 months ago by martin.langhoff

  • next_action changed from never set to test in build

Test in OS30

Changed 15 months ago by greenfeld

The major WOL hang issues seem to have been resolved in 11.3.1 os884/os31 on an XO-1.75.

However, there is still at least one less-common case where the libertas driver hangs (#11711).

Changed 15 months ago by greenfeld

  • status changed from new to closed
  • next_action changed from test in build to no action
  • resolution set to fixed