Ticket #11658 (closed defect: fixed)

Opened 2 years ago

Last modified 2 years ago

OS28: Crashes in S/R on a busy network - related to WOL

Reported by: martin.langhoff
Owned by: dilinger
Priority: blocker
Milestone: 11.3.1
Component: kernel
Version: not specified
Keywords:
Cc: jnettlet, dilinger
Action Needed: no action
Verified: no
Deployments affected:
Blocked By:
Blocking:

Description

Dilinger dixit:

I can run a machine for a week w/ aggressive s/r on a quiet network. If I start pinging the machine from another machine, or pinging broadcast, it crashes within 30 mins.

Stock os28. no changes other than to enable wol w/ broadcast (I don't recall if I backed out that change; I can test again to verify, though)

Attachments

wol_hang-1.gz (0.6 MB) - added by pgf 2 years ago.
epitaph dump. serial disabled. wake-on-wlan loop.
wol_hang-2 (0.7 MB) - added by pgf 2 years ago.
same as wol_hang-1, smaller 1MB log buffer
wol_hang-3 (1.0 MB) - added by pgf 2 years ago.
yet another
wol_hang-4.noesc (1.0 MB) - added by pgf 2 years ago.
as above, pm_async disabled
wol_hang-5.noesc (0.6 MB) - added by pgf 2 years ago.
wol_hang-8.noesc (1.0 MB) - added by pgf 2 years ago.
cache flushed during printk, and more output logging
wol_hang-9.noesc (1.0 MB) - added by pgf 2 years ago.
same config as wol_hang-8
wol_hang-12.noesc (1.0 MB) - added by pgf 2 years ago.
%p --> %pF for workqueue handler prints, and dilinger's techteam object patch

Change History

Changed 2 years ago by pgf

Attachment wol_hang-1.gz added: epitaph dump. serial disabled. wake-on-wlan loop.

Changed 2 years ago by pgf

wol_hang-1.gz was obtained with a repetitive wake-on-wlan loop. another machine was set up to ping the laptop every 4 seconds. the laptop was running:

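#!/bin/sh

# stop powerd (presumably so it does not drive suspend/resume itself)
stop powerd

# enable wake-on-LAN for unicast (u), multicast (m) and broadcast (b) packets
ethtool -s eth0 wol umb

# unpack the forth binary (presumably used by the watchdog script)
zcat /runin/sdkit-arm/forth.gz > /tmp/forth
chmod +x /tmp/forth

# switch the watchdog from wdt-long to wdt-short (going by the names, a shorter timeout)
sed -i 's/^wdt-long /wdt-short /' /runin/sdkit-arm/watchdog.fth

# stop the watchdog when this script exits
trap "/runin/sdkit-arm/watchdog stop" 0

/runin/sdkit-arm/watchdog start

# every 3 seconds: ping the watchdog, then suspend to RAM with an 11 second RTC wake alarm
while sleep 3
do
	/runin/sdkit-arm/watchdog ping
	rtcwake -s11 -mmem
done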
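
The pinging side is not shown in the ticket. A minimal sketch of what the second machine might run, assuming iputils `ping`; the function name `ping_laptop`, the `PING` override and the address 192.0.2.10 (a documentation placeholder, not the real laptop address) are all hypothetical:

```shell
#!/bin/sh
# Sketch only: send one echo request every 4 seconds to the laptop under test.
# PING can be overridden (e.g. PING=echo) to dry-run the command line.
ping_laptop() {
    "${PING:-ping}" -i 4 "$1"
}

# Example (runs until interrupted):
#   ping_laptop 192.0.2.10
```

For the broadcast case mentioned in the description, iputils `ping` additionally needs `-b` before it will send to a broadcast address.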

Changed 2 years ago by pgf

forgot to say:

 kernel commandline was: initcall_debug debug no_console_suspend log_buf_len=4M

i'll reduce the buffer length.

Changed 2 years ago by pgf

Attachment wol_hang-2 added: same as wol_hang-1, smaller 1MB log buffer

Changed 2 years ago by pgf

Attachment wol_hang-3 added: yet another

Changed 2 years ago by pgf

Attachment wol_hang-4.noesc added: as above, pm_async disabled

Changed 2 years ago by pgf

Attachment wol_hang-5.noesc added.

Changed 2 years ago by pgf

one more note about all these hangs: the machine has siv120d blacklisted.

Changed 2 years ago by pgf

wol_hang-8.noesc is from a kernel now patched with 1220526, for more output. in addition, there's now a flush_cache_all() at the end of printk, to eliminate the junk we're getting at the log rollover point:

diff --git a/kernel/printk.c b/kernel/printk.c
index a032d5e..0eab4e1 100644
--- a/kernel/printk.c 
+++ b/kernel/printk.c 
@@ -16,6 +16,8 @@ 
  *     01Mar01 Andrew Morton
  */

+#include <asm/cacheflush.h> 
+ 
 #include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/tty.h>
@@ -832,6 +834,8 @@ static inline void printk_delay(void) 
        }
 }

+extern bool olpc_xo_1_75_is_suspending; 
+ 
 asmlinkage int vprintk(const char *fmt, va_list args)
 {
        int printed_len = 0;
@@ -969,6 +973,9 @@ asmlinkage int vprintk(const char *fmt, va_list args) 
 out_restore_irqs:
        raw_local_irq_restore(flags);

+       if (unlikely(olpc_xo_1_75_is_suspending)) 
+           flush_cache_all(); 
+ 
        preempt_enable();
        return printed_len;
 }

Changed 2 years ago by pgf

Attachment wol_hang-8.noesc added: cache flushed during printk, and more output logging

Changed 2 years ago by pgf

  • cc dilinger added

Changed 2 years ago by pgf

Attachment wol_hang-9.noesc added: same config as wol_hang-8

Changed 2 years ago by pgf

i'm not going to attach -10 and -11, but what's interesting is that all three of -9, -10, and -11 end right at

<3>[  788.931408] process_one_work: worker thread bf04e940 start

and bf04e940 is the address of if_sdio_host_to_card_worker().
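
Pulling the last handler address out of a log and matching it against a symbol listing can be scripted. A minimal sketch, assuming the log lines look like the one quoted above and that a `/proc/kallsyms`-style listing (address, type, name per line) from the same boot is available; the function names `last_worker_addr` and `resolve_sym` are made up for illustration:

```shell
#!/bin/sh
# Print the handler address from the last
# "process_one_work: worker thread ADDR start" line in log file $1.
last_worker_addr() {
    sed -n 's/.*process_one_work: worker thread \([0-9a-f][0-9a-f]*\) start.*/\1/p' "$1" | tail -n 1
}

# Print the symbol name whose address field equals $1 in the
# kallsyms-style file $2. The appended "" forces a string comparison,
# so hex addresses are never compared as numbers.
resolve_sym() {
    awk -v addr="$1" '($1 "") == (addr "") { print $3; exit }' "$2"
}

# Example:
#   resolve_sym "$(last_worker_addr wol_hang-9.noesc)" /proc/kallsyms
```

The exact-match lookup only works here because the logged value is the start address of the handler; an address inside a function would need a nearest-lower-symbol search instead.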

Changed 2 years ago by pgf

Attachment wol_hang-12.noesc added: %p --> %pF for workqueue handler prints, and dilinger's techteam object patch

Changed 2 years ago by martin.langhoff

  • milestone changed from Not Triaged to 11.3.1

Changed 2 years ago by pgf

this hang is very clearly related to the wireless and/or sdio drivers. the logs we've collected point to the hang occurring somewhere in sdhci when we try to send a command to the card after having suspended sdhci. as a test, if we comment out the disable of the clock in sdhci_pxa_suspend(), we get through the previous hang point, only to fall over with

mmc1: Timeout waiting for hardware interrupt.

Changed 2 years ago by martin.langhoff

  • next_action changed from never set to test in build

Test in OS30

Changed 2 years ago by greenfeld

The major WOL hang issues seem to have been resolved in 11.3.1 os884/os31 on an XO-1.75.

However, there is still at least one less-common case where the libertas driver hangs (#11711).

Changed 2 years ago by greenfeld

  • status changed from new to closed
  • next_action changed from test in build to no action
  • resolution set to fixed