hadoop-common-issues mailing list archives

From "Jean-Baptiste Note (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-10960) hadoop cause system crash with “soft lock” and “hard lock”
Date Tue, 12 Aug 2014 05:36:12 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093762#comment-14093762 ]

Jean-Baptiste Note commented on HADOOP-10960:
---------------------------------------------

We saw something very similar to this problem (among others) recurring on RHEL5.
It appears that the kernel, after an initial, legitimate soft lockup report (for instance because
of IO contention), can go into a loop of reporting soft lockups and, by the sheer amount
of data spewed, lock itself up until it panics.

For us it was due to dumping data to the (relatively slow) serial console; for you it may
be due to dumping data to disk, which is presumably the cause of contention in your case.
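
To gauge how much of /var/log/messages these reports take up, a rough sketch along the
following lines may help. It assumes the message format shown in the quoted log below; the
function name count_soft_lockups is made up for illustration.

```python
import re
from collections import Counter

# Matches report lines such as:
#   kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def count_soft_lockups(lines):
    """Count soft lockup reports per (command name, CPU number)."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            counts[(match.group("comm"), int(match.group("cpu")))] += 1
    return counts
```

Passing an open file works as well, for example count_soft_lockups(open("/var/log/messages")).
A count that keeps climbing for the same process and CPU would be consistent with the
reporting loop described above, though it does not prove it.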

Once you've ruled out real issues (controller problems, for instance), you may want
to investigate one of the following:
0) reduce kernel verbosity on the console (provided that also reduces the amount of data dumped to
/var/log/messages; I'm not familiar with your setup, as we log everything remotely)
1) disable the reboot on soft lockup
2) disable disk logging of kernel messages / log to tmpfs / log to a separate, dedicated system
*disk*
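
Before changing anything, it may be worth checking what the machine is currently set to.
The sketch below reads a few sysctls under /proc/sys. Mapping options 0) and 1) to
kernel.printk, kernel.softlockup_panic and kernel.panic is my assumption, not something
verified on your kernel, and the helper names are made up for illustration. A sysctl that
does not exist is reported as None.

```python
from pathlib import Path

# Order of the four integers in /proc/sys/kernel/printk.
PRINTK_FIELDS = (
    "console_loglevel",
    "default_message_loglevel",
    "minimum_console_loglevel",
    "default_console_loglevel",
)


def parse_printk(text):
    """Split the four integers of kernel.printk into named fields."""
    values = [int(v) for v in text.split()]
    if len(values) != len(PRINTK_FIELDS):
        raise ValueError("expected 4 integers, got %r" % text)
    return dict(zip(PRINTK_FIELDS, values))


def read_sysctl(name, root="/proc/sys"):
    """Return the raw text of a sysctl such as 'kernel.printk', or None if absent."""
    path = Path(root, *name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None


def report(root="/proc/sys"):
    """Collect the settings assumed to be relevant to options 0) and 1)."""
    printk = read_sysctl("kernel.printk", root)
    return {
        "printk": parse_printk(printk) if printk is not None else None,
        "softlockup_panic": read_sysctl("kernel.softlockup_panic", root),
        "panic": read_sysctl("kernel.panic", root),
    }
```

Calling report() on an affected node shows the console log level and whether a soft lockup
is configured to panic the machine. The sketch only reads values; changing them is left to
sysctl or /etc/sysctl.conf.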

HTH

> hadoop cause system crash with “soft lock” and “hard lock”
> ----------------------------------------------------------
>
>                 Key: HADOOP-10960
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10960
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>         Environment: redhat rhel 6.3, 6.4, 6.5
> jdk1.7.0_45
> hadoop2.2
>            Reporter: linbao111
>            Priority: Critical
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I am running Hadoop 2.2 on Red Hat 6.3-6.5, and all of my machines crashed after a while.
> /var/log/messages shows repeatedly:
> Aug 11 06:30:42 jn4_73_128 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]
> Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> Aug 11 06:30:42 jn4_73_128 kernel: CPU 1 
> Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> Aug 11 06:30:42 jn4_73_128 kernel: 
> Aug 11 06:30:42 jn4_73_128 kernel: Pid: 11508, comm: jsvc Tainted: G        W  ---------------    2.6.32-279.el6.x86_64 #1 Dell Inc. PowerEdge R510/084YMW
> Aug 11 06:30:42 jn4_73_128 kernel: RIP: 0010:[<ffffffff8104d088>]  [<ffffffff8104d088>] wait_for_rqlock+0x28/0x40
> Aug 11 06:30:42 jn4_73_128 kernel: RSP: 0018:ffff8807786c3ee8  EFLAGS: 00000202
> Aug 11 06:30:42 jn4_73_128 kernel: RAX: 00000000f6e9f6e1 RBX: ffff8807786c3ee8 RCX: ffff880028216680
> Aug 11 06:30:42 jn4_73_128 kernel: RDX: 00000000fffff6e9 RSI: ffff88061cd29370 RDI: 0000000000000286
> Aug 11 06:30:42 jn4_73_128 kernel: RBP: ffffffff8100bc0e R08: 0000000000000001 R09: 0000000000000001
> Aug 11 06:30:42 jn4_73_128 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000286
> Aug 11 06:30:42 jn4_73_128 kernel: R13: ffff8807786c3eb8 R14: ffffffff810e0f6e R15: ffff8807786c3e48
> Aug 11 06:30:42 jn4_73_128 kernel: FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
> Aug 11 06:30:42 jn4_73_128 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Aug 11 06:30:42 jn4_73_128 kernel: CR2: 0000000000e5bd70 CR3: 0000000001a85000 CR4: 00000000000006e0
> Aug 11 06:30:42 jn4_73_128 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Aug 11 06:30:42 jn4_73_128 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Aug 11 06:30:42 jn4_73_128 kernel: Process jsvc (pid: 11508, threadinfo ffff8807786c2000, task ffff880c1def3500)
> Aug 11 06:30:42 jn4_73_128 kernel: Stack:
> Aug 11 06:30:42 jn4_73_128 kernel: ffff8807786c3f68 ffffffff8107091b 0000000000000000 ffff8807786c3f28
> Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff880701735260 ffff880c1def39c8 ffff880c1def39c8 0000000000000000
> Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff8807786c3f28 ffff8807786c3f28 ffff8807786c3f78 00007f092d0ad700
> Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
> Aug 11 06:30:42 jn4_73_128 kernel: Code: ff ff 90 55 48 89 e5 0f 1f 44 00 00 48 c7 c0 80 66 01 00 65 48 8b 0c 25 b0 e0 00 00 0f ae f0 48 01 c1 eb 09 0f 1f 80 00 00 00 00 <f3> 90 8b 01 89 c2 c1 fa 10 66 39 c2 75 f2 c9 c3 0f 1f 84 00 00
> Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
> and finally crashed
> crash /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux  /opt/crash/127.0.0.1-2014-08-10-09\:47\:38/vmcore
> crash 6.1.0-5.el6
> Copyright (C) 2002-2012  Red Hat, Inc.
> Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
> Copyright (C) 1999-2006  Hewlett-Packard Co
> Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
> Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
> Copyright (C) 2005, 2011  NEC Corporation
> Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
> Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
> This program is free software, covered by the GNU General Public License,
> and you are welcome to change it and/or distribute copies of it under
> certain conditions.  Enter "help copying" to see the conditions.
> This program has absolutely no warranty.  Enter "help warranty" for details.
> GNU gdb (GDB) 7.3.1
> Copyright (C) 2011 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-unknown-linux-gnu"...
> please wait... (determining panic task)         
> WARNING: active task ffff881071850040 on cpu 12 not found in PID hash
>       KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux
>     DUMPFILE: /opt/crash/127.0.0.1-2014-08-10-09:47:38/vmcore  [PARTIAL DUMP]
>         CPUS: 24
>         DATE: Sun Aug 10 09:47:32 2014
>       UPTIME: 7 days, 16:00:19
> LOAD AVERAGE: 11.01, 3.11, 1.08
>        TASKS: 724
>     NODENAME: master1.otocyon.com
>      RELEASE: 2.6.32-431.5.1.el6.x86_64
>      VERSION: #1 SMP Fri Jan 10 14:46:43 EST 2014
>      MACHINE: x86_64  (1895 Mhz)
>       MEMORY: 64 GB
>        PANIC: "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0"
>          PID: 23976
>      COMMAND: "sh"
>         TASK: ffff881071850aa0  [THREAD_INFO: ffff880a05c80000]
>          CPU: 0
>        STATE: TASK_INTERRUPTIBLE (PANIC)
> crash> bt
> PID: 23976  TASK: ffff881071850aa0  CPU: 0   COMMAND: "sh"
>  #0 [ffff880028207b50] machine_kexec at ffffffff81038f3b
>  #1 [ffff880028207bb0] crash_kexec at ffffffff810c5d82
>  #2 [ffff880028207c80] panic at ffffffff8152751a
>  #3 [ffff880028207d00] watchdog_overflow_callback at ffffffff810e696d
>  #4 [ffff880028207d20] __perf_event_overflow at ffffffff8111c847
>  #5 [ffff880028207da0] perf_event_overflow at ffffffff8111ce14
>  #6 [ffff880028207db0] intel_pmu_handle_irq at ffffffff81022d87
>  #7 [ffff880028207e90] perf_event_nmi_handler at ffffffff8152bd69
>  #8 [ffff880028207ea0] notifier_call_chain at ffffffff8152d825
>  #9 [ffff880028207ee0] atomic_notifier_call_chain at ffffffff8152d88a
> #10 [ffff880028207ef0] notify_die at ffffffff810a153e
> #11 [ffff880028207f20] do_nmi at ffffffff8152b4eb
> It happened on machines from different vendors, and I have tried updating to the latest
> kernel from Red Hat. Can anyone with the same experience help?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
