hbase-user mailing list archives

From Ferdy <ferdy.gal...@kalooga.com>
Subject Re: Task process exit with nonzero status of 134...
Date Mon, 08 Mar 2010 11:24:33 GMT
Hi,

We have had a lot of these crashes in the past. Random jobs were
crashing with exit code 134. Our environment is also linux-amd64. We
tried all sorts of Hadoop versions and JVM deployments, but none of
them had any positive effect.
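For reference, a shell reports a child process killed by signal N as exit status 128 + N. Assuming Hadoop reports the task JVM's status the same way, 134 = 128 + 6 points at signal 6 (SIGABRT). That fits a JVM aborting after a fatal error like the SIGSEGV in the quoted stdout further down. Below is a small sketch for decoding such statuses. The decode_status function is a made-up helper for illustration, not part of Hadoop:

```shell
#!/usr/bin/env bash
# decode_status is a hypothetical helper, not part of Hadoop.
# Shell convention: a status above 128 means "killed by signal (status - 128)".
decode_status() {
  local status=$1 sig
  if [ "$status" -gt 128 ]; then
    sig=$((status - 128))
    # kill -l N prints the signal name without the SIG prefix, e.g. ABRT
    echo "signal $sig (SIG$(kill -l "$sig"))"
  else
    echo "exit code $status"
  fi
}

decode_status 134   # prints: signal 6 (SIGABRT)
```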

We finally figured out it was a deep-rooted hardware problem.
Communication between different cores of the CPU could get corrupted
once in a while. This was due to a bad combination of mainboard, CPU
and/or memory. In our case the problem was solved by replacing all
mainboards.

We could pinpoint and reproduce the problem using the following bash 
command (run as root):

    while /bin/true; do
      taskset -c 0 echo -ne '\0272G@\0306\0256yY\0210\0304\0004\0327A\0024\0343\0034\0252\0016V\r\0232\0024\0334\0233\0333\0356\0311A\0367\0375Ewgkk\0253\0373\0351\007%' | taskset -c 2 hexdump -b
    done | grep 0000020 | grep -v 351

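If you see any output on the console, your hardware is affected: the
dump line at offset 0000020 should always contain the byte 351, so any
line that gets printed is a corrupted copy. If you see no output for
several minutes (or perhaps an hour), your machine is unlikely to be
broken.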
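The loop above runs forever and only reports through the console. Below is a bounded variant that counts mismatches. It is a sketch, not the exact procedure used here. It assumes Linux, util-linux taskset, hexdump, an external echo that understands -ne (GNU coreutils), and that the CPU numbers passed in exist. It still has to run as root. The function names check_once and run_checks are hypothetical:

```shell
#!/usr/bin/env bash
# Bounded variant of the one-liner (sketch). Assumes taskset, hexdump and a
# GNU coreutils echo on PATH; check_once and run_checks are hypothetical names.

# The 39-byte test pattern from the original command, in echo -e escapes.
payload='\0272G@\0306\0256yY\0210\0304\0004\0327A\0024\0343\0034\0252\0016V\r\0232\0024\0334\0233\0333\0356\0311A\0367\0375Ewgkk\0253\0373\0351\007%'

# check_once SRC DST: write the pattern on CPU SRC, dump it on CPU DST.
# hexdump -b prints offsets in hex, so the "0000020" line holds bytes
# 0x20..0x26, one of which is octal 351. Output (exit status 0) means that
# line arrived without the 351 byte.
check_once() {
  taskset -c "$1" echo -ne "$payload" | taskset -c "$2" hexdump -b \
    | grep 0000020 | grep -v 351
}

# run_checks N SRC DST: run N round trips, report each mismatch on stderr,
# and print the number of mismatches on stdout.
run_checks() {
  local n=$1 src=$2 dst=$3 bad=0 i=0 out
  while [ "$i" -lt "$n" ]; do
    if out=$(check_once "$src" "$dst"); then
      bad=$((bad + 1))
      echo "iteration $i: $out" >&2
    fi
    i=$((i + 1))
  done
  echo "$bad"
}
```

For example, as root, with the functions saved in a file (hwcheck.sh is just a placeholder name):

    . ./hwcheck.sh && run_checks 100000 0 2

A count above 0 corresponds to the console output of the original loop. Bounding the loop and taking the CPU numbers as arguments makes it easier to leave the check running unattended or to try other core pairs.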

Hope this is of some help to you.

Ferdy

zward3x wrote:
> Thanks for all the help.
>
> I will install u17; hopefully this will resolve the issue.
>
>
>
> Jean-Daniel Cryans-2 wrote:
>   
>> As I feared, you use the unholy u18... please revert to u17.
>>
>> See this thread for more information:
>> http://www.mail-archive.com/common-user@hadoop.apache.org/msg04633.html
>>
>> J-D
>>
>> On Sun, Mar 7, 2010 at 1:32 PM, zward3x <pasalic.zaharije@gmail.com> wrote:
>>     
>>> $ java -version
>>> java version "1.6.0_18"
>>> Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
>>> Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)
>>>
>>> There is nothing in stderr, but here is part of stdout:
>>>
>>> #
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> #  SIGSEGV (0xb) at pc=0x00002b19ef8cc34e, pid=12633, tid=1104492864
>>> #
>>> # JRE version: 6.0_18-b07
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode linux-amd64 )
>>> # Problematic frame:
>>> # V  [libjvm.so+0x2de34e]
>>> #
>>> # An error report file with more information is saved as:
>>> # /hadoop/mapred/local/taskTracker/jobcache/job_201003072002_0002/attempt_201003072002_0002_r_000019_0/work/hs_err_pid12633.log
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> #   http://java.sun.com/webapps/bugreport/crash.jsp
>>> #
>>>
>>> Also, the file mentioned above (hs_err_pid12633.log) does not exist.
>>>
>>>
>>>
>>> Jean-Daniel Cryans-2 wrote:
>>>       
>>>>> I'm using Hadoop 0.20.1 and HBase 0.20.3
>>>>>           
>>>> Sorry I meant java version.
>>>>
>>>>         
>>>>> I already tried putting
>>>>>
>>>>> -XX:ErrorFile=/opt/hadoop/hadoop/logs/java/java_error%p.log
>>>>>
>>>>> in hadoop-env.sh as HADOOP_OPTS, but after the reduce crash I did not
>>>>> find any file at that path.
>>>>>           
>>>> Todd doesn't talk about that; he said:
>>>>
>>>>         
>>>>> Generally along with a nonzero exit code you should see something in
>>>>> the stderr for that attempt. If you look on the TaskTracker inside
>>>>> logs/userlogs/attempt_<the failed attempt>/stderr do you see anything
>>>>> useful?
>>>>>           
>>>>         
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Task-process-exit-with-nonzero-status-of-134...-tp27814144p27814802.html
>>> Sent from the HBase User mailing list archive at Nabble.com.
>>>
>>>
>>>       
>>     
>
>   

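About the -XX:ErrorFile attempt in the quoted thread: in Hadoop 0.20, HADOOP_OPTS is used by the bin/hadoop script for the daemons and command-line tools. Task JVMs are launched with the flags in mapred.child.java.opts instead, which would likely explain why no file appeared at that path. The hs_err_pid12633.log path in the crash output is also inside the attempt's jobcache work directory, which the TaskTracker normally removes once the attempt is finished. The sketch below assumes the Hadoop 0.20 property names and reuses the path from the thread. Set these properties in the configuration the job is submitted with (the client's mapred-site.xml or the job's own configuration), since job values generally take precedence over a TaskTracker's local settings. Also make sure the directory exists and is writable by the task user on every TaskTracker node:

```xml
<!-- Overrides the default child JVM options, so the default -Xmx200m
     heap setting is repeated here alongside the error-file flag. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m -XX:ErrorFile=/opt/hadoop/hadoop/logs/java/java_error%p.log</value>
</property>

<!-- Keep the work directory of failed attempts (where the JVM writes
     hs_err_pid*.log by default). For debugging only: the space is not
     reclaimed automatically. -->
<property>
  <name>keep.failed.task.files</name>
  <value>true</value>
</property>
```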