hbase-user mailing list archives

From Pablo Musa <pa...@psafe.com>
Subject Re: RegionServers Crashing every hour in production env
Date Sun, 10 Mar 2013 18:59:33 GMT
 > That combo should be fine.

Great!!

 > If JVM is full GC'ing, the application is stopped.
 > The below does not look like a full GC but that is a long pause in system
 > time, enough to kill your zk session.

Exactly. This pause is making ZooKeeper expire the RS session, and the RS
then shuts down (logs at the end of the email).
But the question remains: what is causing this pause??!!

 > You swapping?

I don't think so (stats below).

 > Hardware is good?

Yes, it is a 16-processor machine with 74GB of RAM and plenty of disk space.
Below are some metrics I have heard suggested. I hope they help.


** I am also having some problems with the datanodes[1], which are having
trouble writing. I really think the issues are related, but I cannot solve
either of them :(

Thanks again,
Pablo

[1] 
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201303.mbox/%3CCAJzooYfS-F1KS+jGOPUt15PwFjcCSzigE0APeM9FXaCrQfVbyQ@mail.gmail.com%3E

top - 15:38:04 up 297 days, 21:03,  2 users,  load average: 4.34, 2.55, 1.28
Tasks: 528 total,   1 running, 527 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  0.2%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  74187256k total, 29493992k used, 44693264k free,  5836576k buffers
Swap: 51609592k total,   128312k used, 51481280k free,  1353400k cached
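The top output above shows about 128 MB in swap machine-wide; whether the region-server JVM itself is swapped can be checked from `/proc`. A minimal sketch (the pid is hypothetical; the `VmSwap` field may be missing on older kernels, in which case this prints 0):

```shell
# swap_kb: sum VmSwap (kB) over /proc/<pid>/status content read from stdin.
# VmSwap may be absent on older kernels; the sum then defaults to 0.
swap_kb() {
  awk '/^VmSwap:/ { total += $2 } END { print total + 0 }'
}

# Typical use against the region server JVM (pid 12345 is hypothetical):
#   swap_kb < /proc/12345/status
```

A nonzero result for the RS pid would point at swapping despite the healthy machine-wide numbers.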

]$ vmstat -w
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
  2  0     128312   32416928    5838288    5043560    0    0   202   53    0    0   2  1  96  1  0

]$ sar
                CPU     %user     %nice   %system   %iowait    %steal     %idle
02:20:01 PM     all     26.18      0.00      2.90      0.63 0.00     70.29
02:30:01 PM     all      1.66      0.00      1.25      1.05 0.00     96.04
02:40:01 PM     all     10.01      0.00      2.14      0.75 0.00     87.11
02:50:01 PM     all      0.76      0.00      0.80      1.03 0.00     97.40
03:00:01 PM     all      0.23      0.00      0.30      0.71 0.00     98.76
03:10:01 PM     all      0.22      0.00      0.30      0.66 0.00     98.82
03:20:01 PM     all      0.22      0.00      0.31      0.76 0.00     98.71
03:30:01 PM     all      0.24      0.00      0.31      0.64 0.00     98.81
03:40:01 PM     all      1.13      0.00      2.97      1.18 0.00     94.73
Average:        all      3.86      0.00      1.38      0.88 0.00     93.87

]$ iostat
Linux 2.6.32-220.7.1.el6.x86_64 (PSLBHDN002)     03/10/2013     _x86_64_    (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            1.86    0.00    0.96    0.78    0.00   96.41

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read Blk_wrtn
sda               1.23        20.26        23.53  521533196 605566924
sdb               6.51       921.55       241.90 23717850730 6225863488
sdc               6.22       921.83       236.41 23725181162 6084471192
sdd               6.25       925.13       237.26 23810004970 6106357880
sde               6.19       913.90       235.60 23521108818 6063722504
sdh               6.26       933.08       237.77 24014594546 6119511376
sdg               6.18       914.36       235.31 23532747378 6056257016
sdf               6.24       923.66       235.33 23772251810 6056604008

Some more logging which reinforces that the RS crash is caused by the
timeout. However, this time the GC log does not show a long pause.

#####RS LOG#####
2013-03-10 15:37:46,712 INFO org.apache.zookeeper.ClientCnxn: Client 
session timed out, have not heard from server in 257739ms for sessionid 
0x13d3c4bcba6014a, closing socket connection and attempting reconnect
2013-03-10 15:37:46,712 INFO org.apache.zookeeper.ClientCnxn: Client 
session timed out, have not heard from server in 226785ms for sessionid 
0x13d3c4bcba60149, closing socket connection and attempting reconnect
2013-03-10 15:37:46,712 DEBUG 
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=61.91 MB, 
free=1.94 GB, max=2 GB, blocks=1254, accesses=60087, hits=58811, 
hitRatio=97.87%, , cachingAccesses=60069, cachingHits=58811, 
cachingHitsRatio=97.90%, , evictions=0, evicted=0, evictedPerRun=NaN
2013-03-10 15:37:46,712 WARN org.apache.hadoop.hbase.util.Sleeper: We 
slept 225100ms instead of 3000ms, this is likely due to a long garbage 
collecting pause and it's usually bad, see 
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-03-10 15:37:46,714 WARN org.apache.hadoop.hdfs.DFSClient: 
DFSOutputStream ResponseProcessor exception  for block 
BP-43236042-172.17.2.10-1362490844340:blk_-6834190810033122569_25150229
java.io.EOFException: Premature EOF: no length prefix available
         at 
org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171)
         at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:114)
         at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:670)
2013-03-10 15:37:46,716 ERROR 
org.apache.hadoop.hbase.regionserver.HRegionServer:
org.apache.hadoop.hbase.ipc.CallerDisconnectedException: Aborting call 
get([B@7caf69ed, 
{"timeRange":[0,9223372036854775807],"totalColumns":1,"cacheBlocks":true,"families":{"details":["ALL"]},"maxVersions":1,"row":"\\x00\\x00\\x00\\x00\\x00\\x12\\x93@"}),

rpc version=1, client version=29, methodsFingerPrint=1891768260 from 
172.17.1.71:51294 after 0 ms, since caller disconnected
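The Sleeper warning above (225100 ms slept instead of 3000 ms) is far longer than the ZooKeeper session timeout, so the expiry is expected. Raising `zookeeper.session.timeout` in hbase-site.xml only buys headroom and does not fix the underlying pause; for reference, a fragment (the 120000 ms value is just an example, and the ZooKeeper server caps negotiated timeouts at 20x its tickTime):

```xml
<!-- hbase-site.xml: example only; the effective timeout must fit within
     the ZK server's negotiable range (2x to 20x the ZooKeeper tickTime). -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
```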


#####GC LOG#####
2716.635: [GC 2716.635: [ParNew: 57570K->746K(59008K), 0.0046530 secs] 
354857K->300891K(1152704K), 0.0047370 secs] [Times: user=0.03 sys=0.00, 
real=0.00 secs]
2789.478: [GC 2789.478: [ParNew: 53226K->1192K(59008K), 0.0041370 secs] 
353371K->301337K(1152704K), 0.0042220 secs] [Times: user=0.04 sys=0.00, 
real=0.00 secs]
2868.435: [GC 2868.435: [ParNew: 53672K->740K(59008K), 0.0041570 secs] 
353817K->300886K(1152704K), 0.0042440 secs] [Times: user=0.03 sys=0.00, 
real=0.01 secs]
2920.309: [GC 2920.309: [ParNew: 53220K->6202K(59008K), 0.0058440 secs] 
353366K->310443K(1152704K), 0.0059410 secs] [Times: user=0.05 sys=0.00, 
real=0.01 secs]
2984.239: [GC 2984.239: [ParNew: 58680K->5282K(59008K), 0.0048070 secs] 
362921K->310977K(1152704K), 0.0049180 secs] [Times: user=0.03 sys=0.00, 
real=0.01 secs]
3050.620: [GC 3050.620: [ParNew: 57524K->5292K(59008K), 0.0055510 secs] 
363219K->313384K(1152704K), 0.0056520 secs] [Times: user=0.04 sys=0.00, 
real=0.01 secs]
3111.331: [GC 3111.331: [ParNew: 57542K->5637K(59008K), 0.0056900 secs] 
365634K->318079K(1152704K), 0.0057950 secs] [Times: user=0.03 sys=0.00, 
real=0.01 secs]
3163.510: [GC 3163.510: [ParNew: 58117K->1138K(59008K), 0.0046540 secs] 
370559K->316086K(1152704K), 0.0047470 secs] [Times: user=0.03 sys=0.00, 
real=0.01 secs]
3217.642: [GC 3217.643: [ParNew: 53618K->1001K(59008K), 0.0038530 secs] 
368566K->315949K(1152704K), 0.0039450 secs] [Times: user=0.03 sys=0.00, 
real=0.00 secs]
3464.067: [GC 3464.067: [ParNew: 53481K->5348K(59008K), 0.0068370 secs] 
368429K->322343K(1152704K), 0.0069170 secs] [Times: user=0.07 sys=0.01, 
real=0.00 secs]
3465.648: [GC 3465.648: [ParNew: 57828K->1609K(59008K), 0.0111520 secs] 
374823K->321496K(1152704K), 0.0112170 secs] [Times: user=0.07 sys=0.00, 
real=0.01 secs]
3466.291: [GC 3466.291: [ParNew: 54089K->2186K(59008K), 0.0093190 secs] 
373976K->322073K(1152704K), 0.0093800 secs] [Times: user=0.06 sys=0.00, 
real=0.01 secs]
Heap
  par new generation   total 59008K, used 15497K [0x00000005fae00000, 
0x00000005fee00000, 0x00000005fee00000)
   eden space 52480K,  25% used [0x00000005fae00000, 0x00000005fbaff8c8, 
0x00000005fe140000)
   from space 6528K,  33% used [0x00000005fe140000, 0x00000005fe362b70, 
0x00000005fe7a0000)
   to   space 6528K,   0% used [0x00000005fe7a0000, 0x00000005fe7a0000, 
0x00000005fee00000)
  concurrent mark-sweep generation total 1093696K, used 319887K 
[0x00000005fee00000, 0x0000000641a10000, 0x00000007fae00000)
  concurrent-mark-sweep perm gen total 30464K, used 30427K 
[0x00000007fae00000, 0x00000007fcbc0000, 0x0000000800000000)
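The ParNew entries above all complete in milliseconds, unlike the 137-second pause quoted further down. Outliers like that can be flagged automatically; a minimal sketch, assuming the `real=N.NN secs` format shown in these logs (the 1-second threshold and log path are arbitrary):

```shell
# gc_long_pauses: print GC log lines whose wall-clock pause ("real=")
# exceeds one second; everything else is filtered out.
gc_long_pauses() {
  awk -F'real=' 'NF > 1 { split($2, a, " "); if (a[1] + 0 > 1.0) print }'
}

# Typical use against the region server GC log (path is an assumption):
#   gc_long_pauses < /var/log/hbase/gc-regionserver.log
```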


On 03/08/2013 07:02 PM, Stack wrote:
> On Fri, Mar 8, 2013 at 10:58 AM, Pablo Musa <pablo@psafe.com> wrote:
>
>> 0.94 currently doesn't support hadoop 2.0
>>> Can you deploy hadoop 1.1.1 instead ?
>>>
>> I am using cdh4.2.0, which uses this version as the default installation.
>> I think it will be a problem for me to deploy 1.1.1 because I would need to
>> "upgrade" the whole cluster with 70TB of data (backup everything, go
>> offline, etc.).
>>
>> Is there a problem with using cdh4.2.0?
>> Should I send my email to the cdh list?
>>
>>
> That combo should be fine.
>
>
>>   You Full GC'ing around this time?
>> The GC shows it took a long time. However it does not make any sense
>> for it to be the cause, since the same amount of data was cleaned before
>> and AFTER in just 0.01 secs!
>>
>>
> If JVM is full GC'ing, the application is stopped.
>
>
>> [Times: user=0.08 sys=137.62, real=137.62 secs]
>>
>> Besides the whole time was used by system. That is what is bugging me.
>>
>>
> The below does not look like a full GC but that is a long pause in system
> time, enough to kill your zk session.
>
> You swapping?
>
> Hardware is good?
>
> St.Ack
>
>
>
>>   ...
>>
>>
>> 1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K), 0.0040990 secs]
>> 275097K->216577K(1152704K), 0.0041820 secs] [Times: user=0.03 sys=0.00,
>> real=0.01 secs]
>>
>> 1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 secs]
>> 269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04 sys=0.01,
>> real=0.00 secs]
>>
>> 1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620
>> secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times: user=0.08
>> sys=137.62, real=137.62 secs]
>>
>> 1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960 secs]
>> 287577K->230937K(1152704K), 0.0080770 secs] [Times: user=0.05 sys=0.00,
>> real=0.01 secs]
>>
>> 1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650 secs]
>> 283417K->231420K(1152704K), 0.0096340 secs] [Times: user=0.06 sys=0.00,
>> real=0.01 secs]
>>
>>
>> I really appreciate you guys helping me to find out what is wrong.
>>
>> Thanks,
>> Pablo
>>
>>
>>
>> On 03/08/2013 02:11 PM, Stack wrote:
>>
>>> What RAM says.
>>>
>>> 2013-03-07 17:24:57,887 INFO org.apache.zookeeper.ClientCnxn: Client
>>> session timed out, have not heard from server in 159348ms for sessionid
>>> 0x13d3c4bcba600a7, closing socket connection and attempting reconnect
>>>
>>> You Full GC'ing around this time?
>>>
>>> Put up your configs in a place where we can take a look?
>>>
>>> St.Ack
>>>
>>>
>>> On Fri, Mar 8, 2013 at 8:32 AM, ramkrishna vasudevan <
>>> ramkrishna.s.vasudevan@gmail.com> wrote:
>>>
>>>> I think it is with your GC config.  What is your heap size?  What is the
>>>> data that you pump in and how much is the block cache size?
>>>>
>>>> Regards
>>>> Ram
>>>>
>>>> On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>
>>>>> 0.94 currently doesn't support hadoop 2.0
>>>>> Can you deploy hadoop 1.1.1 instead ?
>>>>>
>>>>> Are you using 0.94.5 ?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <pablo@psafe.com> wrote:
>>>>>
>>>>>> Hey guys,
>>>>>> as I sent in an email a long time ago, the RSs in my cluster did not get
>>>>>> along and crashed 3 times a day. I tried a lot of options we discussed in
>>>>>> the emails, but none of them solved the problem. As I used an old version
>>>>>> of hadoop, I thought that was the problem.
>>>>>>
>>>>>> So, I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper 3.3.5 to
>>>>>> hadoop 2.0.0 - hbase 0.94 - zookeeper 3.4.5.
>>>>>>
>>>>>> Unfortunately the RSs did not stop crashing, and worse! Now they crash
>>>>>> every hour, and sometimes when the RS that holds the .ROOT. crashes, the
>>>>>> whole cluster gets stuck in transition and everything stops working.
>>>>>> In this case I need to clean the zookeeper znodes and restart the master
>>>>>> and the RSs.
>>>>>> To avoid this case I am running in production with only ONE RS and a
>>>>>> monitoring script that checks every minute whether the RS is ok and, if
>>>>>> not, restarts it.
>>>>>> * This case does not get the cluster stuck.
>>>>>>
>>>>>> This is driving me crazy, but I really can't find a solution for the
>>>>>> cluster. I tracked all logs from the start time 16:49 on all interesting
>>>>>> nodes (zoo, namenode, master, rs, dn2, dn9, dn10) and copied here what I
>>>>>> think is useful.
>>>>>>
>>>>>> There are some strange errors in DATANODE2, such as an error copying a
>>>>>> block to itself.
>>>>>>
>>>>>> The gc log points to a GC timeout. However it is very weird that the RS
>>>>>> spends so much time in GC while in the other cases it takes 0.001 sec.
>>>>>> Besides, the time spent is in sys, which makes me think the problem might
>>>>>> be somewhere else.
>>>>>>
>>>>>> I know that this is a bunch of logs, and that it is very difficult to find
>>>>>> the problem without much context. But I REALLY need some help. If not the
>>>>>> solution, at least what I should read, where I should look, or which cases
>>>>>> I should monitor.
>>>>>>
>>>>>> Thank you very much,
>>>>>> Pablo Musa
>>>>>>
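The per-minute restart script mentioned in the quoted message can be sketched as a generic watchdog. This is a sketch only; the health-check and restart commands below are assumptions, not the poster's actual script:

```shell
# watchdog: run a restart command when a health-check command fails.
#   $1: command that exits 0 while the region server is healthy
#   $2: command that restarts it
watchdog() {
  if ! $1 > /dev/null 2>&1; then
    $2
  fi
}

# From a one-minute cron entry, wrapped in a script (commands and
# paths are assumptions):
#   watchdog 'pgrep -f HRegionServer' '/usr/lib/hbase/bin/hbase-daemon.sh start regionserver'
```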

