hbase-user mailing list archives

From Sreepathi <prasanna.sreepa...@gmail.com>
Subject Re: RegionServers Crashing every hour in production env
Date Sun, 10 Mar 2013 19:06:20 GMT
Hi Stack/Ted/Pablo,

Should we increase the hbase.rpc.timeout property to 5 minutes (300000 ms)?

Regards,
- Sreepathi
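
For reference, the property in question lives in hbase-site.xml on both clients
and servers. A minimal sketch of the change proposed above, assuming the
0.94-era default of 60000 ms (values are illustrative, not a tested
recommendation):

    <property>
      <name>hbase.rpc.timeout</name>
      <value>300000</value> <!-- 5 minutes, as proposed above -->
    </property>

Note that a longer RPC timeout only makes callers wait longer; it does not
address whatever is pausing the RegionServer in the first place.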

On Sun, Mar 10, 2013 at 11:59 AM, Pablo Musa <pablo@psafe.com> wrote:

> > That combo should be fine.
>
> Great!!
>
>
> > If JVM is full GC'ing, the application is stopped.
> > The below does not look like a full GC but that is a long pause in system
> > time, enough to kill your zk session.
>
> Exactly. This pause is really making zk expire the RS session, which shuts
> down the RS (logs at the end of the email).
> But the question is: what is causing this pause??!!
>
> > You swapping?
>
> I don't think so (stats below).
>
> > Hardware is good?
>
> Yes, it is a 16-processor machine with 74GB of RAM and plenty of disk space.
> Below are some of the metrics I have heard about. I hope they help.
>
>
> ** I am having some problems with the datanodes[1], which are having trouble
> writing. I really think the issues are related, but I cannot solve either of
> them :(
>
> Thanks again,
> Pablo
>
> [1] http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201303.mbox/%3CCAJzooYfS-F1KS+jGOPUt15PwFjcCSzigE0APeM9FXaCrQfVbyQ@mail.gmail.com%3E
>
> top - 15:38:04 up 297 days, 21:03,  2 users,  load average: 4.34, 2.55, 1.28
> Tasks: 528 total,   1 running, 527 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.1%us,  0.2%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  74187256k total, 29493992k used, 44693264k free,  5836576k buffers
> Swap: 51609592k total,   128312k used, 51481280k free,  1353400k cached
>
> ]$ vmstat -w
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
>  2  0     128312   32416928    5838288    5043560    0    0   202    53    0    0   2  1  96  1  0
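
For what it's worth, si/so above are zero and only ~128MB of swap has ever been
used, which supports "not swapping". A few hedged spot-checks can confirm it
for the RS process itself (paths are assumptions and vary by distro; RHEL 6
puts THP under /sys/kernel/mm/redhat_transparent_hugepage):

]$ vmstat 5 5    # the first report is the since-boot average; watch si/so in the later samples
]$ grep VmSwap /proc/$(pgrep -f HRegionServer | head -1)/status    # swap held by the RS JVM, if the kernel exposes VmSwap
]$ cat /sys/kernel/mm/transparent_hugepage/enabled    # THP defrag is a known source of long sys-time stalls
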
>
> ]$ sar
>                 CPU     %user     %nice   %system   %iowait    %steal     %idle
> 02:20:01 PM     all     26.18      0.00      2.90      0.63      0.00     70.29
> 02:30:01 PM     all      1.66      0.00      1.25      1.05      0.00     96.04
> 02:40:01 PM     all     10.01      0.00      2.14      0.75      0.00     87.11
> 02:50:01 PM     all      0.76      0.00      0.80      1.03      0.00     97.40
> 03:00:01 PM     all      0.23      0.00      0.30      0.71      0.00     98.76
> 03:10:01 PM     all      0.22      0.00      0.30      0.66      0.00     98.82
> 03:20:01 PM     all      0.22      0.00      0.31      0.76      0.00     98.71
> 03:30:01 PM     all      0.24      0.00      0.31      0.64      0.00     98.81
> 03:40:01 PM     all      1.13      0.00      2.97      1.18      0.00     94.73
> Average:        all      3.86      0.00      1.38      0.88      0.00     93.87
>
> ]$ iostat
> Linux 2.6.32-220.7.1.el6.x86_64 (PSLBHDN002)     03/10/2013     _x86_64_     (16 CPU)
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            1.86    0.00    0.96    0.78    0.00   96.41
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read Blk_wrtn
> sda               1.23        20.26        23.53  521533196 605566924
> sdb               6.51       921.55       241.90 23717850730 6225863488
> sdc               6.22       921.83       236.41 23725181162 6084471192
> sdd               6.25       925.13       237.26 23810004970 6106357880
> sde               6.19       913.90       235.60 23521108818 6063722504
> sdh               6.26       933.08       237.77 24014594546 6119511376
> sdg               6.18       914.36       235.31 23532747378 6056257016
> sdf               6.24       923.66       235.33 23772251810 6056604008
>
> Some more logging that reinforces that the RS crash is happening because of a
> timeout. However, this time the GC log does not show a long pause.
>
> #####RS LOG#####
> 2013-03-10 15:37:46,712 INFO org.apache.zookeeper.ClientCnxn: Client session
> timed out, have not heard from server in 257739ms for sessionid
> 0x13d3c4bcba6014a, closing socket connection and attempting reconnect
> 2013-03-10 15:37:46,712 INFO org.apache.zookeeper.ClientCnxn: Client session
> timed out, have not heard from server in 226785ms for sessionid
> 0x13d3c4bcba60149, closing socket connection and attempting reconnect
> 2013-03-10 15:37:46,712 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> Stats: total=61.91 MB, free=1.94 GB, max=2 GB, blocks=1254, accesses=60087,
> hits=58811, hitRatio=97.87%, , cachingAccesses=60069, cachingHits=58811,
> cachingHitsRatio=97.90%, , evictions=0, evicted=0, evictedPerRun=NaN
> 2013-03-10 15:37:46,712 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
> 225100ms instead of 3000ms, this is likely due to a long garbage collecting
> pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2013-03-10 15:37:46,714 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream
> ResponseProcessor exception for block
> BP-43236042-172.17.2.10-1362490844340:blk_-6834190810033122569_25150229
>
> java.io.EOFException: Premature EOF: no length prefix available
>         at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:114)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:670)
> 2013-03-10 15:37:46,716 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.ipc.CallerDisconnectedException: Aborting call
> get([B@7caf69ed, {"timeRange":[0,9223372036854775807],"totalColumns":1,"cacheBlocks":true,"families":{"details":["ALL"]},"maxVersions":1,"row":"\\x00\\x00\\x00\\x00\\x00\\x12\\x93@"}),
> rpc version=1, client version=29, methodsFingerPrint=1891768260 from
> 172.17.1.71:51294 after 0 ms, since caller disconnected
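
The sessions above expired after ~227-258s of silence, well past the 0.94-era
default zookeeper.session.timeout of 180000 ms. Raising it masks pauses rather
than fixing them, but as a hedged sketch (the ZooKeeper servers must also be
willing to grant sessions this long):

    <!-- hbase-site.xml -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>300000</value>
    </property>

    # zoo.cfg on the quorum; maxSessionTimeout defaults to 20 * tickTime
    # (40s with the default tickTime=2000)
    maxSessionTimeout=300000
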
>
>
> #####GC LOG#####
> 2716.635: [GC 2716.635: [ParNew: 57570K->746K(59008K), 0.0046530 secs] 354857K->300891K(1152704K), 0.0047370 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
> 2789.478: [GC 2789.478: [ParNew: 53226K->1192K(59008K), 0.0041370 secs] 353371K->301337K(1152704K), 0.0042220 secs] [Times: user=0.04 sys=0.00, real=0.00 secs]
> 2868.435: [GC 2868.435: [ParNew: 53672K->740K(59008K), 0.0041570 secs] 353817K->300886K(1152704K), 0.0042440 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
> 2920.309: [GC 2920.309: [ParNew: 53220K->6202K(59008K), 0.0058440 secs] 353366K->310443K(1152704K), 0.0059410 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
> 2984.239: [GC 2984.239: [ParNew: 58680K->5282K(59008K), 0.0048070 secs] 362921K->310977K(1152704K), 0.0049180 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
> 3050.620: [GC 3050.620: [ParNew: 57524K->5292K(59008K), 0.0055510 secs] 363219K->313384K(1152704K), 0.0056520 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
> 3111.331: [GC 3111.331: [ParNew: 57542K->5637K(59008K), 0.0056900 secs] 365634K->318079K(1152704K), 0.0057950 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
> 3163.510: [GC 3163.510: [ParNew: 58117K->1138K(59008K), 0.0046540 secs] 370559K->316086K(1152704K), 0.0047470 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
> 3217.642: [GC 3217.643: [ParNew: 53618K->1001K(59008K), 0.0038530 secs] 368566K->315949K(1152704K), 0.0039450 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
> 3464.067: [GC 3464.067: [ParNew: 53481K->5348K(59008K), 0.0068370 secs] 368429K->322343K(1152704K), 0.0069170 secs] [Times: user=0.07 sys=0.01, real=0.00 secs]
> 3465.648: [GC 3465.648: [ParNew: 57828K->1609K(59008K), 0.0111520 secs] 374823K->321496K(1152704K), 0.0112170 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
> 3466.291: [GC 3466.291: [ParNew: 54089K->2186K(59008K), 0.0093190 secs] 373976K->322073K(1152704K), 0.0093800 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
> Heap
>  par new generation   total 59008K, used 15497K [0x00000005fae00000, 0x00000005fee00000, 0x00000005fee00000)
>   eden space 52480K,  25% used [0x00000005fae00000, 0x00000005fbaff8c8, 0x00000005fe140000)
>   from space 6528K,  33% used [0x00000005fe140000, 0x00000005fe362b70, 0x00000005fe7a0000)
>   to   space 6528K,   0% used [0x00000005fe7a0000, 0x00000005fe7a0000, 0x00000005fee00000)
>  concurrent mark-sweep generation total 1093696K, used 319887K [0x00000005fee00000, 0x0000000641a10000, 0x00000007fae00000)
>  concurrent-mark-sweep perm gen total 30464K, used 30427K [0x00000007fae00000, 0x00000007fcbc0000, 0x0000000800000000)
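
All the ParNew pauses above are a few milliseconds, so the 225s stall did not
happen inside a collection this log records. Safepoint statistics would show
whether the JVM stalled at a stop-the-world point outside GC. A hedged sketch
for the CMS-era HotSpot JVM used here, via hbase-env.sh (the log path is an
assumption):

export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime \
  -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 \
  -Xloggc:/var/log/hbase/regionserver-gc.log"

-XX:+PrintGCApplicationStoppedTime in particular records every stop-the-world
pause, GC or not, which is exactly the kind of pause the Sleeper warning above
is complaining about.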
>
>
>
> On 03/08/2013 07:02 PM, Stack wrote:
>
>> On Fri, Mar 8, 2013 at 10:58 AM, Pablo Musa <pablo@psafe.com> wrote:
>>
>>>> 0.94 currently doesn't support hadoop 2.0
>>>> Can you deploy hadoop 1.1.1 instead ?
>>>
>>> I am using cdh4.2.0, which uses this version as the default installation.
>>> I think it will be a problem for me to deploy 1.1.1 because I would need
>>> to "upgrade" the whole cluster with 70TB of data (backup everything, go
>>> offline, etc.).
>>>
>>> Is there a problem with using cdh4.2.0?
>>> Should I send my email to the CDH list?
>>
>> That combo should be fine.
>>
>>
>>>> You Full GC'ing around this time?
>>>
>>> The GC shows it took a long time. However, it does not make sense for GC
>>> to be the cause, since the same amount of data was cleaned before and
>>> AFTER in just 0.01 secs!
>>
>> If JVM is full GC'ing, the application is stopped.
>>
>>> [Times: user=0.08 sys=137.62, real=137.62 secs]
>>>
>>> Besides, the whole time was used by the system. That is what is bugging me.
>>
>> The below does not look like a full GC but that is a long pause in system
>> time, enough to kill your zk session.
>>
>> You swapping?
>>
>> Hardware is good?
>>
>> St.Ack
>>
>>
>>
>>> ...
>>>
>>> 1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K), 0.0040990 secs] 275097K->216577K(1152704K), 0.0041820 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
>>> 1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 secs] 269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04 sys=0.01, real=0.00 secs]
>>> 1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620 secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times: user=0.08 sys=137.62, real=137.62 secs]
>>> 1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960 secs] 287577K->230937K(1152704K), 0.0080770 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
>>> 1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650 secs] 283417K->231420K(1152704K), 0.0096340 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
>>>
>>>
>>> I really appreciate you guys helping me to find out what is wrong.
>>>
>>> Thanks,
>>> Pablo
>>>
>>>
>>>
>>> On 03/08/2013 02:11 PM, Stack wrote:
>>>
>>>> What RAM says.
>>>>
>>>> 2013-03-07 17:24:57,887 INFO org.apache.zookeeper.ClientCnxn: Client
>>>> session timed out, have not heard from server in 159348ms for sessionid
>>>> 0x13d3c4bcba600a7, closing socket connection and attempting reconnect
>>>>
>>>> You Full GC'ing around this time?
>>>>
>>>> Put up your configs in a place where we can take a look?
>>>>
>>>> St.Ack
>>>>
>>>> On Fri, Mar 8, 2013 at 8:32 AM, ramkrishna vasudevan
>>>> <ramkrishna.s.vasudevan@gmail.com> wrote:
>>>>
>>>>> I think it is with your GC config. What is your heap size? What is the
>>>>> data that you pump in, and how much is the block cache size?
>>>>>
>>>>> Regards
>>>>> Ram
>>>>>
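
Ram's questions map to two standard knobs; a hedged sketch of where they live,
with illustrative values only (era defaults: 1000 MB heap, 0.25 block cache):

    # hbase-env.sh
    export HBASE_HEAPSIZE=8000    # RegionServer heap, in MB; 8000 is illustrative

    <!-- hbase-site.xml -->
    <property>
      <name>hfile.block.cache.size</name>
      <value>0.25</value>    <!-- fraction of heap given to the block cache -->
    </property>

The heap summary earlier in the thread shows a young generation of only ~59MB,
which is worth checking against these settings given the 74GB of RAM available.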
>>>>> On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>
>>>>>> 0.94 currently doesn't support hadoop 2.0
>>>>>> Can you deploy hadoop 1.1.1 instead ?
>>>>>>
>>>>>> Are you using 0.94.5 ?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <pablo@psafe.com> wrote:
>>>>>>
>>>>>>> Hey guys,
>>>>>>>
>>>>>>> as I sent in an email a long time ago, the RSs in my cluster did not
>>>>>>> get along and crashed 3 times a day. I tried a lot of the options we
>>>>>>> discussed in the emails, but none of them solved the problem. As I was
>>>>>>> using an old version of hadoop, I thought that was the problem.
>>>>>>>
>>>>>>> So, I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper 3.3.5 to
>>>>>>> hadoop 2.0.0 - hbase 0.94 - zookeeper 3.4.5.
>>>>>>>
>>>>>>> Unfortunately the RSs did not stop crashing, and worse! Now they crash
>>>>>>> every hour, and sometimes when the RS that holds the .ROOT. crashes,
>>>>>>> the whole cluster gets stuck in transition and everything stops
>>>>>>> working. In this case I need to clean the zookeeper znodes and restart
>>>>>>> the master and the RSs.
>>>>>>>
>>>>>>> To avoid this case I am running in production with only ONE RS and a
>>>>>>> monitoring script that checks every minute whether the RS is OK and,
>>>>>>> if not, restarts it.
>>>>>>> * This case does not get the cluster stuck.
>>>>>>>
>>>>>>> This is driving me crazy, but I really can't find a solution for the
>>>>>>> cluster. I tracked all logs from the start time 16:49 from all the
>>>>>>> interesting nodes (zoo, namenode, master, rs, dn2, dn9, dn10) and
>>>>>>> copied here what I think is useful.
>>>>>>>
>>>>>>> There are some strange errors in DATANODE2, such as an error copying a
>>>>>>> block to itself.
>>>>>>>
>>>>>>> The gc log points to a GC timeout. However, it is very weird that the
>>>>>>> RS spends so much time in GC in this one case while in the other cases
>>>>>>> it takes 0.001 sec. Besides, the time spent is in sys, which makes me
>>>>>>> think the problem might be in another place.
>>>>>>>
>>>>>>> I know that this is a bunch of logs and that it is very difficult to
>>>>>>> find the problem without much context. But I REALLY need some help.
>>>>>>> If not the solution, then at least what I should read, where I should
>>>>>>> look, or which cases I should monitor.
>>>>>>>
>>>>>>> Thank you very much,
>>>>>>> Pablo Musa
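
The watchdog Pablo describes could be as small as the sketch below, run from
cron every minute. The install path and log location are assumptions, not
details from this thread:

    #!/bin/sh
    # Hypothetical RS watchdog: restart the RegionServer if its JVM is gone.
    if ! pgrep -f org.apache.hadoop.hbase.regionserver.HRegionServer >/dev/null; then
        echo "$(date): RegionServer not running, restarting" >> /var/log/hbase/rs-watchdog.log
        /usr/lib/hbase/bin/hbase-daemon.sh start regionserver
    fi
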
>


--
Regards,
Sreepathi
