hbase-user mailing list archives

From Pablo Musa <pa...@psafe.com>
Subject Re: RegionServers Crashing every hour in production env
Date Fri, 08 Mar 2013 18:58:24 GMT
> 0.94 currently doesn't support hadoop 2.0
> Can you deploy hadoop 1.1.1 instead ?

I am using cdh4.2.0, which ships this version as the default installation.
I think it would be a problem for me to deploy 1.1.1 because I would need to
"upgrade" the whole cluster with 70TB of data (back everything up, go offline, etc.).

Is there a problem with using cdh4.2.0?
Should I send my email to the CDH list instead?

> Are you using 0.94.5 ?

I am using 0.94.2.

> I think it is with your GC config.  What is your heap size?  What is the
> data that you pump in and how much is the block cache size?

#JVM config:
export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m -XX:+UseConcMarkSweepGC -XX:MaxDirectMemorySize=2G
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/logs/hbase/gc-hbase.log"

# heap size
export HBASE_HEAPSIZE=8192
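
If it would help, I could also turn on per-pause stop-time logging. A minimal
sketch of the extra flags I have in mind (standard HotSpot options, not
something I have enabled yet):

# possible extra GC diagnostics (not currently set)
export HBASE_OPTS="$HBASE_OPTS -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime"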

#hbase metrics
requestsPerSecond=8, numberOfOnlineRegions=1252, numberOfStores=1272, numberOfStorefiles=1651,
storefileIndexSizeMB=66, rootIndexSizeKB=68176, totalStaticIndexSizeKB=55028, totalStaticBloomSizeKB=0,
memstoreSizeMB=3, mbInMemoryWithoutWAL=0, numberOfPutsWithoutWAL=0, readRequestsCount=1176287,
writeRequestsCount=2165, compactionQueueSize=0, flushQueueSize=0, usedHeapMB=328, maxHeapMB=8185,
blockCacheSizeMB=117.94, blockCacheFreeMB=1928.47, blockCacheCount=2083, blockCacheHitCount=34815,
blockCacheMissCount=10259, blockCacheEvictedCount=17, blockCacheHitRatio=77%, blockCacheHitCachingRatio=94%,
hdfsBlocksLocalityIndex=65, slowHLogAppendCount=0, fsReadLatencyHistogramMean=0, fsReadLatencyHistogramCount=0,
fsReadLatencyHistogramMedian=0, fsReadLatencyHistogram75th=0, fsReadLatencyHistogram95th=0,
fsReadLatencyHistogram99th=0, fsReadLatencyHistogram999th=0, fsPreadLatencyHistogramMean=0,
fsPreadLatencyHistogramCount=0, fsPreadLatencyHistogramMedian=0, fsPreadLatencyHistogram75th=0,
fsPreadLatencyHistogram95th=0, fsPreadLatencyHistogram99th=0, fsPreadLatencyHistogram999th=0,
fsWriteLatencyHistogramMean=0, fsWriteLatencyHistogramCount=0, fsWriteLatencyHistogramMedian=0,
fsWriteLatencyHistogram75th=0, fsWriteLatencyHistogram95th=0, fsWriteLatencyHistogram99th=0,
fsWriteLatencyHistogram999th=0

#hbase-site.xml
   <property>
       <name>hbase.hregion.memstore.mslab.enabled</name>
       <value>true</value>
   </property>
   <property>
       <name>hbase.regionserver.handler.count</name>
       <value>20</value>
   </property>

All the other parameters, for both HBase and Hadoop, are at their defaults.
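
To answer the block cache question: assuming the 0.94 default of
hfile.block.cache.size = 0.25, the cache gets roughly a quarter of the heap,
which matches the metrics above:

# rough block cache math (hedged on the default being 0.25)
#   8185 MB (maxHeapMB) * 0.25                              ~= 2046 MB
#   blockCacheSizeMB (117.94) + blockCacheFreeMB (1928.47)  ~= 2046 MB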

Four tables with this same configuration.
{NAME => 'T1', FAMILIES => [{NAME => 'details', BLOOMFILTER => 'NONE', REPLICATION_SCOPE
=> '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE
=> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

Rows from one table vary from 4 KB to 50 KB, while rows from the other three
usually vary from 60 bytes to 300 bytes.

> You Full GC'ing around this time?

The GC log shows that this pause took a long time. However, it does not make
sense for GC itself to be the cause, since the same amount of data was
collected both before and after it in just 0.01 secs!

[Times: user=0.08 sys=137.62, real=137.62 secs]

Besides, the whole time was spent in system (sys) time. That is what is bugging me.
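
For reference, my reading of that line (just the standard HotSpot GC log fields):

# user=0.08   -> CPU time spent in user mode by the GC threads
# sys=137.62  -> CPU time spent in the kernel on behalf of the process
# real=137.62 -> wall-clock duration of the pause
# i.e. the 137s pause was almost entirely kernel time, not actual collection work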

  ...

1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K), 0.0040990 secs]
275097K->216577K(1152704K), 0.0041820 secs] [Times: user=0.03 sys=0.00,
real=0.01 secs]

1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 secs]
269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04 sys=0.01,
real=0.00 secs]

1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620
secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times: user=0.08
sys=137.62, real=137.62 secs]

1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960 secs]
287577K->230937K(1152704K), 0.0080770 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]

1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650 secs]
283417K->231420K(1152704K), 0.0096340 secs] [Times: user=0.06 sys=0.00,
real=0.01 secs]


I really appreciate you guys helping me to find out what is wrong.

Thanks,
Pablo


On 03/08/2013 02:11 PM, Stack wrote:
> What RAM says.
>
> 2013-03-07 17:24:57,887 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 159348ms for sessionid
> 0x13d3c4bcba600a7, closing socket connection and attempting reconnect
>
> You Full GC'ing around this time?
>
> Put up your configs in a place where we can take a look?
>
> St.Ack
>
>
> On Fri, Mar 8, 2013 at 8:32 AM, ramkrishna vasudevan <
> ramkrishna.s.vasudevan@gmail.com> wrote:
>
>> I think it is with your GC config.  What is your heap size?  What is the
>> data that you pump in and how much is the block cache size?
>>
>> Regards
>> Ram
>>
>> On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>
>>> 0.94 currently doesn't support hadoop 2.0
>>>
>>> Can you deploy hadoop 1.1.1 instead ?
>>>
>>> Are you using 0.94.5 ?
>>>
>>> Thanks
>>>
>>> On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <pablo@psafe.com> wrote:
>>>
>>>> Hey guys,
>>>> as I sent in an email a long time ago, the RSs in my cluster did not get
>>>> along and crashed 3 times a day. I tried a lot of options we discussed in the
>>>> emails, but it did not solve the problem. As I used an old version of hadoop I
>>>> thought this was the problem.
>>>>
>>>> So, I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper 3.3.5 to hadoop 2.0.0
>>>> - hbase 0.94 - zookeeper 3.4.5.
>>>>
>>>> Unfortunately the RSs did not stop crashing, and worse! Now they crash every
>>>> hour, and sometimes when the RS that holds the .ROOT. crashes, the whole
>>>> cluster gets stuck in transition and everything stops working.
>>>> In this case I need to clean the zookeeper znodes and restart the master and
>>>> the RSs.
>>>> To avoid this case I am running in production with only ONE RS and a
>>>> monitoring script that checks every minute whether the RS is ok and, if not,
>>>> restarts it.
>>>> * This case does not get the cluster stuck.
>>>>
>>>> This is driving me crazy, but I really can't find a solution for the cluster.
>>>> I tracked all logs from the start time 16:49 on all the interesting nodes
>>>> (zoo, namenode, master, rs, dn2, dn9, dn10) and copied here what I think is
>>>> useful.
>>>>
>>>> There are some strange errors in DATANODE2, such as an error copying a block
>>>> to itself.
>>>>
>>>> The gc log points to a GC timeout. However, it is very weird that the RS
>>>> spends so much time in GC when in the other cases it takes 0.001 sec. Besides,
>>>> the time spent is in sys, which makes me think the problem might be somewhere
>>>> else.
>>>>
>>>> I know that it is a bunch of logs, and that it is very difficult to find the
>>>> problem without much context. But I REALLY need some help. If not the
>>>> solution, then at least what I should read, where I should look, or which
>>>> cases I should monitor.
>>>>
>>>> Thank you very much,
>>>> Pablo Musa
>>>>

