hbase-user mailing list archives

From Bill Sanchez <bill.sanchez2...@gmail.com>
Subject HBase Large Load Issue
Date Tue, 03 Dec 2013 23:45:16 GMT
Hello,

I am seeking some advice on an HBase issue.  I am trying to configure a
system that will eventually load and store approximately 50GB-80GB of
data daily.  The data consists of files that are roughly 3MB-5MB each,
with some reaching 20MB and some as small as 1MB.  The load job issues
roughly 20,000 puts to the same table, spread across an initial set of
20 pre-split regions on 20 region servers.  During the first load I see
some splitting (ending with around 50 regions), and in subsequent loads
the region count climbs much higher.
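
To make the shape of the load concrete, it is roughly equivalent to the
sketch below (table name, column family, row-key scheme, key range, and
payload are placeholders for illustration, not my real schema; the
actual job reads the 3MB-5MB files from disk).  Row keys are effectively
random, so the puts spread across all of the initial regions.

import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class LoadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Create the table pre-split into 20 regions over a placeholder key range.
    HTableDescriptor desc = new HTableDescriptor("file_table");
    desc.addFamily(new HColumnDescriptor("f"));
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc, Bytes.toBytes("00000000"), Bytes.toBytes("ffffffff"), 20);
    admin.close();

    // Roughly 20,000 puts, each carrying a value in the 3MB-5MB range.
    HTable table = new HTable(conf, "file_table");
    byte[] payload = new byte[4 * 1024 * 1024]; // stand-in for one file's bytes
    for (int i = 0; i < 20000; i++) {
      Put put = new Put(Bytes.toBytes(UUID.randomUUID().toString()));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("data"), payload);
      table.put(put);
    }
    table.close();
  }
}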

After running similarly sized loads four or five times, I start to see
the following behavior that I cannot explain.  The table in question has
VERSIONS=1, and some, but not all, of these test loads reuse the same
data.  Below is a summary of the behavior along with a few of the
configuration settings I have tried so far.

Environment:

HBase 0.94.13-security with Kerberos enabled
Zookeeper 3.4.5
Hadoop 1.0.4

Symptoms:

1.  Requests per second fall to 0 for all region servers
2.  Log files show socket timeout exceptions while waiting on scans of .META.
3.  Region servers sometimes eventually show up as dead
4.  Once HBase reaches a broken state, some regions remain stuck in
transition indefinitely
5.  All of these issues seem to happen around the time of major compaction
events

This issue seems to be sensitive to hbase.rpc.timeout: I increased it
significantly, but that only lengthened the amount of time before the
socket timeout exceptions appear.
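
For reference, hbase.rpc.timeout is a client-visible key; this is a
minimal sketch of overriding it programmatically (equivalent to setting
it in the client's hbase-site.xml; the table name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class RpcTimeoutSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Raise the client RPC timeout (milliseconds); the same key normally
    // lives in hbase-site.xml on both clients and servers.
    conf.setInt("hbase.rpc.timeout", 900000);
    HTable table = new HTable(conf, "file_table");
    // ... issue puts/scans that now wait longer before timing out ...
    table.close();
  }
}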

A few notes:

1.  I don't see heavy GC activity in the GC log.
2.  Snappy compression was originally enabled, but turning it off as a
test made no noticeable difference.
3.  The WAL is disabled for the table involved in the load (a minimal
sketch follows this list).
4.  TeraSort appears to run normally on HDFS.
5.  The HBase randomWrite and randomRead tests appear to run normally on
this cluster (although randomWrite does not write values anywhere close
to 3MB-5MB).
6.  Ganglia is available in my environment.
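
Regarding note 3 and the VERSIONS=1 point above, this is a minimal
sketch of how the column family and the puts are set up (names and
sizes are placeholders):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalAndVersionsSketch {
  public static void main(String[] args) {
    // Column family keeps a single version (VERSIONS=1).
    HColumnDescriptor cf = new HColumnDescriptor("f");
    cf.setMaxVersions(1);

    // Each put in the load skips the WAL (0.94 client API).
    Put put = new Put(Bytes.toBytes("some-row-key"));
    put.setWriteToWAL(false);
    put.add(Bytes.toBytes("f"), Bytes.toBytes("data"), new byte[4 * 1024 * 1024]);
  }
}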

Settings already altered:

1.  hbase.rpc.timeout=900000 (I realize this may be too high)
2.  hbase.regionserver.handler.count=100
3.  ipc.server.max.callqueue.size=10737418240
4.  hbase.regionserver.lease.period=900000
5.  hbase.hregion.majorcompaction=0 (I have been manually compacting
between loads with no difference in behavior; sketched after this list)
6.  hbase.hregion.memstore.flush.size=268435456
7.  dfs.datanode.max.xcievers=131072
8.  dfs.datanode.handler.count=100
9.  ipc.server.listen.queue.size=256
10.  -Xmx16384m -Xms16384m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=70
-XX:+UseMembar -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:PrintFLSStatistics=1 -Xloggc:/logs/gc.log
11. I have tried other GC settings but they don't seem to have any real
impact on GC performance in this case
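
For completeness, the manual compaction mentioned in setting 5 is just a
request through HBaseAdmin between loads (the major_compact shell
command is equivalent); a minimal sketch with a placeholder table name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CompactBetweenLoads {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // majorCompact only queues the request; the region servers run the
    // compaction asynchronously.
    admin.majorCompact("file_table");
    admin.close();
  }
}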

Any advice is appreciated.

Thanks
