hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2774) Spin in ReadWriteConsistencyControl eating CPU (load > 40) and no progress running YCSB on clean cluster startup
Date Wed, 23 Jun 2010 22:23:49 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881924#action_12881924
] 

stack commented on HBASE-2774:
------------------------------

I tried 0.20.5.  It don't seem to have the issue -- least load doesn't go up crazy in same
place (It doesn't have the multput though as Ryan notes)

I added logging.  What I see are lots of updates where there is no spin at all and then we'll
hit a period where we can have crazy spinning as per this:

{code}
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=58ms, begin
took=908ns, w=writeNumber=389286, completed=true, spun=4161, count=10
2010-06-23 15:11:59,897 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=40ms, begin
took=1118ns, w=writeNumber=389234, completed=true, spun=0, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=50ms, begin
took=838ns, w=writeNumber=389287, completed=true, spun=0, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=74ms, begin
took=838ns, w=writeNumber=389289, completed=true, spun=9535, count=10
2010-06-23 15:11:59,897 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=64ms, begin
took=1117ns, w=writeNumber=389230, completed=true, spun=0, count=10
2010-06-23 15:11:59,897 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=56ms, begin
took=1118ns, w=writeNumber=389226, completed=true, spun=0, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=72ms, begin
took=1118ns, w=writeNumber=389290, completed=true, spun=0, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=68ms, begin
took=1396ns, w=writeNumber=389291, completed=true, spun=0, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=93ms, begin
took=1047ns, w=writeNumber=389292, completed=true, spun=0, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=78ms, begin
took=768ns, w=writeNumber=389293, completed=true, spun=0, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=69ms, begin
took=907ns, w=writeNumber=389295, completed=true, spun=0, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=86ms, begin
took=768ns, w=writeNumber=389296, completed=true, spun=0, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=76ms, begin
took=768ns, w=writeNumber=389298, completed=true, spun=3842, count=10
2010-06-23 15:11:59,897 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=47ms, begin
took=838ns, w=writeNumber=389225, completed=true, spun=0, count=10
2010-06-23 15:11:59,897 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=44ms, begin
took=908ns, w=writeNumber=389223, completed=true, spun=0, count=10
2010-06-23 15:11:59,897 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=58ms, begin
took=908ns, w=writeNumber=389214, completed=true, spun=0, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=66ms, begin
took=1467ns, w=writeNumber=389300, completed=true, spun=0, count=10
2010-06-23 15:11:59,897 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=66ms, begin
took=978ns, w=writeNumber=389211, completed=true, spun=0, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=62ms, begin
took=768ns, w=writeNumber=389302, completed=true, spun=1437, count=10
2010-06-23 15:11:59,897 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=59228ms,
begin took=2165ns, w=writeNumber=389195, completed=true, spun=12733714, count=10
2010-06-23 15:11:59,897 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=59184ms,
begin took=1397ns, w=writeNumber=389196, completed=true, spun=9228124, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=152ms,
begin took=908ns, w=writeNumber=389304, completed=true, spun=37705, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=179ms,
begin took=1048ns, w=writeNumber=389303, completed=true, spun=0, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=81ms, begin
took=1187ns, w=writeNumber=389305, completed=true, spun=0, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=83ms, begin
took=1467ns, w=writeNumber=389306, completed=true, spun=0, count=10
2010-06-23 15:11:59,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=46ms, begin
took=1117ns, w=writeNumber=389308, completed=true, spun=0, count=10
2010-06-23 15:11:59,897 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=59584ms,
begin took=1467ns, w=writeNumber=389190, completed=true, spun=21534587, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=47ms, begin
took=1676ns, w=writeNumber=389310, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=55ms, begin
took=1117ns, w=writeNumber=389312, completed=true, spun=2258, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=50ms, begin
took=1117ns, w=writeNumber=389313, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=57ms, begin
took=768ns, w=writeNumber=389315, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=51ms, begin
took=698ns, w=writeNumber=389316, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=46ms, begin
took=699ns, w=writeNumber=389317, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=43ms, begin
took=699ns, w=writeNumber=389318, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=41ms, begin
took=698ns, w=writeNumber=389319, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=51ms, begin
took=698ns, w=writeNumber=389320, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=55ms, begin
took=698ns, w=writeNumber=389321, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=50ms, begin
took=838ns, w=writeNumber=389322, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=44ms, begin
took=698ns, w=writeNumber=389323, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=46ms, begin
took=698ns, w=writeNumber=389324, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=42ms, begin
took=699ns, w=writeNumber=389325, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=50ms, begin
took=698ns, w=writeNumber=389326, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=39ms, begin
took=699ns, w=writeNumber=389327, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=53ms, begin
took=699ns, w=writeNumber=389328, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=43ms, begin
took=699ns, w=writeNumber=389329, completed=true, spun=0, count=10
2010-06-23 15:11:59,901 INFO org.apache.hadoop.hbase.regionserver.HRegion: elapsed=50ms, begin
took=699ns, w=writeNumber=389330, completed=true, spun=0, count=10
{code}

My nano-math is bad in the above... I should have divided by another 1000 so when I say 50ms
above, it should be 0.05ms.

> Spin in ReadWriteConsistencyControl eating CPU (load > 40) and no progress running
YCSB on clean cluster startup
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2774
>                 URL: https://issues.apache.org/jira/browse/HBASE-2774
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>
> When I try to do a YCSB load, RSs will spin up massive load but make no progress.  Seems
to happen to each RS in turn until they do their first flush.  They stay in the high-load
mode for maybe 5-10 minutes or so and then fall out of the bad condition.
> Here is my ugly YCSB command (Haven't gotten around to tidying it up yet):
> {code}
> $ java -cp build/ycsb.jar:/home/hadoop/current/conf/:/home/hadoop/current/hbase-0.21.0-SNAPSHOT.jar:/home/hadoop/current/lib/hadoop-core-0.20.3-append-r956776.jar:/home/hadoop/current/lib/zookeeper-3.3.1.jar:/home/hadoop/current/lib/commons-logging-1.1.1.jar:/home/hadoop/current/lib/log4j-1.2.15.jar
 com.yahoo.ycsb.Client -load -db com.yahoo.ycsb.db.HBaseClient  -P workloads/5050 -p columnfamily=values
-s -threads 100 -p recordcount=10000000
> {code}
> Cluster is 5 regionservers NOT running hadoop-core-0.20.3-append-r956776 but rather old
head of branch-0.20 hadoop.
> It seems that its easy to repro if you start fresh.  It might happen later in loading
but it seems as though after first flush, we're ok.
> It comes on pretty immediately.  The server that is taking on the upload has its load
start to climb gradually up into the 40s then stays there.  Later it falls when condtion clears.
> Here is content of my yahoo workload file:
> {code}
> recordcount=100000000
> operationcount=100000000
> workload=com.yahoo.ycsb.workloads.CoreWorkload
> readallfields=true
> readproportion=0.5
> updateproportion=0.5
> scanproportion=0
> insertproportion=0
> requestdistribution=zipfian
> {code}
> Here is my hbase-site.xml
> {code}
>   <property>
>     <name>hbase.regions.slop</name>
>     <value>0.01</value>
>     <description>Rebalance if regionserver has average + (average * slop) regions.
>     Default is 30% slop.
>     </description>
>   </property>
>   <property>
>     <name>hbase.zookeeper.quorum</name>
>     <value>XXXXXXXXX</value>
>   </property>
> <property>
>   <name>hbase.regionserver.hlog.blocksize</name>
>   <value>67108864</value>
>   <description>Block size for HLog files. To minimize potential data loss,
>     the size should be (avg key length) * (avg value length) * flushlogentries.
>     Default 1MB.
>   </description>
> </property>
> <property>
>   <name>hbase.hstore.blockingStoreFiles</name>
>   <value>25</value>
> </property>
> <property>
>   <name>hbase.rootdir</name>
>   <value>hdfs://svXXXXXX:9000/hbase</value>
>   <description>The directory shared by region servers.</description>
> </property>
> <property>
>   <name>hbase.cluster.distributed</name>
>   <value>true</value>
> </property>
> <property>
>   <name>zookeeper.znode.parent</name>
>   <value>/stack</value>
>   <description>
>     the path in zookeeper for this cluster
>   </description>
> </property>
> <property>
>   <name>hfile.block.cache.size</name>
>   <value>0.2</value>
>   <description>
>     The size of the block cache used by HFile/StoreFile. Set to 0 to disable.
>   </description>
> </property>
> <property>
>   <name>hbase.hregion.memstore.block.multiplier</name>
>   <value>8</value>
>   <description>
>     Block updates if memcache has hbase.hregion.block.memcache
>     time hbase.hregion.flush.size bytes.  Useful preventing
>     runaway memcache during spikes in update traffic.  Without an
>     upper-bound, memcache fills such that when it flushes the
>     resultant flush files take a long time to compact or split, or
>     worse, we OOME.
>   </description>
> </property>
> <property>
> <name>zookeeper.session.timeout</name>
> <value>60000</value>
> </property>
> <property>
>   <name>hbase.regionserver.handler.count</name>
>   <value>60</value>
>   <description>Count of RPC Server instances spun up on RegionServers
>     Same property is used by the HMaster for count of master handlers.
>     Default is 10.
>   </description>
> </property>
> <property>
>     <name>hbase.regions.percheckin</name>
>     <value>20</value>
> </property>
> <property>
>     <name>hbase.regionserver.maxlogs</name>
>     <value>128</value>
> </property>
> <property>
>     <name>hbase.regionserver.logroll.multiplier</name>
>     <value>2.95</value>
> </property>
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message