hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Major Compaction Causes Cluster Failure
Date Fri, 17 Sep 2010 16:40:45 GMT
Sounds like there's an underlying HDFS issue. You should check those
machines' datanode logs around the time of the failure for any exceptions.

J-D
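
A rough sketch of that log check in Python, for anyone who would rather not page through the files by hand. It assumes the datanode logs use the same log4j timestamp layout as the region server excerpt quoted below. The function name and the 10-minute window are illustrative choices, not anything from HBase or Hadoop.

```python
from datetime import datetime, timedelta
import re

# Assumed log4j layout: "YYYY-MM-DD HH:MM:SS,mmm LEVEL logger: message"
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ")
# Lines worth a second look: elevated log levels or anything naming an exception.
SUSPECT = re.compile(r"\b(WARN|ERROR|FATAL)\b|Exception")


def suspect_lines(lines, failure_time, window_minutes=10):
    """Return log lines near failure_time that look like warnings or exceptions."""
    window = timedelta(minutes=window_minutes)
    in_window = False
    hits = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            in_window = abs(stamp - failure_time) <= window
        # Lines without a timestamp (stack frames, wrapped text) inherit the
        # window state of the last timestamped line.
        if in_window and SUSPECT.search(line):
            hits.append(line.rstrip("\n"))
    return hits
```

Calling suspect_lines(open(path), datetime(2010, 9, 16, 20, 37, 59)) would scan one log, where path is wherever the datanode log lives on that machine.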

On Fri, Sep 17, 2010 at 9:14 AM, Scott Whitecross <swhitecross@gmail.com> wrote:
> Hi all -
>
> A couple of nights ago I enabled cron jobs to run major compactions against
> a few of the tables that I use in HBase.  This has caused multiple worker
> machines on the cluster to fail.  Whether because of the compaction or the
> loss of the worker nodes, many of the regions are stuck in transition with a
> state of PENDING_CLOSE.  I believe resetting the HBase master will solve
> that, which I will do after a few of the current processes finish.  What is
> the risk of losing the regions stuck in transition?  (Running HBase 0.20.5)
>
> I am concerned about not being able to successfully run compactions on our
> cluster.  It was my understanding that major compactions happened
> automatically around every 24 hours, so I'm surprised forcing the process to
> happen caused issues.  Any suggestions on how to start debugging the issue,
> or what settings to look at?  Starting to dig through logs shows that HBase
> couldn't access HDFS on the same box. (Log Below)
>
> Currently running a cluster with 40 workers, a dedicated jobtracker box, and
> namenode/hbase master.
>
> The cron call that caused the issue:
> 0 2 * * * echo "major_compact 'hbase_table' " | /usr/lib/hbase-0.20/bin/hbase shell >> /tmp/hbase_table 2>&1
>
> 2010-09-16 20:37:12,917 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> started.  Attempting to free 125488064 bytes
> 2010-09-16 20:37:12,952 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> completed. Freed 115363256 bytes.  Priority Sizes: Single=294.45364MB
> (308757016), Multi=488.6598MB (512396944),Memory=224.37555MB (235274808)
> 2010-09-16 20:37:29,011 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> started.  Attempting to free 125542912 bytes
> 2010-09-16 20:37:29,040 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> completed. Freed 115365552 bytes.  Priority Sizes: Single=333.65866MB
> (349866464), Multi=449.44424MB (471276440),Memory=224.37555MB (235274808)
> 2010-09-16 20:37:39,626 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=951.4796MB (997698720), Free=245.1954MB (257106016), Max=1196.675MB
> (1254804736), Counts: Blocks=38388, Access=5559267, Hit=4006883,
> Miss=1552384, Evictions=260, Evicted=667954, Ratios: Hit
> Ratio=72.07574248313904%, Miss Ratio=27.924257516860962%,
> Evicted/Run=2569.053955078125
> 2010-09-16 20:37:59,037 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> connect to /11.11.11.11:50010 for file
> /hbase/my_hbase_table/1606901662/my_family/3686634885331153450 for block
> -7375581532956939954:java.io.EOFException
> at java.io.DataInputStream.readShort(DataInputStream.java:298)
> at
> org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1373)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchBlockByteRange(DFSClient.java:1830)
>
>
> Thanks.
>
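
To know which datanode's logs to open first, the DFSClient warning in the excerpt above can be pulled apart mechanically. This is a minimal sketch that assumes the message has been unwrapped onto one line and has exactly the shape shown in the post. Other Hadoop versions may word it differently, and the FailedRead type and parse_failed_read name are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class FailedRead(NamedTuple):
    datanode: str   # address the client could not read from
    port: int
    path: str       # HDFS file being read
    block_id: int
    error: str      # exception class reported by the client


# Pattern modelled on the single WARN line quoted in the post (an assumption).
PATTERN = re.compile(
    r"Failed to connect to /(?P<host>[\d.]+):(?P<port>\d+) "
    r"for file (?P<path>\S+) "
    r"for block (?P<block>-?\d+):(?P<error>[\w.$]+)"
)


def parse_failed_read(message: str) -> Optional[FailedRead]:
    """Extract datanode, file, and block from a DFSClient 'Failed to connect' line."""
    match = PATTERN.search(message)
    if match is None:
        return None
    return FailedRead(
        datanode=match.group("host"),
        port=int(match.group("port")),
        path=match.group("path"),
        block_id=int(match.group("block")),
        error=match.group("error"),
    )
```

The datanode address and block id it returns are the things to search for in that machine's datanode log around the same timestamp.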
