hbase-user mailing list archives

From Andrew Purtell <apurt...@yahoo.com>
Subject Re: region server problem
Date Thu, 09 Oct 2008 02:04:17 GMT
I've seen "No live nodes contain current block" from DFS as a
symptom of what looks like (at a minimum) a race during
compaction when DFS is coming apart under load. This is my
hypothesis based on log examination at the time: Certain
mapfile data and/or index files are apparently deleted before
they should be. The namenode instructs the data nodes to
delete all block replicas associated with the file, yet
somewhere a region server still thinks it has a lease on
one or more of those blocks, even though every replica of
those blocks has been deleted... and that's that. The region
server goes down, and the region is toast.
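
For anyone trying to confirm this kind of damage after the fact,
HDFS fsck is the quickest check I know of (the path below is just
an example; point it at your HBase root):

   $ bin/hadoop fsck /hbase -files -blocks -locations

Missing blocks reported under a region's mapfiles would line up
with the "No live nodes contain current block" errors.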

I was in firefighting mode, so I stamped out the problem in our
deployment by adding nodes to spread the load and by increasing
the CPU resources available to the DFS namenode. Unfortunately
that meant I did not have the time to dive deep into an analysis
of this.

There are probably some actions that HBase should not attempt
while DFS is severely loaded -- compactions or optional flushes,
for example. I don't know much about the namenode protocol; is
there a way to get load estimates?
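
The closest thing to a load estimate I know of is the datanode
report, which at least shows per-node capacity and usage (not
namenode CPU, admittedly):

   $ bin/hadoop dfsadmin -report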

   - Andy

> From: stack <stack@duboce.net>
> Subject: Re: region server problem
> To: hbase-user@hadoop.apache.org
> Date: Wednesday, October 8, 2008, 2:29 PM
> You should update to 0.2.1 if you can.  Make sure you've
> upped your file descriptors too:  See
> http://wiki.apache.org/hadoop/Hbase/FAQ#6.  Also 
> see how to enable DEBUG in the same FAQ.
> 
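(For reference, the descriptor bump is usually done like this on
Linux; the value is only an example:

   $ ulimit -n                # current per-process limit
   1024
   # in /etc/security/limits.conf, for the user running hadoop/hbase:
   hadoop  soft  nofile  32768
   hadoop  hard  nofile  32768

...then re-login as that user so the new limit takes effect.)
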
> Something odd is up when you see messages like this out of
> HDFS: 'No live nodes contain current block'.  That's lost
> data.
> 
> Or messages like this, 'compaction completed on region 
> search1,r3_1_3_c157476,1223360357528 in 18mins, 39sec'
> -- i.e. that compactions are taking so long -- would seem to
> indicate your machines are severely overloaded or underpowered
> or both.  Can you study load when the upload is running on
> these machines?  Perhaps try throttling back to see if hbase
> survives longer?
> 
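(Standard tools are enough to watch load during the upload; the
sample intervals are arbitrary:

   $ vmstat 5        # run queue, swap, io wait
   $ iostat -x 5     # per-disk utilization
   $ top             # per-process cpu and memory
)
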
> The regionserver will output a thread dump from its RPC layer
> if there is a critical error -- an OOME -- or if it has been
> hung up for a long time, IIRC.
> 
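(You can also force a dump by hand: the JVM writes a full thread
dump to its stdout -- which lands in the '.out' file -- on SIGQUIT:

   $ jps | grep HRegionServer    # find the region server pid
   $ kill -QUIT <pid>
)
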
> Check the '.out' logs for your hbase install too, to see if
> they contain any errors.  Grep the datanode logs as well for
> OOME or "too many open file handles".
> 
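(Something like the following, with the log paths adjusted to your
install; the exact "too many open files" wording varies by JVM:

   $ grep -i OutOfMemory $HADOOP_HOME/logs/*datanode*
   $ grep -i 'too many open files' $HADOOP_HOME/logs/*datanode*
)
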
> St.Ack
> 
> Rui Xing wrote:
> > Hi All,
> >
> > 1). We are doing performance testing on hbase. The test
> > environment is 3 data nodes and 1 name node distributed across
> > 4 machines. We started one region server on each data node, and
> > one insertion client on each data node machine to insert the
> > data. But as the data was inserted, the region servers crashed
> > one by one. One of the reasons is listed as follows:
> >
> > ===>
> > 2008-10-07 14:47:01,519 WARN org.apache.hadoop.dfs.DFSClient: Exception
> > while reading from blk_-806310822584979460 of
> > /hbase/search1/1201761134/col9/mapfiles/3578469984425427480/data
> > from 10.2.6.102:50010: java.io.IOException: Premeture EOF from inputStream
> >
> > ... ...
> >
> > 2008-10-07 14:47:01,521 INFO org.apache.hadoop.dfs.DFSClient: Could not
> > obtain block blk_-806310822584979460 from any node: java.io.IOException
> > 2008-10-07 14:52:25,229 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> > compaction completed on region search1,r3_1_3_c157476,1223360357528 in
> > 18mins, 39sec
> > 2008-10-07 14:52:25,238 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> > regionserver/0.0.0.0:60020.compactor exiting
> > 2008-10-07 14:52:25,284 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> > closed search1,r3_1_3_c157476,1223360357528
> > 2008-10-07 14:52:25,291 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> > closed -ROOT-,,0
> > 2008-10-07 14:52:25,291 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> > aborting server at: 10.2.6.104:60020
> > 2008-10-07 14:52:25,291 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> > regionserver/0.0.0.0:60020 exiting
> > 2008-10-07 14:52:25,511 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> > Starting shutdown thread.
> > 2008-10-07 14:52:25,511 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> > Shutdown thread complete
> > ===<
> >
> > 2). Another question: under what circumstances will the region
> > server print thread information in its logs, as below? It
> > appears among the normal log records.
> > ===>
> > 35 active threads
> > Thread 1281 (IPC Client connection to d3v1.corp.alimama.com/10.2.6.101:54310):
> >   State: RUNNABLE
> >   Blocked count: 0
> >   Waited count: 0
> >   Stack:
> >     java.util.Hashtable.remove(Hashtable.java:435)
> >     org.apache.hadoop.ipc.Client$Connection.run(Client.java:297)
> > ... ...
> > ===<
> >
> > We use hadoop 0.17.1 and hbase 0.2.0. Any clues would be
> > greatly appreciated.
> >
> > Regards,
> > -Ray
> >
> >


      
