From Stack <st...@duboce.net>
Subject Re: HBase stability
Date Tue, 14 Dec 2010 18:52:14 GMT
On Tue, Dec 14, 2010 at 6:47 AM, baggio liu <baggioss@gmail.com> wrote:
>> This can be true.  Yes.  What are you suggesting here?  What should we
>> tune?
> In fact, we found that invalidation is slow because of the datanode's
> per-heartbeat invalidate limit. Many invalid blocks stay in the namenode
> and cannot be dispatched to the datanodes. We simply increased the number
> of blocks a datanode fetches per heartbeat.

Interesting.  So you changed this hardcoding?

  public static final int BLOCK_INVALIDATE_CHUNK = 100;
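
To see why that constant matters, here is a back-of-the-envelope sketch. It assumes a 3-second datanode heartbeat (the usual HDFS default) and a hypothetical backlog of one million blocks queued for a single datanode. The class and method names are made up for illustration and are not Hadoop code.

```java
public class InvalidateDrainEstimate {

    /**
     * Seconds needed to hand `pendingBlocks` invalidations to one datanode
     * when at most `chunkPerHeartbeat` blocks go out on each heartbeat.
     */
    public static long drainSeconds(long pendingBlocks, int chunkPerHeartbeat, int heartbeatSeconds) {
        // Round up: a partial chunk still costs one full heartbeat.
        long heartbeats = (pendingBlocks + chunkPerHeartbeat - 1) / chunkPerHeartbeat;
        return heartbeats * heartbeatSeconds;
    }

    public static void main(String[] args) {
        long pending = 1000000L;   // hypothetical backlog for one datanode
        int heartbeatSeconds = 3;  // assumed heartbeat interval

        // With the hardcoded chunk of 100: 10000 heartbeats, 30000 seconds.
        System.out.println(drainSeconds(pending, 100, heartbeatSeconds));
        // With a chunk of 1000 (an arbitrary larger value): 3000 seconds.
        System.out.println(drainSeconds(pending, 1000, heartbeatSeconds));
    }
}
```

Under these assumptions the backlog takes over eight hours to drain at 100 blocks per heartbeat, which is consistent with invalid blocks piling up in the namenode.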

>> hdfs-630 has been applied to the branch-0.20-append branch (It's also
>> in CDH IIRC).
> Yes, HDFS-630 is necessary, but it's not enough. When a disk failure is
> found, it'll exclude the datanode.
> We can simply kick the failed disk out and make a block report to the
> namenode.

Is this a code change you made Baggio?

>> Usually if RegionServer has issues getting to HDFS, it'll shut itself
>> down.  This is 'normal' perhaps overly-defensive behavior.  The story
>> should be better in 0.90 but would be interested in any list you might
>> have where you think we should be able to catch and continue.
> Yes, absolutely it's overly-defensive behavior, and if the region server
> fails an HDFS operation, fail-fast may be a good recovery mechanism. But
> some IOExceptions are not fatal; in our branch, we added a retry mechanism
> to common fs operations, such as exists().

Excellent.  Any chance of your contributing back your internal branch
fixes?  They'd be welcome.
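
A retry wrapper of the kind described might look roughly like the sketch below. This is not the code from Baggio's branch, which is not shown in this thread. The class name, the attempt limit, and the linear backoff are all assumptions made for illustration, and the filesystem call is represented by a plain `Callable` so the example has no Hadoop dependency.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryingFsCall {

    /**
     * Runs `op`, retrying only on IOException, up to `maxAttempts` times.
     * Any other exception propagates immediately (fail-fast is preserved).
     */
    public static <T> T withRetries(Callable<T> op, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Linear backoff between attempts (an arbitrary choice).
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        // Every attempt failed: surface the last IOException to the caller.
        throw last;
    }
}
```

A caller would wrap a single operation, for example `withRetries(() -> fs.exists(path), 3, 200L)`, where `fs` and `path` are whatever filesystem handle and path the caller already holds. Which IOExceptions are actually safe to retry is the hard part and is not addressed here.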

> My point is that whenever the system starts or scans, the region server
> (as DFSClient) creates too many connections to the datanodes. The number
> of connections grows with the number of store files; when the store file
> count gets large, the number of connections gets out of control.


> In most scenarios, scans are local. In our cluster, more than 95% of
> connections are not active (the connection is established, but no data is
> being read). In our branch, we added a timeout to close idle connections.
> In the long term, we can reuse connections between DFSClient and datanode
> (maybe this kind of reuse can be fulfilled by the RPC framework).

The above sounds great.  So, the connection is reestablished
automatically by DFSClient when a read comes in (I suppose HADOOP-3831
does this for you)?  Is the timeout in DFSClient or in HBase?
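
As a rough illustration of the idle-timeout idea (not the actual change in Baggio's branch, and not DFSClient code), the sketch below tracks a last-used timestamp per connection and closes any that have been idle too long. The class name, the string keys, and passing the clock in as a parameter are assumptions made to keep the example small and testable.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IdleConnectionReaper {

    /** A connection plus the time it was last read from. */
    private static final class Tracked {
        final Closeable conn;
        volatile long lastUsedMillis;

        Tracked(Closeable conn, long nowMillis) {
            this.conn = conn;
            this.lastUsedMillis = nowMillis;
        }
    }

    private final Map<String, Tracked> conns = new ConcurrentHashMap<>();
    private final long idleTimeoutMillis;

    public IdleConnectionReaper(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    /** Start tracking a newly opened connection. */
    public void register(String key, Closeable conn, long nowMillis) {
        conns.put(key, new Tracked(conn, nowMillis));
    }

    /** Record activity, e.g. on every read. */
    public void touch(String key, long nowMillis) {
        Tracked t = conns.get(key);
        if (t != null) {
            t.lastUsedMillis = nowMillis;
        }
    }

    /** Close and forget every connection idle for at least the timeout. */
    public int reap(long nowMillis) {
        int closed = 0;
        Iterator<Map.Entry<String, Tracked>> it = conns.entrySet().iterator();
        while (it.hasNext()) {
            Tracked t = it.next().getValue();
            if (nowMillis - t.lastUsedMillis >= idleTimeoutMillis) {
                it.remove();
                try {
                    t.conn.close();
                } catch (IOException ignored) {
                    // Best effort: the connection is being dropped anyway.
                }
                closed++;
            }
        }
        return closed;
    }

    public int size() {
        return conns.size();
    }
}
```

In practice `reap` would be called periodically from a background thread, and a reader that finds its connection gone would have to reopen it, which is the reestablishment question raised above.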

>> Yes.  Any suggestions from your experience?
> -XX:GCTimeRatio=10 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0
> -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled
> -XX:CMSInitiatingOccupancyFraction=70 -XX:SoftRefLRUPolicyMSPerMB=0
> -XX:MaxTenuringThreshold=7
> We made some attempts at GC tuning. To minimize application pauses, we use
> the parallel collector in the young gen and CMS in the old gen. The
> CMSInitiatingOccupancyFraction threshold is the same as in our Hadoop
> cluster config; we have no idea why it's 70 and not 71 ...
> May I get the GC strategy in your cluster?

I just took a look at one of our production servers.  Here is our config:

export SERVER_GC_OPTS="-XX:+DoEscapeAnalysis -XX:+AggressiveOpts
-XX:+UseConcMarkSweepGC -XX:NewSize=64m -XX:MaxNewSize=64m
-XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails

This is what we are running:

java version "1.6.0_14-ea"
Java(TM) SE Runtime Environment (build 1.6.0_14-ea-b04)
Java HotSpot(TM) 64-Bit Server VM (build 14.0-b13, mixed mode)

(I say what we are running because I believe DoEscapeAnalysis is
disabled in later versions of the JVM... I think it's the same for

I think NewSize should probably be changed -- the argument for such a
small NewSize was that w/o it, the young generation pause times grew
to become substantial.

Regarding the CMSInitiatingOccupancyFraction of 88%, I wonder how much of
an effect it is having.

That said, the above seems to be working for us.

Regarding your settings, you set:

-XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0

I haven't looked at the source but going by this message,
the above just seems to be setting defaults.  Is that your

Do you monitor your GC activity?
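
Besides reading the `-verbose:gc` / `-XX:+PrintGCDetails` output, one lightweight way to watch GC activity from inside the JVM is the standard `java.lang.management` API. The sketch below is only an illustration (the class name is made up); it reports the cumulative collection count and time for each collector the running JVM exposes.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcActivityReport {

    /** One line per collector: name, number of collections, total time spent. */
    public static String report() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": collections=").append(gc.getCollectionCount())
              .append(", totalMillis=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

The collector names depend on the JVM and the GC flags in use. Sampling these counters at intervals and diffing them gives a rough view of how often each generation is collected and how much time it costs.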

>    1. Currently, the datanode sends more data than the DFSClient requests
> (mostly a whole block). This helps throughput, but it may hurt latency.
> I imagine we could add an additional RPC read/write interface between
> DFSClient and datanode to reduce overhead in HDFS reads and writes.

When you say block above, you mean hfile block? That's what hbase is
requesting though?  Pardon me if I'm not understanding what you are

>    2. On the datanode side, the meta file and block file are repeatedly
> opened and closed on every block operation. To reduce latency, we can
> reuse these file handles. We could even redesign the storage mechanism in
> the datanode.

Yes.  Hopefully something can be done about this pretty soon.

Thanks for the above,

