Subject: Re: HBase balancer performance
From: Ted Yu
To: user@hbase.apache.org
Date: Fri, 22 Aug 2014 14:00:49 -0700

Was the last 'Successfully deleted unassigned node' log entry close to
T+45 (min)? If so, it means that AssignmentManager was busy assigning
these regions during that period.
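If it helps, something like this should pull out the last such event from
the master log (the log path and file name pattern below are just
assumptions -- adjust them for your install):

    # timestamp of the last unassigned-node deletion seen by the master
    grep 'Successfully deleted unassigned node' \
        /var/log/hbase/hbase-*-master-*.log | tail -1

Comparing that timestamp with the first 'balance hri=' line should show
how long the reassignment phase actually ran.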
Cheers

On Fri, Aug 22, 2014 at 1:50 PM, Eric K wrote:

> After about a minute everything appears to have been moved, since the
> master only prints out lines like:
>
> 2014-08-15 16:40:56,945 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=8 regions=126 average=15.75 mostloaded=16 leastloaded=15
>
> Within the first minute, though, after adding the new nodes, I see a lot
> of activity like this:
>
> 2014-08-15 16:40:49,142 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=usertable,user9673372036854775807,1408128925670.e3c3e82d52f54692a70e7b59aeb56b31., src=172-19-152-247,60020,1408131968632, dest=172-19-149-231,60020,1408135240788
> 2014-08-15 16:40:49,142 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region usertable,user9673372036854775807,1408128925670.e3c3e82d52f54692a70e7b59aeb56b31. (offlining)
> 2014-08-15 16:40:49,142 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x147db34a2c80000 Creating unassigned node for e3c3e82d52f54692a70e7b59aeb56b31 in a CLOSING state
> 2014-08-15 16:40:49,181 DEBUG org.apache.hadoop.hbase.master.ServerManager: New connection to 172-19-152-247,60020,1408131968632
> 2014-08-15 16:40:49,189 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 172-19-152-247,60020,1408131968632 for region usertable,user9673372036854775807,1408128925670.e3c3e82d52f54692a70e7b59aeb56b31.
> 2014-08-15 16:40:49,189 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=usertable,user1068528100307123131,1408128925659.1fe73c5f470b78e7e9d2a0691659967a., src=172-19-152-241,60020,1408131968630, dest=172-19-149-89,60020,1408135238164
> 2014-08-15 16:40:49,189 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region usertable,user1068528100307123131,1408128925659.1fe73c5f470b78e7e9d2a0691659967a. (offlining)
> 2014-08-15 16:40:49,189 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x147db34a2c80000 Creating unassigned node for 1fe73c5f470b78e7e9d2a0691659967a in a CLOSING state
> 2014-08-15 16:40:49,223 DEBUG org.apache.hadoop.hbase.master.ServerManager: New connection to 172-19-152-241,60020,1408131968630
> 2014-08-15 16:40:49,223 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=172-19-149-231,60020,1408135240788, region=7b08e35b1fb9f445d91a90ae89bb401a
> 2014-08-15 16:40:49,224 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for othertable,user1342640501535615655,1408128955190.7b08e35b1fb9f445d91a90ae89bb401a. from 172-19-149-231,60020,1408135240788; deleting unassigned node
> 2014-08-15 16:40:49,224 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x147db34a2c80000 Deleting existing unassigned node for 7b08e35b1fb9f445d91a90ae89bb401a that is in expected state RS_ZK_REGION_OPENED
> 2014-08-15 16:40:49,225 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 172-19-152-241,60020,1408131968630 for region usertable,user1068528100307123131,1408128925659.1fe73c5f470b78e7e9d2a0691659967a.
> 2014-08-15 16:40:49,225 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=usertable,user9873372036854775807,1408128925670.3eeb1e68835cbb98c9f796d437fe7ebc., src=172-19-152-241,60020,1408131968630, dest=172-19-149-89,60020,1408135238164
> 2014-08-15 16:40:49,225 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region usertable,user9873372036854775807,1408128925670.3eeb1e68835cbb98c9f796d437fe7ebc. (offlining)
> 2014-08-15 16:40:49,225 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x147db34a2c80000 Creating unassigned node for 3eeb1e68835cbb98c9f796d437fe7ebc in a CLOSING state
> 2014-08-15 16:40:49,281 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x147db34a2c80000 Successfully deleted unassigned node for region 7b08e35b1fb9f445d91a90ae89bb401a in expected state RS_ZK_REGION_OPENED
>
> Thanks,
> Eric
>
>
> On Fri, Aug 22, 2014 at 2:24 PM, Ted Yu wrote:
>
> > Suppose you add new nodes at time T, can you check the master log
> > between T and T+45 (min) to see how long region movements took?
> >
> > Cheers
> >
> >
> > On Fri, Aug 22, 2014 at 11:19 AM, Eric K wrote:
> >
> > > I'm experimenting with HBase (0.94.4) and Hadoop (1.0.4) runtime
> > > balancers on just a very tiny 4-node cluster and finding that
> > > performance is bad for an hour after adding new nodes, even though
> > > all the data is supposedly offloaded within a few minutes.
> > >
> > > Using YCSB-generated requests, I load the database with about 18 GB
> > > worth of data across 12 million records. The keyspace is initially
> > > pre-split to have ~30 regions per node with single replication of
> > > data. Then I hit HBase with read requests from a set of clients so
> > > that there are 2000 requests outstanding, and new requests are
> > > immediately made after replies are received. While these requests
> > > are running, after about 2 minutes, I double the nodes from 4 to 8,
> > > add the new node information to the slaves and regionservers files,
> > > start up the new datanode and regionserver processes, and call the
> > > hdfs balancer with the smallest possible threshold of 1. The
> > > hbase.balancer.period is also set to 10 seconds so as to respond
> > > fast to new nodes. The dfs.balance.bandwidthPerSec is set to 8 MB/s
> > > per node, but I have also tried higher numbers that don't bottleneck
> > > the offload rate and gotten similar results.
> > >
> > > The expected response is that about half the data and regions are
> > > offloaded to the new nodes in the next few minutes, and the logs
> > > (and dfsadmin reports) confirm that this is indeed happening.
> > > However, I'm seeing the throughput drop from 3000 to 500 requests/sec
> > > when the nodes are added, with latencies jumping from ~1s to 4 or 5
> > > seconds, and this poor performance persists for almost 45 minutes,
> > > when it abruptly improves to 4000 requests/sec and 500ms latencies.
> > >
> > > I'd appreciate any ideas as to what could be causing that 45-minute
> > > performance delay, and how I can debug this further. If I add the
> > > new nodes and balance onto them before starting up any read
> > > requests, performance is much better from the start.
> > >
> > > Thanks,
> > > Eric
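For reference, a minimal sketch of the balancer-related settings described
above (property names are the HBase 0.94 / Hadoop 1.x ones; the values are
just the numbers Eric quotes, not recommendations):

    In hbase-site.xml on the master:

      <property>
        <name>hbase.balancer.period</name>
        <value>10000</value>    <!-- run the region balancer every 10 s -->
      </property>

    In hdfs-site.xml on the datanodes:

      <property>
        <name>dfs.balance.bandwidthPerSec</name>
        <value>8388608</value>  <!-- 8 MB/s per datanode for block balancing -->
      </property>

    Then the HDFS balancer is kicked off with the tightest threshold:

      hadoop balancer -threshold 1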