Subject: Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)
From: Elliott Clark
To: dev@hbase.apache.org
Date: Mon, 21 Sep 2015 23:12:44 -0700

Commented up on the jira. But I think there's a pretty easy solution that
we can do, and it should be possible in the near future. We will continue
to have issues in situations that are highly contended on just a small
number of rows, but there's not a whole lot that I can see to make that
situation much faster.

On Mon, Sep 21, 2015 at 10:14 PM, Stack wrote:

> Back to this problem. Simple tests confirm that, as is, the
> single-queue-backed MVCC instance can slow Region ops if some other row is
> slow to complete. In particular Increment, checkAndPut, and batch mutations
> are affected. I opened HBASE-14460 to start in on a fix up. Let's see if we
> can somehow scope mvcc to a row, or at least shard mvcc, so not all Region
> ops are paused.
>
> St.Ack
>
>
> On Mon, Sep 14, 2015 at 4:15 AM, 鈴木俊裕 (Toshihiro Suzuki) wrote:
>
> > > Thank you for the below reasoning (with accompanying helpful diagram).
> > > Makes sense. Let me hack up a test case to help with the illustration. It
> > > is as though the mvcc should be scoped to a row only... Writes against
> > > other rows should not hold up my read of my row. Tag an mvcc with a 'row'
> > > scope so we can see which on-going writes pertain to the current operation?
> >
> > Thank you St.Ack! I think this approach would work.
> >
> > > You need to read back the increment and have it be 'correct' at increment
> > > time?
> >
> > Yes, we need it.
> >
> > I would like to help if there is anything I can do.
> >
> > Thanks,
> > Toshihiro Suzuki
> >
> >
> > 2015-09-13 14:11 GMT+09:00 Stack :
> >
> > > Thank you for the below reasoning (with accompanying helpful diagram).
> > > Makes sense.
> > > Let me hack up a test case to help with the illustration. It
> > > is as though the mvcc should be scoped to a row only... Writes against
> > > other rows should not hold up my read of my row. Tag an mvcc with a 'row'
> > > scope so we can see which on-going writes pertain to the current operation?
> > >
> > > You need to read back the increment and have it be 'correct' at increment
> > > time?
> > >
> > > (This is a good one)
> > >
> > > Thank you Toshihiro Suzuki
> > > St.Ack
> > >
> > > On Sat, Sep 12, 2015 at 8:09 AM, 鈴木俊裕 (Toshihiro Suzuki) wrote:
> > >
> > > > St.Ack,
> > > >
> > > > Thank you for your response.
> > > >
> > > > The reason I concluded that "A region lock (not a row lock) seems to occur
> > > > in waitForPreviousTransactionsComplete()" is as follows.
> > > >
> > > > An increment operation has 3 MVCC-related steps:
> > > >
> > > > 1. mvcc.waitForPreviousTransactionsComplete();
> > > > https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6712
> > > >
> > > > 2. w = mvcc.beginMemstoreInsertWithSeqNum(mvccNum);
> > > > https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6721
> > > >
> > > > 3. mvcc.completeMemstoreInsertWithSeqNum(w, walKey);
> > > > https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6893
> > > >
> > > > I think that MultiVersionConsistencyControl's writeQueue can cause a region
> > > > lock.
> > > > https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConsistencyControl.java#L42-L43
> > > >
> > > > Step 2 adds a WriteEntry to writeQueue.
> > > > https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConsistencyControl.java#L102-L108
> > > >
> > > > Step 3 removes the WriteEntry from writeQueue:
> > > > completeMemstoreInsertWithSeqNum(w, walKey) ->
> > > > waitForPreviousTransactionsComplete(e) -> advanceMemstore(w)
> > > > https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConsistencyControl.java#L127
> > > > https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConsistencyControl.java#L235
> > > > https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConsistencyControl.java#L160
> > > >
> > > > Step 1 adds a WriteEntry w in beginMemstoreInsert() to writeQueue and waits
> > > > until writeQueue is empty or writeQueue.getFirst() == w.
> > > > https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConsistencyControl.java#L201-L204
> > > > https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConsistencyControl.java#L206-L241
> > > >
> > > > So I think that while a handler thread is between step 2 and step 3, the
> > > > other handler threads can be kept waiting at step 1 until that thread
> > > > completes step 3.
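> > > >
> > > > To make the queueing concrete, here is a simplified, self-contained sketch
> > > > of the behaviour I am describing (illustrative only; this is not the actual
> > > > MultiVersionConsistencyControl code, and the class and method names are
> > > > made up):
> > > >
> > > >   import java.util.LinkedList;
> > > >
> > > >   // One queue per region: every write, for any row, goes through it.
> > > >   class RegionWideQueueSketch {
> > > >     private final LinkedList<Object> writeQueue = new LinkedList<Object>();
> > > >
> > > >     // Step 2 of the increment flow: enqueue an entry for this write (any row).
> > > >     synchronized Object beginWrite() {
> > > >       Object entry = new Object();
> > > >       writeQueue.add(entry);
> > > >       return entry;
> > > >     }
> > > >
> > > >     // Step 1: enqueue a marker entry and block until everything queued before
> > > >     // it (for ANY row in the region) has completed. This is where the handler
> > > >     // threads pile up.
> > > >     synchronized void waitForPreviousTransactionsComplete()
> > > >         throws InterruptedException {
> > > >       Object self = beginWrite();
> > > >       while (writeQueue.getFirst() != self) {
> > > >         wait();
> > > >       }
> > > >       writeQueue.remove(self);
> > > >       notifyAll();
> > > >     }
> > > >
> > > >     // Step 3: remove our entry and wake up everyone queued behind it.
> > > >     synchronized void completeWrite(Object entry) {
> > > >       writeQueue.remove(entry);
> > > >       notifyAll();
> > > >     }
> > > >   }
> > > >
> > > > Because there is a single writeQueue per MVCC instance (i.e. per region), an
> > > > entry can only advance once every entry queued before it, for any row, has
> > > > completed.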
> > > >
> > > > The blocking is depicted in this diagram:
> > > > https://gist.githubusercontent.com/brfrn169/cb4f2c157129330cd932/raw/86d6aae5667b0fe006b16fed80f1b0c4945c7fd0/region_lock.png
> > > >
> > > > Actually, in the thread dump of our region server, many handler threads
> > > > (RW.default.writeRpcServer.handler) wait at step 1
> > > > (waitForPreviousTransactionsComplete()):
> > > > https://gist.githubusercontent.com/brfrn169/cb4f2c157129330cd932/raw/86d6aae5667b0fe006b16fed80f1b0c4945c7fd0/thread_dump.txt
> > > >
> > > > Many handler threads wait at this line:
> > > > https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConsistencyControl.java#L224
> > > >
> > > > > Is it possible you are contending on a counter post-upgrade? Is it
> > > > > possible that all these threads are trying to get to the same row to update
> > > > > it? Could the app behavior have changed? Or are you thinking increment
> > > > > itself has slowed significantly?
> > > >
> > > > We have just upgraded HBase, not changed the app behavior. We think
> > > > increment itself has slowed significantly. Before upgrading HBase,
> > > > throughput and latency were good. Currently, to cope with this problem, we
> > > > split the regions finely.
> > > >
> > > > Thanks,
> > > >
> > > > Toshihiro Suzuki
> > > >
> > > >
> > > > 2015-09-09 15:29 GMT+09:00 Stack :
> > > >
> > > > > On Tue, Sep 8, 2015 at 10:22 PM, 鈴木俊裕 (Toshihiro Suzuki) wrote:
> > > > >
> > > > > > Ted,
> > > > > >
> > > > > > Thank you for your response.
> > > > > >
> > > > > > I uploaded the complete stack trace to Gist:
> > > > > > https://gist.github.com/brfrn169/cb4f2c157129330cd932
> > > > > >
> > > > > > I think that the increment operation works as follows:
> > > > > >
> > > > > > 1. get row lock
> > > > > > 2. mvcc.waitForPreviousTransactionsComplete() // wait for all prior MVCC transactions to finish
> > > > > > 3. mvcc.beginMemstoreInsertWithSeqNum() // start a transaction
> > > > > > 4. get previous values
> > > > > > 5. create KVs
> > > > > > 6. write to Memstore
> > > > > > 7. write to WAL
> > > > > > 8. release row lock
> > > > > > 9. mvcc.completeMemstoreInsertWithSeqNum() // complete the transaction
> > > > > >
> > > > > > An instance of MultiVersionConsistencyControl has a pending queue of
> > > > > > writes named writeQueue.
> > > > > > Step 2 puts a WriteEntry w to writeQueue and waits until writeQueue is
> > > > > > empty or writeQueue.getFirst() == w.
> > > > > > Step 3 puts a WriteEntry to writeQueue and step 9 removes the WriteEntry
> > > > > > from writeQueue.
> > > > > >
> > > > > > I think that when a handler thread is processing between step 2 and
> > > > > > step 9, the other handler threads can wait until that thread completes
> > > > > > step 9.
> > > > >
> > > > > That is right. We need to read after all outstanding updates are done...
> > > > > because we need to read the latest update before we go to modify/increment
> > > > > it.
> > > > >
> > > > > How do you make out this?
> > > > >
> > > > > "A region lock (not a row lock) seems to occur in
> > > > > waitForPreviousTransactionsComplete()."
> > > > >
> > > > > In 0.98.x we did this:
> > > > >
> > > > > mvcc.completeMemstoreInsert(mvcc.beginMemstoreInsert());
> > > > >
> > > > > ... and in 1.0 we do this:
> > > > >
> > > > > mvcc.waitForPreviousTransactionsComplete(), which is this....
> > > > >
> > > > > + public void waitForPreviousTransactionsComplete() {
> > > > > +   WriteEntry w = beginMemstoreInsert();
> > > > > +   waitForPreviousTransactionsComplete(w);
> > > > > + }
> > > > >
> > > > > The mvcc and region sequenceid were merged in 1.0
> > > > > (https://issues.apache.org/jira/browse/HBASE-8763). Previously, mvcc and
> > > > > region sequenceid would spin independent of each other. Perhaps this is
> > > > > responsible for some of the slowdown.
> > > > >
> > > > > That said, looking in your thread dump, we seem to be down in the Get. If
> > > > > you do a bunch of thread dumps in a row, where is the lock-holding thread?
> > > > > In Get or writing Increment... or waiting on sequence id?
> > > > >
> > > > > Is it possible you are contending on a counter post-upgrade? Is it
> > > > > possible that all these threads are trying to get to the same row to update
> > > > > it? Could the app behavior have changed? Or are you thinking increment
> > > > > itself has slowed significantly?
> > > > >
> > > > > St.Ack
> > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Toshihiro Suzuki
> > > > > >
> > > > > >
> > > > > > 2015-09-09 0:05 GMT+09:00 Ted Yu :
> > > > > >
> > > > > > > In HRegion#increment(), we lock the row (not the region):
> > > > > > >
> > > > > > > try {
> > > > > > >   rowLock = getRowLock(row);
> > > > > > >
> > > > > > > Can you pastebin the complete stack trace?
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > On Tue, Sep 8, 2015 at 2:01 AM, 鈴木俊裕 (Toshihiro Suzuki) wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > We upgraded our cluster from CDH5.3.1(HBase0.98.6) to
> > > > > > > > CDH5.4.5(HBase1.0.0) and we experience a slowdown in the increment
> > > > > > > > operation.
> > > > > > > >
> > > > > > > > Here's an extract from a thread dump of one of the RegionServers in
> > > > > > > > our cluster:
> > > > > > > >
> > > > > > > > Thread 68 (RW.default.writeRpcServer.handler=15,queue=5,port=60020):
> > > > > > > >   State: BLOCKED
> > > > > > > >   Blocked count: 21689888
> > > > > > > >   Waited count: 39828360
> > > > > > > >   Blocked on java.util.LinkedList@3474e4b2
> > > > > > > >   Blocked by 63 (RW.default.writeRpcServer.handler=10,queue=0,port=60020)
> > > > > > > >   Stack:
> > > > > > > >     java.lang.Object.wait(Native Method)
> > > > > > > >     org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.waitForPreviousTransactionsComplete(MultiVersionConsistencyControl.java:224)
> > > > > > > >     org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.waitForPreviousTransactionsComplete(MultiVersionConsistencyControl.java:203)
> > > > > > > >     org.apache.hadoop.hbase.regionserver.HRegion.increment(HRegion.java:6712)
> > > > > > > >     org.apache.hadoop.hbase.regionserver.RSRpcServices.increment(RSRpcServices.java:501)
> > > > > > > >     org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:570)
> > > > > > > >     org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1901)
> > > > > > > >     org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31451)
> > > > > > > >     org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2035)
> > > > > > > >     org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > > > > > > >     org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > > > > > > >     org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > > > > > > >     java.lang.Thread.run(Thread.java:745)
> > > > > > > >
> > > > > > > > There are many similar threads in the thread dump.
> > > > > > > >
> > > > > > > > I read the source code and I think this is caused by changes to
> > > > > > > > MultiVersionConsistencyControl. A region lock (not a row lock) seems
> > > > > > > > to occur in waitForPreviousTransactionsComplete().
> > > > > > > >
> > > > > > > > We also wrote performance test code for the increment operation,
> > > > > > > > using 100 threads, and ran it in local mode.
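> > > > > > > >
> > > > > > > > The test is roughly of the following shape -- a simplified,
> > > > > > > > self-contained sketch using the 1.0-style client API, not the exact
> > > > > > > > code we ran; the table name, column names, and op counts are
> > > > > > > > placeholders, and in this sketch each thread increments its own row:
> > > > > > > >
> > > > > > > >   import java.util.concurrent.ExecutorService;
> > > > > > > >   import java.util.concurrent.Executors;
> > > > > > > >   import java.util.concurrent.TimeUnit;
> > > > > > > >   import org.apache.hadoop.conf.Configuration;
> > > > > > > >   import org.apache.hadoop.hbase.HBaseConfiguration;
> > > > > > > >   import org.apache.hadoop.hbase.TableName;
> > > > > > > >   import org.apache.hadoop.hbase.client.Connection;
> > > > > > > >   import org.apache.hadoop.hbase.client.ConnectionFactory;
> > > > > > > >   import org.apache.hadoop.hbase.client.Table;
> > > > > > > >   import org.apache.hadoop.hbase.util.Bytes;
> > > > > > > >
> > > > > > > >   public class IncrementBench {
> > > > > > > >     public static void main(String[] args) throws Exception {
> > > > > > > >       final int threads = 100;
> > > > > > > >       final int opsPerThread = 1000;
> > > > > > > >       Configuration conf = HBaseConfiguration.create();
> > > > > > > >       ExecutorService pool = Executors.newFixedThreadPool(threads);
> > > > > > > >       long startMs = System.currentTimeMillis();
> > > > > > > >       try (Connection conn = ConnectionFactory.createConnection(conf)) {
> > > > > > > >         for (int t = 0; t < threads; t++) {
> > > > > > > >           final byte[] row = Bytes.toBytes("row-" + t);
> > > > > > > >           pool.submit(new Runnable() {
> > > > > > > >             public void run() {
> > > > > > > >               // Each worker gets its own Table (Table is not thread-safe).
> > > > > > > >               try (Table table = conn.getTable(TableName.valueOf("bench"))) {
> > > > > > > >                 for (int i = 0; i < opsPerThread; i++) {
> > > > > > > >                   table.incrementColumnValue(row, Bytes.toBytes("f"),
> > > > > > > >                       Bytes.toBytes("count"), 1L);
> > > > > > > >                 }
> > > > > > > >               } catch (Exception e) {
> > > > > > > >                 e.printStackTrace();
> > > > > > > >               }
> > > > > > > >             }
> > > > > > > >           });
> > > > > > > >         }
> > > > > > > >         pool.shutdown();
> > > > > > > >         pool.awaitTermination(1, TimeUnit.HOURS);
> > > > > > > >       }
> > > > > > > >       long elapsedMs = System.currentTimeMillis() - startMs;
> > > > > > > >       long totalOps = (long) threads * opsPerThread;
> > > > > > > >       System.out.println("Throughput(op/s): " + (totalOps * 1000L / elapsedMs));
> > > > > > > >       System.out.println("Latency(ms): " + ((double) elapsedMs / opsPerThread));
> > > > > > > >     }
> > > > > > > >   }
> > > > > > > >
> > > > > > > > (On 0.98 the equivalent test would use HTable/HConnectionManager
> > > > > > > > instead of Table/ConnectionFactory.)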
> > > > > > > >
> > > > > > > > The result is shown below:
> > > > > > > >
> > > > > > > > CDH5.3.1(HBase0.98.6)
> > > > > > > > Throughput(op/s): 12757, Latency(ms): 7.975072509210629
> > > > > > > >
> > > > > > > > CDH5.4.5(HBase1.0.0)
> > > > > > > > Throughput(op/s): 2027, Latency(ms): 49.11840157868772
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Toshihiro Suzuki