From: "Espinoza,Carlos" <espinozca@oclc.org>
To: dev@hbase.apache.org; "lars hofhansl"
Date: Thu, 16 Aug 2012 15:12:59 -0400
Subject: RE: Extremely long flush times

Lars,

Glad I could help. It was cool to see how you approached the problem and
came to a solution. Thanks for being so quick in addressing this!

Carlos

-----Original Message-----
From: lars hofhansl [mailto:lhofhansl@yahoo.com]
Sent: Thursday, August 16, 2012 2:40 PM
To: user@hbase.apache.org; dev@hbase.apache.org; lars hofhansl
Subject: Re: Extremely long flush times

This is now committed to 0.94 (i.e. it will be in 0.94.2) and 0.96. The fix
turned out to be pretty simple (but in an intricate part of HBase).

Thanks for the program demonstrating the problem, Carlos, that was extremely
helpful!

-- Lars

________________________________
From: lars hofhansl
To: "user@hbase.apache.org"; "dev@hbase.apache.org"; lars hofhansl
Sent: Sunday, August 12, 2012 2:41 PM
Subject: Re: Extremely long flush times

I filed HBASE-6561 for this (Jira is back).
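
(For illustration, a minimal sketch of the iterate-before-seek heuristic
Lars describes in the message below. The class, the MAX_FORWARD_STEPS
threshold, and the NavigableSet<String> stand-in for the memstore are
assumptions of the sketch, not the actual MemStore code or the HBASE-6561
patch.)

import java.util.Iterator;
import java.util.NavigableSet;

// Sketch: step forward a bounded number of entries before falling back to
// an expensive tail-set based reseek.
public class EagerReseekSketch {
    // Threshold suggested in the thread ("say 100 KVs or so").
    private static final int MAX_FORWARD_STEPS = 100;

    private final NavigableSet<String> kvs;   // stand-in for the memstore's KV set
    private Iterator<String> iter;
    private String current;

    public EagerReseekSketch(NavigableSet<String> kvs) {
        this.kvs = kvs;
        this.iter = kvs.iterator();
        this.current = iter.hasNext() ? iter.next() : null;
    }

    /** Position the scanner at the first entry >= key. */
    public String reseek(String key) {
        // Cheap path: iterate forward a little; most reseeks only skip a few entries.
        for (int i = 0; i < MAX_FORWARD_STEPS && current != null; i++) {
            if (current.compareTo(key) >= 0) {
                return current;
            }
            current = iter.hasNext() ? iter.next() : null;
        }
        if (current != null && current.compareTo(key) < 0) {
            // Expensive path: rebuild the iterator from a tail set (the thread
            // notes these tail sets can contain 300,000 or more entries).
            iter = kvs.tailSet(key, true).iterator();
            current = iter.hasNext() ? iter.next() : null;
        }
        return current;
    }
}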
----- Original Message -----
From: lars hofhansl
To: "dev@hbase.apache.org"; "user@hbase.apache.org"
Sent: Saturday, August 11, 2012 12:42 AM
Subject: Re: Extremely long flush times

A possible solution is to have the MemStoreScanner reseek eagerly (i.e. just
iterate forward) for a bit (say 100 KVs or so). If that is not fruitful, then
issue the expensive reseek. I'll try that tomorrow.

(In this case the tailset created by the reseek often has 300,000 or more
entries in it. That is not necessarily a problem, since it is not recreated.)

-- Lars

----- Original Message -----
From: lars hofhansl
To: "user@hbase.apache.org"; hbase-dev
Sent: Saturday, August 11, 2012 12:26 AM
Subject: Re: Extremely long flush times

It turns out the problem is not the loop in MemStoreScanner, but excessive
calls to StoreScanner.reseek.

As a test I changed ScanWildcardColumnTracker.checkVersion to return
MatchCode.SKIP instead of MatchCode.SEEK_NEXT_COL (when the max number of
versions is reached), and I do not see this behavior (the update loop that
previously would not get past 15 or so now happily goes on until I stop the
client).

Not sure what the conclusion would be. Seeking the memstore seems to be
expensive, so it should only be done when many KVs can be skipped with a
seek; otherwise we should just iterate. It is not clear how to find this out
ahead of time.

I'm open to suggestions.

-- Lars
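
(A simplified illustration of the experiment described above; this is not
the real ScanWildcardColumnTracker, just a sketch of the SKIP vs.
SEEK_NEXT_COL decision and the tradeoff it controls.)

// When the max number of versions for a column has been read, the matcher
// can either ask the scanners to SEEK to the next column (an expensive
// reseek of every scanner, including the MemStoreScanner) or simply SKIP
// and keep iterating.
public class VersionCheckSketch {

    enum MatchCode { INCLUDE, SKIP, SEEK_NEXT_COL }

    private final int maxVersions;
    private int versionsSeen;

    VersionCheckSketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    MatchCode checkVersion() {
        if (++versionsSeen <= maxVersions) {
            return MatchCode.INCLUDE;
        }
        // Original behavior: seek past the remaining versions of this column.
        // The test described above returns SKIP here instead, trading a
        // possibly long forward iteration for avoiding the expensive reseek.
        return MatchCode.SEEK_NEXT_COL;
    }
}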
----- Original Message -----
From: lars hofhansl
To: "user@hbase.apache.org"; hbase-dev
Sent: Friday, August 10, 2012 11:43 PM
Subject: Re: Extremely long flush times

Ran your test code (thanks Carlos). Found two things:

1. Store.internalFlushCache(...) should be calling
   StoreScanner.next(List, int limit) - currently it does not set a limit.
   (But this is not the problem.)
2. With jstack I found that the code is stuck in a loop in
   Memstore.MemstoreScanner.getNext(...)

Here's the relevant part of the jstack:

"IPC Server handler 6 on 60020" daemon prio=10 tid=0x00007f0574625000 nid=0x720c runnable [0x00007f05669e7000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.hbase.regionserver.MemStore$MemStoreScanner.getNext(MemStore.java:726)
        at org.apache.hadoop.hbase.regionserver.MemStore$MemStoreScanner.seekInSubLists(MemStore.java:761)
        - locked <0x00000000c4a8a860> (a org.apache.hadoop.hbase.regionserver.MemStore$MemStoreScanner)
        at org.apache.hadoop.hbase.regionserver.MemStore$MemStoreScanner.reseek(MemStore.java:800)
        - locked <0x00000000c4a8a860> (a org.apache.hadoop.hbase.regionserver.MemStore$MemStoreScanner)
        at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:54)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:299)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.reseek(KeyValueHeap.java:244)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:522)
        - eliminated <0x00000000ccb54860> (a org.apache.hadoop.hbase.regionserver.StoreScanner)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:403)
        - locked <0x00000000ccb54860> (a org.apache.hadoop.hbase.regionserver.StoreScanner)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:127)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3459)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3406)
        - locked <0x00000000c59ee610> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3423)
        - locked <0x00000000c59ee610> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4171)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4144)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1958)
        at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1389)

At the same time I find that flush cannot finish:

"regionserver60020.cacheFlusher" daemon prio=10 tid=0x00007f05749ab000 nid=0x71fe waiting for monitor entry [0x00007f05677f6000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.updateReaders(StoreScanner.java:443)
        - waiting to lock <0x00000000ccb54860> (a org.apache.hadoop.hbase.regionserver.StoreScanner)
        at org.apache.hadoop.hbase.regionserver.Store.notifyChangedReadersObservers(Store.java:904)
        at org.apache.hadoop.hbase.regionserver.Store.updateStorefiles(Store.java:893)
        at org.apache.hadoop.hbase.regionserver.Store.access$600(Store.java:107)
        at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.commit(Store.java:2291)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1455)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1353)
        at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1294)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:406)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:380)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:243)
        at java.lang.Thread.run(Thread.java:722)

Both StoreScanner.updateReaders and StoreScanner.reseek are synchronized.
So the problem seems to be that MemStoreScanner loops forever in
getNext(...). I took a jstack a bunch of times during execution; this
always shows up.

Need to dig a bit more, I do not see a good way to deal with this, yet.

-- Lars
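
(A toy, self-contained illustration of the blocking pattern visible in the
two stack traces above: two methods synchronized on the same scanner object,
with a handler thread holding the monitor for a long time while the flusher
waits. This is not HBase code; the names and timings are made up.)

public class ScannerLockSketch {

    synchronized void reseek() throws InterruptedException {
        Thread.sleep(5000);   // stand-in for a single reseek that loops a long time in getNext()
    }

    synchronized void updateReaders() {
        // flusher side: swap the memstore snapshot for the new store file
    }

    public static void main(String[] args) throws Exception {
        ScannerLockSketch scanner = new ScannerLockSketch();

        Thread handler = new Thread(() -> {
            try { scanner.reseek(); } catch (InterruptedException ignored) { }
        });
        handler.start();
        Thread.sleep(100);    // let the handler grab the scanner's monitor first

        long start = System.currentTimeMillis();
        scanner.updateReaders();   // blocks until reseek() releases the monitor
        System.out.println("flush blocked for ~" + (System.currentTimeMillis() - start) + " ms");
    }
}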
----- Original Message -----
From: "Espinoza,Carlos"
To: user@hbase.apache.org
Sent: Friday, August 10, 2012 3:12 PM
Subject: RE: Extremely long flush times

We were able to replicate this behavior in a pseudo-distributed HBase
(hbase-0.94.1) environment. We wrote a test program that creates a test
table "MyTestTable" and populates it with random rows, then creates a row
with 60,000 columns and repeatedly updates it. Each column has an 18 byte
qualifier and a 50 byte value. In our tests, when we ran the program, we
usually never got beyond 15 updates before it would flush for a really long
time. The rows being updated are about 4 MB each (minus any HBase metadata).

It doesn't seem to be caused by GC. I turned on GC logging and didn't see
any long pauses.

This is the GC log during the flush:
http://pastebin.com/vJKKXDx5

This is the regionserver log with debug on during the same flush:
http://pastebin.com/Fh5213mg

This is the test program we wrote:
http://pastebin.com/aZ0k5tx2

You should be able to just compile it and run it against a running HBase
cluster.

$ java TestTable

Carlos
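
(A compressed sketch along the lines of the TestTable program described
above, written against the 0.94-era client API; the actual program is at the
pastebin link. The table/family names, row key, and update count are
assumptions of the sketch, and the random-row pre-population step is
omitted.)

import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TestTableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        if (!admin.tableExists("MyTestTable")) {
            HTableDescriptor desc = new HTableDescriptor("MyTestTable");
            desc.addFamily(new HColumnDescriptor("f"));
            admin.createTable(desc);
        }

        HTable table = new HTable(conf, "MyTestTable");
        Random rnd = new Random();
        byte[] family = Bytes.toBytes("f");
        byte[] row = Bytes.toBytes("wide-row");

        // Repeatedly rewrite the same 60,000-column row, with 18-byte
        // qualifiers and 50-byte values, as described in the thread.
        for (int update = 0; update < 100; update++) {
            Put put = new Put(row);
            for (int col = 0; col < 60000; col++) {
                byte[] qualifier = Bytes.toBytes(String.format("qualifier-%08d", col));
                byte[] value = new byte[50];
                rnd.nextBytes(value);
                put.add(family, qualifier, value);
            }
            long start = System.currentTimeMillis();
            table.put(put);
            System.out.println("update " + update + " took "
                + (System.currentTimeMillis() - start) + " ms");
        }
        table.close();
    }
}

Compile it against the HBase 0.94 client jars and run it against a running
cluster, similar to the $ java TestTable invocation above.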
-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Monday, August 06, 2012 8:04 PM
To: user@hbase.apache.org
Subject: Re: Extremely long flush times

Hi Ron,

I think this is caused by GC; for example, there's 7 minutes between the
last blocking message and the end of the flush:

2012-08-06 11:44:36,348 INFO
org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC
Server handler 42 on 9009' on region
Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6.:
memstore size 271.9m is >= than blocking 256.0m size

2012-08-06 11:51:28,139 INFO
org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of
~155.4m/162905016, currentsize=116.5m/122172880 for region
Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
501337ms, sequenceid=1792264092, compaction requested=true

During that time I would expect that clients got socket timeouts, tried to
write again, and data would just pile up in the RPC queue.

Do you have the gc log handy? Also, did you set the zookeeper session
timeout higher than the default of 3 minutes?

Thx,

J-D

On Mon, Aug 6, 2012 at 2:37 PM, Buckley,Ron wrote:
> We ran into this situation today on our production instance. I'm
> wondering if anyone has seen it or knows a way around it.
>
> We were repeatedly updating the same 60,000 column wide row. With each
> update, we were updating nearly every column of the row. Each update
> generated nearly 9 MB in the WAL.
>
> This worked OK most of the time. We saw frequent messages like this in
> the region server log, indicating a 2.6 second flush time:
>
> 2012-08-06 11:42:52,316 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
> of ~155.4m/162931400, currentsize=2.7k/2760 for region
> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
> 2610ms, sequenceid=1792263953, compaction requested=false
>
> However, sometimes we would see flush times of around 8 - 10 minutes:
>
> 2012-08-06 11:51:28,139 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
> of ~155.4m/162905016, currentsize=116.5m/122172880 for region
> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
> 501337ms, sequenceid=1792264092, compaction requested=true
>
> During these long flushes, we would end up taking all the IPC handler
> slots into that region server that we had configured (50). It looked
> to our application like the region server was unavailable.
>
> After the long delay, the region server would continue running fine,
> but then do it again a few minutes later.
>
> I thought it was most likely something with garbage collection, so I
> got a stack trace during one of the events. The trace doesn't look
> like GC was running.
>
> http://pastebin.com/ANBiUzX4 - Stack trace when the region server was
> unreachable
>
> http://pastebin.com/naTYQxZR - Region server log during the same
> interval
>
> Versions:
> HBase 0.92.1, r1298924
> Hadoop 0.20.2-cdh3u3
>
> ----------------------------------------------------------------------
> Ron Buckley
> x6365
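
(Since J-D asks about the ZooKeeper session timeout: a minimal hbase-site.xml
fragment for setting it is shown below. The 180000 ms value simply spells out
the 3-minute default he mentions; an appropriate value depends on the cluster
and is not a recommendation.)

<property>
  <name>zookeeper.session.timeout</name>
  <value>180000</value> <!-- milliseconds -->
</property>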