From: "Srikanth P. Shreenivas" <Srikanth_Shreenivas@mindtree.com>
To: user@hbase.apache.org
Subject: RE: HBase Read and Write Issues in Multithreaded Environments
Date: Sun, 10 Jul 2011 11:50:57 +0000

Hi St.Ack,

I noticed that one of the region server machines had its clock running one day in the future. I corrected the date. I ran into some issues after restarting; I was getting errors with respect to .META. and other things which I did not understand much. Also, the status command in the hbase shell was displaying "3 servers, 1 dead", whereas I had only 3 region servers.

So, I cleaned "/hbase" (to get to the real problem) and restarted the HBase nodes.

After starting all 3 HBase nodes, I ran the test app again and observed the log files of all 3 region servers.

I noticed that when the test app seemed hung, the web app's thread that was serving the request had gone to sleep at the code below. I think it stayed like that for around 10 minutes before Tomcat probably interrupted it.
Thread-#8 - Thread t@29
   java.lang.Thread.State: TIMED_WAITING
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:791)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:589)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:564)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:415)
        at org.apache.hadoop.hbase.client.ServerCallable.instantiateServer(ServerCallable.java:57)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1002)
        at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:514)
        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:133)
        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:648)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:702)
        - locked java.lang.Object@75826e08
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:593)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:564)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:415)
        at org.apache.hadoop.hbase.client.ServerCallable.instantiateServer(ServerCallable.java:57)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1002)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
        <.. app specific trace removed ...>
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

============================================================

After 10 minutes, web app log showed:

2011-07-10 16:50:28,804 [Thread-#8] ERROR [persistence.handler.HBaseHandler] - Exception occurred in searchData:
java.io.IOException: Giving up trying to get region server: thread is interrupted.
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)

============================================================
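For context, the TIMED_WAITING frame above is the client sleeping between retries inside locateRegionInMeta while it tries to find the region in .META.; with the default hbase.client.retries.number and hbase.client.pause settings, nested retry loops like this can keep the calling thread blocked for minutes before something interrupts it. A minimal sketch, assuming the 0.90 client API and a hypothetical row id (this is an illustration, not the poster's code), of lowering those knobs so a stuck lookup fails fast:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FastFailGet {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Example values only: fewer retries and a shorter pause make a
            // stuck region lookup surface as an error in seconds rather than
            // letting the calling thread sleep for many minutes.
            conf.setInt("hbase.client.retries.number", 3);
            conf.setInt("hbase.client.pause", 500);

            HTable table = new HTable(conf, "employeedata");      // table name taken from the thread
            try {
                Get get = new Get(Bytes.toBytes("some-row-id"));  // hypothetical row id
                Result result = table.get(get);
                System.out.println("empty=" + result.isEmpty());
            } finally {
                table.close();
            }
        }
    }

Whether shorter retries are appropriate depends on the application; the point is only that the sleep seen in the dump is governed by these client-side settings.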
I did not see anything happening on the region server either; the log only had occasional entries like these:

2011-07-10 16:43:53,648 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=6.52 MB, free=788.08 MB, max=794.6 MB, blocks=0, accesses=1080, hits=0, hitRatio=0.00%%, cachingAccesses=0, cachingHits=0, cachingHitsRatio=�%, evictions=0, evicted=0, evictedPerRun=NaN
2011-07-10 16:48:53,649 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=6.52 MB, free=788.08 MB, max=794.6 MB, blocks=0, accesses=1080, hits=0, hitRatio=0.00%%, cachingAccesses=0, cachingHits=0, cachingHitsRatio=�%, evictions=0, evicted=0, evictedPerRun=NaN
2011-07-10 16:53:53,648 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=6.52 MB, free=788.08 MB, max=794.6 MB, blocks=0, accesses=1080, hits=0, hitRatio=0.00%%, cachingAccesses=0, cachingHits=0, cachingHitsRatio=�%, evictions=0, evicted=0, evictedPerRun=NaN

Regards,
Srikanth

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Saturday, July 09, 2011 9:41 PM
To: user@hbase.apache.org
Subject: Re: HBase Read and Write Issues in Multithreaded Environments

You read the requirements section in our docs and you have upped the
ulimits, nprocs, etc?  http://hbase.apache.org/book/os.html

If you know the row, can you deduce the regionserver it's talking to?
(Below is the client failure -- we need to figure out what's up on the
server side.)  Once you've done that, can you check its logs?  See if
you can figure out anything about why it hangs.

Thanks,
St.Ack
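The requirements Stack refers to are the OS limits described at http://hbase.apache.org/book/os.html: HBase and HDFS open many files and spawn many threads, so the file-descriptor (and on some systems process) limits for the user running the daemons usually need to be raised well above the common default of 1024. A sketch of what that can look like in /etc/security/limits.conf, assuming the daemons run as a user named hadoop and using illustrative numbers:

    # /etc/security/limits.conf -- illustrative values; see the HBase book
    # for the recommendations that match your distribution
    hadoop  -  nofile  32768
    hadoop  -  nproc   32000

On Debian/Ubuntu-style systems, pam_limits typically also has to be enabled (e.g. "session required pam_limits.so" in /etc/pam.d/common-session) for these limits to apply to login sessions.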
On Sat, Jul 9, 2011 at 6:14 AM, Srikanth P. Shreenivas wrote:
> Hi St.Ack,
>
> We upgraded to CDH3 (hadoop-0.20-0.20.2+923.21-1.noarch.rpm, hadoop-hbase-0.90.1+15.18-1.noarch.rpm, hadoop-zookeeper-3.3.3+12.1-1.noarch.rpm).
>
> I ran the same test which I was running for the app when it was on CDH2.  The test app posts a request to the web app every 100ms, and the web app reads an HBase record, performs some logic, and saves an audit trail by writing another HBase record.
>
> When our app was running on CDH2, I observed the below issue for every 10 to 15 requests.
> With CDH3, this issue is not happening at all.  So the situation seems to have improved a lot, and our app is a lot more stable.
>
> However, I am still seeing one issue.  There are some requests (around 1%) which are not able to read the record from HBase, and the get call hangs for almost 10 minutes.  This is what I see in the application log:
>
> 2011-07-09 18:27:25,537 [gridgain-#6%authGrid%] ERROR [my.app.HBaseHandler]  - Exception occurred in searchData:
> java.io.IOException: Giving up trying to get region server: thread is interrupted.
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
>        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>
>        <...app specific trace removed...>
>
>        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at org.gridgain.grid.util.runnable.GridRunnable.run(GridRunnable.java:194)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:619)
>
> I am running the test against the same record, so all my "get" calls are for the same row id.
>
> It will be of immense help if you can provide some input on whether we are missing some configuration settings, or whether there is a way to get around this.
>
> Thanks,
> Srikanth
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Wednesday, June 29, 2011 7:48 PM
> To: user@hbase.apache.org
> Subject: Re: HBase Read and Write Issues in Multithreaded Environments
>
> Go to CDH3 if you can.  CDH2 is also old.
> St.Ack
>
> On Wed, Jun 29, 2011 at 7:15 AM, Srikanth P. Shreenivas wrote:
>> Thanks St.Ack for the inputs.
>>
>> Will upgrading to CDH3 help, or is there a version within CDH2 that you recommend we should upgrade to?
>>
>> Regards,
>> Srikanth
>>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
>> Sent: Wednesday, June 29, 2011 11:16 AM
>> To: user@hbase.apache.org
>> Subject: Re: HBase Read and Write Issues in Multithreaded Environments
>>
>> Can you upgrade?  That release is > 18 months old.  A bunch has
>> happened in the meantime.
>>
>> For retries exhausted, check what's going on on the remote regionserver
>> that you are trying to write to.  It's probably struggling, and that's
>> why requests are not going through -- or the client missed the fact
>> that the region moved (all stuff that should be working better in the
>> latest hbase).
>>
>> St.Ack
>>
>> On Tue, Jun 28, 2011 at 9:51 PM, Srikanth P. Shreenivas wrote:
>>> Hi,
>>>
>>> We are using an HBase 0.20.3 (hbase-0.20-0.20.3-1.cloudera.noarch.rpm) cluster in distributed mode with Hadoop 0.20.2 (hadoop-0.20-0.20.2+320-1.noarch).
>>> We are using pretty much the default configuration; the only thing we have customized is that we have allocated 4GB of RAM in /etc/hbase-0.20/conf/hbase-env.sh.
>>>
>>> In our setup, we have a web application that reads a record from HBase and writes a record as part of each web request.  The application is hosted in Apache Tomcat 7 and is a stateless web application providing a REST-like web service API.
>>>
>>> We are observing that our reads and writes time out once in a while.  This happens more for writes.
>>> We see the exceptions below in our application logs:
>>>
>>> Exception Type 1 - During Get:
>>> ---------------------------------------
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server 10.1.68.36:60020 for region employeedata,be8784ac8b57c45625a03d52be981b88097c2fdc,1308657957879, row 'd51b74eb05e07f96cee0ec556f5d8d161e3281f3', but failed after 10 attempts.
>>> Exceptions:
>>> java.io.IOException: Call to /10.1.68.36:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>> java.nio.channels.ClosedByInterruptException
>>> java.nio.channels.ClosedByInterruptException
>>> java.nio.channels.ClosedByInterruptException
>>> java.nio.channels.ClosedByInterruptException
>>> java.nio.channels.ClosedByInterruptException
>>> java.nio.channels.ClosedByInterruptException
>>> java.nio.channels.ClosedByInterruptException
>>> java.nio.channels.ClosedByInterruptException
>>> java.nio.channels.ClosedByInterruptException
>>>
>>>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1048)
>>>        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:417)
>>>
>>> Exception Type 2 - During Put:
>>> ---------------------------------------------
>>> Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server 10.1.68.34:60020 for region audittable,,1309183872019, row '2a012017120f80a801b28f5f66a83dc2a8882d1b', but failed after 10 attempts.
>>> Exceptions:
>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>
>>>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1048)
>>>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$3.doCall(HConnectionManager.java:1239)
>>>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1161)
>>>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1247)
>>>        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:609)
>>>        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:474)
>>>
>>> Any inputs on why this is happening, or how to rectify it, will be of immense help.
>>>
>>> Thanks,
>>> Srikanth
>>>
>>> Srikanth P Shreenivas | Principal Consultant | MindTree Ltd. | Global Village, RVCE Post, Mysore Road, Bangalore-560 059, INDIA | Voice +91 80 26264000 / Fax +91 80 2626 4100 | Mob: 9880141059 | email: srikanth_shreenivas@mindtree.com | www.mindtree.com
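For reference, in the 0.20/0.90 client API an HTable instance is not safe for concurrent use by multiple threads, so a Tomcat-style web app that does a Get and a Put per request typically shares one Configuration and gives each request its own table handle, for example via HTablePool. A minimal sketch of that pattern follows; it assumes the 0.90-era HTablePool API and hypothetical column family/qualifier names (only the table names are taken from the exceptions above), so it is an illustration rather than the poster's actual handler code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.HTablePool;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AuditedLookup {
        // One shared configuration and pool per JVM; each pooled HTable
        // is only ever used by one thread at a time.
        private static final Configuration CONF = HBaseConfiguration.create();
        private static final HTablePool POOL = new HTablePool(CONF, 50);

        public byte[] readAndAudit(String rowId) throws Exception {
            HTableInterface employees = POOL.getTable("employeedata");
            HTableInterface audit = POOL.getTable("audittable");
            try {
                // Read the record for this request.
                Result r = employees.get(new Get(Bytes.toBytes(rowId)));
                byte[] value = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload")); // hypothetical column

                // Write the audit trail record.
                Put trail = new Put(Bytes.toBytes(rowId));
                trail.add(Bytes.toBytes("d"), Bytes.toBytes("accessedAt"),
                          Bytes.toBytes(System.currentTimeMillis()));
                audit.put(trail);
                return value;
            } finally {
                // Return the tables to the pool (0.90-era API) rather than closing them.
                POOL.putTable(audit);
                POOL.putTable(employees);
            }
        }
    }

A related design note: creating a fresh HBaseConfiguration per request can mean a new client connection and ZooKeeper session per request, which is another common source of stalls under load, so sharing one Configuration as above is usually preferable.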