hbase-user mailing list archives

From Jonathan Gray <jl...@streamy.com>
Subject Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
Date Wed, 28 Oct 2009 18:56:49 GMT
These client error messages are not particularly descriptive as to the 
root cause (they are fatal errors, or close to it).

What is going on in your regionservers when these errors happen?  Check 
the master and RS logs.

Also, you definitely do not want 19 ZooKeeper nodes.  Reduce that to 3 
or 5 at most.
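For reference, a smaller ensemble of the kind suggested above would be listed in hbase-site.xml roughly like this (the hostnames are placeholders, not machines from this thread):

```xml
<!-- Sketch of a 3-node ZooKeeper quorum in hbase-site.xml.
     Host names are illustrative only. -->
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
```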

What is the hardware you are using for these nodes, and what settings do 
you have for heap/GC?
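Heap and GC settings of the kind discussed in this thread normally go in hbase-env.sh; a sketch along the lines of what the thread mentions (a 4GB heap plus CMS flags) might look like the following. The values mirror what is reported below and are not tuning advice:

```shell
# Sketch of hbase-env.sh settings matching what this thread describes;
# heap size is in MB, flags are the CMS options mentioned below.
export HBASE_HEAPSIZE=4000
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=8"
```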

JG

Zhenyu Zhong wrote:
> Stack,
> 
> Thank you very much for your comments.
> I am running a cluster with 20 nodes. I set 19 of them as both regionservers
> and ZooKeeper quorum members.
> The versions I am using are Hadoop 0.20.1 and HBase 0.20.1.
> I started with an empty table and tried to load 200 million records into it.
> Each record contains a key. In my MR program, during setup I open an HTable;
> in my mapper I fetch the row from the HTable via the record's key, make some
> changes to the columns, and write that row back to the HTable through
> TableOutputFormat by passing a Put. There are no reduce tasks involved.
> (Though it is unnecessary to fetch rows from an empty table, I intended to
> do that.)
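The read-modify-write flow described above might be sketched as follows in the HBase 0.20 / Hadoop 0.20 API. The class, table, family, and qualifier names are hypothetical (not taken from the thread), and the record format (one key per input line) is an assumption:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper matching the described flow: open an HTable in
// setup, fetch the existing row by the record's key, modify a column,
// and emit a Put for TableOutputFormat to apply. No reducer is needed.
public class UpdateMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // Table name is a placeholder; opened once per map task.
    table = new HTable(new HBaseConfiguration(), "mytable");
  }

  @Override
  public void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    // Assumes the record's key is the whole input line.
    byte[] key = Bytes.toBytes(record.toString().trim());
    Result existing = table.get(new Get(key)); // fetch current row by key
    Put put = new Put(key);
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("updated"));
    context.write(new ImmutableBytesWritable(key), put); // sent to TableOutputFormat
  }
}
```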
> 
> Additionally, when I reduced the number of regionservers and ZooKeeper
> quorum members, I got different errors:
> org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying
> to locate root region
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556)
>     at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:127)
>     at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:105)
>     at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
> 
> Many thanks in advance.
> zhenyu
> 
> 
> 
> 
> On Wed, Oct 28, 2009 at 12:39 PM, stack <stack@duboce.net> wrote:
> 
>> Whats your cluster topology?  How many nodes involved?  When you see the
>> below message, how many regions in your table?  How are you loading your
>> table?
>> Thanks,
>> St.Ack
>>
>> On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong <zhongresearch@gmail.com
>>> wrote:
>>> Nitay,
>>>
>>> I really appreciate your help.
>>>
>>> As Ryan suggested, I increased the ZooKeeper session timeout to 40 seconds,
>>> with the GC options -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC
>>> in place. I set the heap size to 4GB.  I also set vm.swappiness=0.
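A 40-second session timeout of the kind mentioned here would typically be set in hbase-site.xml, roughly as follows (the property takes milliseconds):

```xml
<!-- Sketch: 40-second ZooKeeper session timeout as discussed in this
     thread. The value is in milliseconds. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>40000</value>
</property>
```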
>>>
>>> However it still ran into problem. Please find the following errors.
>>>
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>>> contact region server x.x.x.x:60021 for region
>>> YYYY,117.99.7.153,1256396118155, row '1170491458', but failed after 10
>>> attempts.
>>> Exceptions:
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>> setting up proxy to /x.x.x.:60021 after attempts=1
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>>
>>>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001)
>>>        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:413)
>>>
>>>
>>> The input file is about 10GB, around 200 million rows of data.
>>> This load doesn't seem too large, yet these errors keep popping
>>> up.
>>>
>>> Does Regionserver need to be deployed to dedicated machines?
>>> Does Zookeeper need to be deployed to dedicated machines as well?
>>>
>>> Best,
>>> zhenyu
>>>
>>>
>>>
>>> On Wed, Oct 28, 2009 at 1:37 AM, nitay <nitayj@gmail.com> wrote:
>>>
>>>> Hi Zhenyu,
>>>>
>>>> Sorry for the delay. I started working on this a while back, before I
>>>> left my job for another company. Since then I haven't had much time to
>>>> work on HBase unfortunately :(. I'll try to dig up what I had and see
>>>> what shape it's in and update you.
>>>>
>>>> Cheers,
>>>> -n
>>>>
>>>>
>>>> On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote:
>>>>
>>>>> Sorry I must have mistyped, I meant to say "40 seconds".  You can
>>>>> still see multi-second pauses at times, so you need to give yourself a
>>>>> bigger buffer.
>>>>>
>>>>> The parallel threads argument should not be necessary, but you do need
>>>>> the UseConcMarkSweepGC flag as well.
>>>>>
>>>>> Let us know how it goes!
>>>>> -ryan
>>>>>
>>>>>
>>>>> On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong <zhongresearch@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Ryan,
>>>>>> I really appreciate your feedback.
>>>>>> I have set zookeeper.session.timeout to a value in seconds, which is
>>>>>> way higher than 40ms.
>>>>>> At the same time, -Xms is set to 4GB, which should be sufficient.
>>>>>> I also tried GC options like
>>>>>>
>>>>>>  -XX:ParallelGCThreads=8
>>>>>> -XX:+UseConcMarkSweepGC
>>>>>>
>>>>>> I even set the vm.swappiness=0
>>>>>>
>>>>>> However, I still came across the problem of a RegionServer shutting
>>>>>> itself down.
>>>>>>
>>>>>> Best,
>>>>>> zhong
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson <ryanobjc@gmail.com>
>>>>>> wrote:
>>>>>>> Set the ZK timeout to something like 40ms, and give the GC enough Xmx
>>>>>>> so you never risk entering the much dreaded concurrent-mode-failure
>>>>>>> whereby the entire heap must be GCed.
>>>>>>>
>>>>>>> Consider testing Java 7 and the G1 GC.
>>>>>>>
>>>>>>> We could get a JNI thread to do this, but no one has done so yet. I
>>>>>>> am personally hoping for G1, and in the meantime we overprovision our
>>>>>>> Xmx to avoid the concurrent mode failures.
>>>>>>>
>>>>>>> -ryan
>>>>>>>
>>>>>>>> On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong <zhongresearch@gmail.com>
>>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ryan,
>>>>>>>>
>>>>>>>> Thank you very much.
>>>>>>>> May I ask whether there are any ways to get around this problem to
>>>>>>>> make HBase more stable?
>>>>>>>>
>>>>>>>> best,
>>>>>>>> zhong
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson <ryanobjc@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> There isn't any working code yet, just an idea and a prototype.
>>>>>>>>> There is some sense that if we can get the G1 GC, we could get rid
>>>>>>>>> of all long pauses and avoid the need for this.
>>>>>>>>>
>>>>>>>>> -ryan
>>>>>>>>>
>>>>>>>>> On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong <zhongresearch@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am very interested in the solution that Joey proposed and would
>>>>>>>>>> like to give it a try.
>>>>>>>>>> Does anyone have any ideas on how to deploy this zk_wrapper in JNI
>>>>>>>>>> integration?
>>>>>>>>>>
>>>>>>>>>> I would really appreciate it.
>>>>>>>>>>
>>>>>>>>>> thanks
>>>>>>>>>> zhong
>>>>>>>>>>
>>>>>>>>>>
> 
