hbase-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Region server shutting down due to HDFS error
Date Wed, 28 Mar 2012 15:21:26 GMT
Eran,

For 0.90.7-SNAPSHOT, set "hbase.regionserver.logroll.errors.tolerated"
to a value > 0 (0 is the default). This will help the RS survive transient
HLog sync failures (with the local DN) by retrying a few times before it
decides to shut itself down.
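
As a rough sketch, that would be an entry like the one below in
hbase-site.xml on each region server. The value 2 is only an illustrative
assumption, not a tested recommendation, and the RS will most likely need a
restart to pick it up:

```xml
<!-- hbase-site.xml fragment; goes inside the existing <configuration> element -->
<property>
  <name>hbase.regionserver.logroll.errors.tolerated</name>
  <!-- Assumed example value; pick what suits your cluster -->
  <value>2</value>
</property>
```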

Also worth investigating whether there was too much IO load, etc. on the
box that led to the DN throwing an error during sync().

P.S. The fix from https://issues.apache.org/jira/browse/HBASE-4222
will also be in CDH3u4.

On Wed, Mar 28, 2012 at 8:39 PM, Eran Kutner <eran@gigya.com> wrote:
> Hi Jimmy,
> HBase is built from the latest sources of the 0.90 branch (0.90.7-SNAPSHOT);
> I had the same problem with 0.90.4.
> Hadoop 0.20.2 from Cloudera CDH3u1
>
> This failure happens during large M/R jobs. I have 10 servers and usually
> no more than one fails like this, sometimes none.
> One thing worth mentioning is that the table it is trying to write to has
> over 5000 regions.
>
> -eran
>
>
>
> On Wed, Mar 28, 2012 at 16:17, Jimmy Xiang <jxiang@cloudera.com> wrote:
>
>> Which version of HDFS and HBase are you using?
>>
>> When the problem happens, can you access the HDFS, for example, from
>> hadoop dfs?
>>
>> Thanks,
>> Jimmy
>>
>> On Wed, Mar 28, 2012 at 4:28 AM, Eran Kutner <eran@gigya.com> wrote:
>> > Hi,
>> >
>> > We have region servers sporadically stopping under load, supposedly due
>> > to errors writing to HDFS. Things like:
>> >
>> > 2012-03-28 00:37:11,210 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
>> > java.io.IOException: All datanodes 10.1.104.10:50010 are bad. Aborting..
>> >
>> > It's happening with a different region server and data node every time,
>> > so it's not a problem with one specific server and there doesn't seem to
>> > be anything really wrong with either of them. I've already increased the
>> > file descriptor limit, datanode xceivers and data node handler count. Any
>> > idea what can be causing these errors?
>> >
>> >
>> > A more complete log is here: http://pastebin.com/wC90xU2x
>> >
>> > Thanks.
>> >
>> > -eran
>>



-- 
Harsh J
