Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (nike.apache.org: domain of harsh@cloudera.com designates
 209.85.160.169 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CANH3+J1Lt7EY0_AR9HBzxJJzUP2N24WknUX13mptHXxzVpuOYg@mail.gmail.com>
References: 
 <CANH3+J0+Y41vMrG=oh53KFbOF7gocK2xD-HJKCPcp1TEtHM9uQ@mail.gmail.com>
 <CAOmV22tZNPSuUFPojpa1xEYUjioewi_id1mcqAz7Nu1rsoPsEg@mail.gmail.com>
 <CANH3+J1Lt7EY0_AR9HBzxJJzUP2N24WknUX13mptHXxzVpuOYg@mail.gmail.com>
From: Harsh J <harsh@cloudera.com>
Date: Wed, 28 Mar 2012 20:51:26 +0530
Message-ID: 
 <CAOcnVr0-mBD519UHudJ-bqAcSsshq5ngwUt-STnad4mzLYDF6Q@mail.gmail.com>
Subject: Re: Region server shutting down due to HDFS error
To: user@hbase.apache.org
Content-Type: text/plain; charset=ISO-8859-1

Eran,

For 0.90.7-SNAPSHOT, set "hbase.regionserver.logroll.errors.tolerated"
to a value greater than 0 (0 is the default). This will help the RS
survive transient HLog sync failures (with the local DN) by retrying a
few times before it decides to shut itself down.
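
As a rough sketch, the entry in hbase-site.xml on each region server
would look something like the following. The value 2 is only an
illustrative assumption, not a tested recommendation, so pick whatever
suits your cluster. The RS would need a restart to pick it up.

```xml
<!-- hbase-site.xml: tolerate a few consecutive HLog errors before the RS aborts -->
<property>
  <name>hbase.regionserver.logroll.errors.tolerated</name>
  <!-- 2 is an assumed example value; 0 means abort on the first failure -->
  <value>2</value>
</property>
```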

It is also worth investigating whether there was too much IO load, etc.
on the box that led to the DN throwing an error during sync().

P.S. The fix from https://issues.apache.org/jira/browse/HBASE-4222
will also be in CDH3u4.

On Wed, Mar 28, 2012 at 8:39 PM, Eran Kutner <eran@gigya.com> wrote:
> Hi Jimmy,
> HBase is built from the latest sources of the 0.90 branch (0.90.7-SNAPSHOT); I had
> the same problem with 0.90.4.
> Hadoop is 0.20.2 from Cloudera CDH3u1.
>
> This failure happens during large M/R jobs. I have 10 servers and usually
> no more than 1 fails like this, sometimes none.
> One thing worth mentioning is that the table it is trying to write to has
> over 5000 regions.
>
> -eran
>
>
>
> On Wed, Mar 28, 2012 at 16:17, Jimmy Xiang <jxiang@cloudera.com> wrote:
>
>> Which version of HDFS and HBase are you using?
>>
>> When the problem happens, can you access the HDFS, for example, from
>> hadoop dfs?
>>
>> Thanks,
>> Jimmy
>>
>> On Wed, Mar 28, 2012 at 4:28 AM, Eran Kutner <eran@gigya.com> wrote:
>> > Hi,
>> >
>> > We have region servers sporadically stopping under load, supposedly due to
>> > errors writing to HDFS. Things like:
>> >
>> > 2012-03-28 00:37:11,210 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
>> > java.io.IOException: All datanodes 10.1.104.10:50010 are bad. Aborting..
>> >
>> > It's happening with a different region server and data node every time, so
>> > it's not a problem with one specific server and there doesn't seem to be
>> > anything really wrong with either of them. I've already increased the file
>> > descriptor limit, datanode xceivers and data node handler count. Any idea
>> > what can be causing these errors?
>> >
>> >
>> > A more complete log is here: http://pastebin.com/wC90xU2x
>> >
>> > Thanks.
>> >
>> > -eran
>>


-- 
Harsh J