Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CANH3+J1hzG0j3VW6hPztwNNpYrg5rSzKgBDdN-YB7o4GSZ049g@mail.gmail.com>
References: 
 <CANH3+J0+Y41vMrG=oh53KFbOF7gocK2xD-HJKCPcp1TEtHM9uQ@mail.gmail.com>
	<CAGpTDNc8RdoMpB+oFj4eoKFb=FM=0pAQk-Oi=mtERgKA9BmbMA@mail.gmail.com>
	<CANH3+J1hzG0j3VW6hPztwNNpYrg5rSzKgBDdN-YB7o4GSZ049g@mail.gmail.com>
Date: Wed, 28 Mar 2012 09:53:28 -0700
Message-ID: 
 <CALte62zdKNDtQZbYheiJL3OoTyt6O+B5sfq3dtsXqiNAOqLXbw@mail.gmail.com>
Subject: Re: Region server shutting down due to HDFS error
From: Ted Yu <yuzhihong@gmail.com>
To: user@hbase.apache.org
Content-Type: multipart/alternative; boundary=f46d043c06bc57b8bb04bc50726f

--f46d043c06bc57b8bb04bc50726f
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Eran:
The error indicates a ZooKeeper-related issue.
Do you see a KeeperException after the error log?
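
One way to check, as a minimal sketch: the helper below prints any KeeperException lines that appear within a few lines after each CloseRegionHandler warning quoted below. The function name keeper_after_warning, the default window of 20 lines, and the log path in the usage line are assumptions for illustration, not part of HBase.

```shell
# Hypothetical helper (not part of HBase): show KeeperException lines that
# follow the "Error getting node's version" warning in a region server log.
#   $1 = path to the log file
#   $2 = number of lines to inspect after each warning (default 20, an assumption)
keeper_after_warning() {
  log_file="$1"
  context="${2:-20}"
  # grep -A prints each matching line plus the next $context lines;
  # the second grep keeps only the lines that mention KeeperException.
  grep -A "$context" "Error getting node's version" "$log_file" | grep "KeeperException"
}
```

Usage would look like `keeper_after_warning /path/to/regionserver.log 50`, where the path is a placeholder. No output means no KeeperException was logged within that window after the warning.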

I searched the 0.90 codebase but couldn't find the exact log phrase:

zhihyu$ find src/main -name '*.java' -exec grep "getting node's version in CLOSI" {} \; -print
zhihyu$ find src/main -name '*.java' -exec grep 'Error getting ' {} \; -print

Cheers

On Wed, Mar 28, 2012 at 9:45 AM, Eran Kutner <eran@gigya.com> wrote:

> I don't see any prior HDFS issues in the 15 minutes before this exception.
> The logs on the datanode reported as problematic are clean as well.
> However, I now see the log is full of errors like this:
> 2012-03-28 00:15:05,358 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of gs_users,731481|Sn=EC=92=AA=EF=88=AF=E3=9D=A8=E7=9C=B3=D4=AB=E4=82=A3=E2=AB=B0=3D=3D,1331226388691.29929cb2200b3541ead85e17b836ade5.
> 2012-03-28 00:15:05,359 WARN org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Error getting node's version in CLOSING state, aborting close of gs_users,731481|Sn=EC=92=AA=EF=88=AF=E3=9D=A8=E7=9C=B3=D4=AB=E4=82=A3=E2=AB=B0=3D=3D,1331226388691.29929cb2200b3541ead85e17b836ade5.
>
> -eran
>
>
>
> On Wed, Mar 28, 2012 at 18:38, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>
> > Any chance we can see what happened before that too? Usually you
> > should see a lot more HDFS spam before getting the error that all the
> > datanodes are bad.
> >
> > J-D
> >
> > On Wed, Mar 28, 2012 at 4:28 AM, Eran Kutner <eran@gigya.com> wrote:
> > > Hi,
> > >
> > > We have region servers sporadically stopping under load, supposedly due to
> > > errors writing to HDFS. Things like:
> > >
> > > 2012-03-28 00:37:11,210 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
> > > java.io.IOException: All datanodes 10.1.104.10:50010 are bad. Aborting..
> > >
> > > It's happening with a different region server and data node every time, so
> > > it's not a problem with one specific server, and there doesn't seem to be
> > > anything really wrong with either of them. I've already increased the file
> > > descriptor limit, datanode xceivers and data node handler count. Any idea
> > > what could be causing these errors?
> > >
> > >
> > > A more complete log is here: http://pastebin.com/wC90xU2x
> > >
> > > Thanks.
> > >
> > > -eran
> >
>

--f46d043c06bc57b8bb04bc50726f--