hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Region server shutting down due to HDFS error
Date Thu, 05 Apr 2012 13:52:31 GMT
Thanks for writing back.

I guess you meant 'things are now operating well', below :-)

On Thu, Apr 5, 2012 at 6:25 AM, Eran Kutner <eran@gigya.com> wrote:

> As promised I'm writing back to update the list.
> Seems that after upgrading to cdh3u3 of the hadoop cluster and zookeeper
> ensemble (hadoop alone wasn't enough) things are no operating well with no
> HDFS errors in the logs. I've also set
> hbase.regionserver.logroll.errors.tolerated to 3 just in case. Now that the
> log is clean a new exception shows up but I'll open a separate thread about
> it.
>
> Thanks everyone.
>
> -eran
>
>
>
> On Wed, Mar 28, 2012 at 23:06, Eran Kutner <eran@gigya.com> wrote:
>
> > hmmm... I couldn't find it either, so I've looked at the history of that
> > file and sure enough a few check-ins back it had that message.
> > I have no idea how something like this could happen. I know I had some
> > merge issues when I first got the latest version and built that project,
> > but I've then reverted all local changes and rebuilt. The only thing I can
> > imagine is that the previous compiled class file was not modified and it
> > was the one that got included in the JAR, although I don't really know
> > how that can happen.
> >
> > -eran
> >
> >
> >
> > On Wed, Mar 28, 2012 at 18:53, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> >> Eran:
> >> The error indicates a ZooKeeper-related issue.
> >> Do you see a KeeperException after the error log?
> >>
> >> I searched the 0.90 codebase but couldn't find the exact log phrase:
> >>
> >> zhihyu$ find src/main -name '*.java' -exec grep "getting node's version in CLOSI" {} \; -print
> >> zhihyu$ find src/main -name '*.java' -exec grep 'Error getting ' {} \; -print
> >>
> >> Cheers
> >>
> >> On Wed, Mar 28, 2012 at 9:45 AM, Eran Kutner <eran@gigya.com> wrote:
> >>
> >> > I don't see any prior HDFS issues in the 15 minutes before this exception.
> >> > The logs on the datanode reported as problematic are clean as well.
> >> > However, I now see the log is full of errors like this:
> >> > 2012-03-28 00:15:05,358 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of gs_users,731481|Sn쒪㝨眳ԫ䂣⫰==,1331226388691.29929cb2200b3541ead85e17b836ade5.
> >> > 2012-03-28 00:15:05,359 WARN org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Error getting node's version in CLOSING state, aborting close of gs_users,731481|Sn쒪㝨眳ԫ䂣⫰==,1331226388691.29929cb2200b3541ead85e17b836ade5.
> >> >
> >> > -eran
> >> >
> >> >
> >> >
> >> > On Wed, Mar 28, 2012 at 18:38, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >> >
> >> > > Any chance we can see what happened before that too? Usually you
> >> > > should see a lot more HDFS spam before getting the message that all
> >> > > the datanodes are bad.
> >> > >
> >> > > J-D
> >> > >
> >> > > On Wed, Mar 28, 2012 at 4:28 AM, Eran Kutner <eran@gigya.com> wrote:
> >> > > > Hi,
> >> > > >
> >> > > > We have region servers sporadically stopping under load, supposedly
> >> > > > due to errors writing to HDFS. Things like:
> >> > > >
> >> > > > 2012-03-28 00:37:11,210 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
> >> > > > java.io.IOException: All datanodes 10.1.104.10:50010 are bad. Aborting..
> >> > > >
> >> > > > It's happening with a different region server and data node every time,
> >> > > > so it's not a problem with one specific server, and there doesn't seem
> >> > > > to be anything really wrong with either of them. I've already increased
> >> > > > the file descriptor limit, datanode xceivers and data node handler count.
> >> > > > Any idea what can be causing these errors?
> >> > > >
> >> > > >
> >> > > > A more complete log is here: http://pastebin.com/wC90xU2x
> >> > > >
> >> > > > Thanks.
> >> > > >
> >> > > > -eran
> >> > >
> >> >
> >>
> >
> >
>
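
For anyone chasing the same symptoms, here is a minimal sketch of how one might tally the error signatures quoted in this thread per minute of a region server log, to see whether the HDFS and ZooKeeper-side messages cluster together. Only the log phrases come from the messages above; the labels, the function name summarize and the per-minute bucketing are assumptions made for illustration. It is plain Python and does not call any HBase or HDFS API.

```python
import re
import sys
from collections import Counter

# Log phrases quoted in this thread. The dictionary keys are hypothetical
# labels chosen for this sketch, not names used by HBase or Hadoop.
SIGNATURES = {
    "all_datanodes_bad": re.compile(r"All datanodes \S+ are bad"),
    "close_region_aborted": re.compile(
        r"Error getting node's version in CLOSING state"
    ),
    "keeper_exception": re.compile(r"KeeperException"),
}

# Matches the leading "2012-03-28 00:37:11,210 " style timestamp and
# captures it down to the minute.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3} ")


def summarize(lines):
    """Count signature hits, keyed by (minute, label).

    Continuation lines without a timestamp (for example the
    java.io.IOException line under a WARN entry) are attributed to the
    most recent timestamp seen, or to None if there was none yet.
    """
    counts = Counter()
    minute = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            minute = match.group(1)
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[(minute, label)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Region names can contain arbitrary bytes, so decode leniently.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for (minute, label), n in sorted(
            summarize(log).items(), key=lambda item: str(item[0])
        ):
            print(minute, label, n)
```

Run it with the path to a region server log as the only argument. Grouping by minute is just one convenient granularity for lining the hits up against datanode and ZooKeeper logs from the same period.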
