hbase-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: HBase fail-over/reliability issues
Date Sat, 08 May 2010 04:30:01 GMT
If you can grep for '4841840178880951849' as well
as /hbase/users/73382377/data/312780071564432169 across all of your datanode
logs plus your NN, and put that online somewhere, that would be great. If
you can grep with -C 20 to get some context that would help as well.

Grepping for the region in question (73382377) in the RS logs would also be
helpful.
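
A minimal sketch of the search described above, for anyone who has already copied the logs to one machine. It behaves roughly like `grep -F -C 20` with both search terms: any line containing the block ID or the store file path is kept, along with 20 lines of context on either side. The function names and the idea of passing log file paths in are assumptions for illustration, not part of any HBase or Hadoop tooling.

```python
from pathlib import Path

# Search terms taken from the message above.
BLOCK_ID = "4841840178880951849"
STORE_FILE = "/hbase/users/73382377/data/312780071564432169"


def grep_context(lines, needles, context=20):
    """Return (line_number, text) pairs for each line containing any needle,
    plus `context` lines before and after it (plain substring matching)."""
    keep = set()
    for i, line in enumerate(lines):
        if any(needle in line for needle in needles):
            first = max(0, i - context)
            last = min(len(lines), i + context + 1)
            keep.update(range(first, last))
    # Line numbers are 1-based, as grep -n would report them.
    return [(i + 1, lines[i]) for i in sorted(keep)]


def scan_files(paths, context=20):
    """Apply grep_context to each log file; `paths` are hypothetical local
    copies of the datanode and NameNode logs."""
    results = []
    for path in paths:
        lines = Path(path).read_text(errors="replace").splitlines()
        for lineno, text in grep_context(lines, [BLOCK_ID, STORE_FILE], context):
            results.append(f"{path}:{lineno}:{text}")
    return results
```

Printing the result of `scan_files(...)` for each node's logs into a single file gives something that can be posted online in one piece. The same helper can be pointed at the region server logs with `73382377` as the only search term.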

Thanks
-Todd

On Fri, May 7, 2010 at 9:16 PM, James Baldassari <jbaldassari@gmail.com> wrote:

> On Sat, May 8, 2010 at 12:02 AM, Stack <stack@duboce.net> wrote:
>
> > On Fri, May 7, 2010 at 8:27 PM, James Baldassari <jbaldassari@gmail.com>
> > wrote:
> > > java.io.IOException: Cannot open filename
> > > /hbase/users/73382377/data/312780071564432169
> > >
> > This is the regionserver log?  Is this deploying the region?  It fails?
> >
>
> This error is on the client accessing HBase.  This exception was thrown on a
> get call to an HTable instance.  I'm not sure if it was deploying the
> region.  All I know is that the system had been running with all regions
> available (as far as I know), and then all of a sudden these errors started
> showing up on the client.
>
>
> >
> > > Our cluster throughput goes from around 3k requests/second down to 500-1000
> > > and does not recover without manual intervention.  The region server log for
> > > that region says:
> > >
> > > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.24.166.74:50010
> > > for file /hbase/users/73382377/data/312780071564432169
> > > for block -4841840178880951849:java.io.IOException: Got error in response to
> > > OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169 for
> > > block -4841840178880951849
> > >
> > > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020, call
> > > get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
> > > timeRange=[0,9223372036854775807), families={(family=data, columns=ALL})
> > > from 10.24.117.100:2365: error: java.io.IOException: Cannot open filename
> > > /hbase/users/73382377/data/312780071564432169
> > > java.io.IOException: Cannot open filename
> > > /hbase/users/73382377/data/312780071564432169
> > >
> > > The datanode log for 10.24.116.74 says:
> > >
> > > WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> > > 10.24.166.74:50010, storageID=DS-14401423-10.24.166.74-50010-1270741415211,
> > > infoPort=50075, ipcPort=50020):
> > > Got exception while serving blk_-4841840178880951849_50277 to /10.25.119.113:
> > > java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
> > >
> >
> > What's your hadoop?  Is it 0.20.2 or CDH?  Any patches?
> >
>
> Hadoop is vanilla CDH 2.  HBase is 0.20.3 + HBase-2180
>
>
> >
> >
> > > Running a major compaction on the users table fixed the problem the first
> > > time it happened, but this time the major compaction didn't fix it, so we're
> > > in the process of rebooting the whole cluster.  I'm wondering a few things:
> > >
> > > 1. What could trigger this problem?
> > > 2. Why can't the system fail over to another block/file/datanode/region
> > > server?  We're using 3x replication in HDFS, and we have 8 data nodes which
> > > double as our region servers.
> > > 3. Are there any best practices for achieving high availability in an HBase
> > > cluster?  How can I configure the system to gracefully (and automatically)
> > > handle these types of problems?
> > >
> >
> > Let us know what your hadoop is and then we figure more on the issues
> > above.
> >
>
> If you need complete stack traces or any additional information, please let
> me know.
>
>
> > Thanks James,
> > St.Ack
> > P.S. It's an eight-node cluster on what kinda hw? (You've probably said in
> > the past and I can dig through mail -- just say -- and then what kind
> > of loading are you applying?  Ditto for if you've said this already)
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera
