hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: HBase fail-over/reliability issues
Date Sat, 08 May 2010 04:02:48 GMT
On Fri, May 7, 2010 at 8:27 PM, James Baldassari <jbaldassari@gmail.com> wrote:
> java.io.IOException: Cannot open filename
> /hbase/users/73382377/data/312780071564432169
>
Is this from the regionserver log?  Does it happen while the region is being deployed?  Does the deploy fail?

> Our cluster throughput goes from around 3k requests/second down to 500-1000
> and does not recover without manual intervention.  The region server log for
> that region says:
>
> WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /
> 10.24.166.74:50010 for file /hbase/users/73382377/data/312780071564432169
> for block -4841840178880951849:java.io.IOException: Got error in response to
> OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169 for
> block -4841840178880951849
>
> INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020, call
> get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
> timeRange=[0,9223372036854775807), families={(family=data, columns=ALL})
> from 10.24.117.100:2365: error: java.io.IOException: Cannot open filename
> /hbase/users/73382377/data/312780071564432169
> java.io.IOException: Cannot open filename
> /hbase/users/73382377/data/312780071564432169
>
> The datanode log for 10.24.166.74 says:
>
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.24.166.74:50010, storageID=DS-14401423-10.24.166.74-50010-1270741415211,
> infoPort=50075, ipcPort=50020):
> Got exception while serving blk_-4841840178880951849_50277 to /10.25.119.113
> :
> java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
>
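
To confirm that the regionserver and the datanode are complaining about the same block, the block ID can be pulled out of each log and compared. A rough sketch in Python, assuming only the log formats quoted above; the function name and the regex are made up for illustration and are not part of Hadoop or HBase:

```python
import re

# Hypothetical pattern, written only against the log lines quoted above:
# the DFSClient prints "block <id>", the DataNode prints "blk_<id>_<genstamp>".
BLOCK_RE = re.compile(r"(?:blk_|block )(-?\d+)")


def extract_block_ids(text):
    """Return the set of HDFS block IDs mentioned in a chunk of log text."""
    return {int(m) for m in BLOCK_RE.findall(text)}


# Log excerpts copied from the message above (line wraps joined).
regionserver_log = (
    "WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /"
    "10.24.166.74:50010 for file /hbase/users/73382377/data/312780071564432169 "
    "for block -4841840178880951849:java.io.IOException: Got error in response "
    "to OP_READ_BLOCK"
)
datanode_log = (
    "Got exception while serving blk_-4841840178880951849_50277 to "
    "/10.25.119.113: java.io.IOException: "
    "Block blk_-4841840178880951849_50277 is not valid."
)

# Block IDs that show up in both logs.
shared = extract_block_ids(regionserver_log) & extract_block_ids(datanode_log)
print(shared)
```

Run over the full log files rather than these excerpts, the same intersection would show whether more than one block is affected.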

What's your Hadoop version?  Is it 0.20.2 or CDH?  Any patches applied?


> Running a major compaction on the users table fixed the problem the first
> time it happened, but this time the major compaction didn't fix it, so we're
> in the process of rebooting the whole cluster.  I'm wondering a few things:
>
> 1. What could trigger this problem?
> 2. Why can't the system fail over to another block/file/datanode/region
> server?  We're using 3x replication in HDFS, and we have 8 data nodes which
> double as our region servers.
> 3. Are there any best practices for achieving high availability in an HBase
> cluster?  How can I configure the system to gracefully (and automatically)
> handle these types of problems?
>

Let us know which Hadoop version you're running and then we can dig further into the issues above.
Thanks James,
St.Ack
P.S. It's an eight-node cluster on what kind of hardware? (You've probably
said in the past and I can dig through mail, just say so.) And what kind
of load are you applying? (Ditto if you've said this already.)
