hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: HBase Exceptions on version 0.20.1
Date Fri, 23 Oct 2009 18:46:29 GMT
> 239 "Block blk_-xxx is not valid" errors,
> 522 "BlockInfo not found in volumeMap" errors,
> and 208 "BlockAlreadyExistsException"

I assume since you say they were found in Hadoop logs that these
appeared in the datanode and/or namenode logs. If not, and instead
these are from HBase logs, please correct my understanding. It seems to me that your HDFS
is sick. That's not particularly helpful, I know, but HBase is a client application of HDFS
and depends on its good functioning. Have you talked with anyone on, or mailed the logs to,
hdfs-user@hadoop.apache.org? If so, what did they say?

> Are there plans to make hbase more resilient to load-based failures?

Yes, this is definitely something we have done and continue to do. It's hard if you can't
trust your filesystem. 
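
A minimal client-side sketch, assuming the 0.20-era Java API and a hypothetical table named
"testtable": raise hbase.client.retries.number so the client rides out brief region moves and
slow region servers, and catch RetriesExhaustedException so a put that still fails is surfaced
to the caller instead of being silently dropped.

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RetriesExhaustedException;
import org.apache.hadoop.hbase.util.Bytes;

public class PutWithRetries {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration();
    // Give the client more retries before it gives up on a loaded cluster.
    conf.setInt("hbase.client.retries.number", 20);

    HTable table = new HTable(conf, "testtable");   // hypothetical table name
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
    try {
      table.put(put);
    } catch (RetriesExhaustedException e) {
      // The write did not make it in; requeue or log it rather than losing it.
      System.err.println("put failed after retries: " + e.getMessage());
    }
  }
}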

Once I tried running HBase on top of KFS instead of HDFS. KFS did seem slower, as is the conventional
wisdom, but I had bigger problems: the chunkservers would randomly abort on my x86_64 nodes,
and even after I gave Sriram system access for gdb stack dumps, there was no clear resolution.
On the other hand, if you get it working, it has working sync and append. HDFS won't have
a fully working sync until 0.21. YMMV. 

   - Andy





________________________________
From: elsif <elsif.then@gmail.com>
To: hbase-user@hadoop.apache.org
Sent: Wed, October 21, 2009 8:16:40 AM
Subject: Re: HBase Exceptions on version 0.20.1


While running the test on this cluster of 14 servers, the highest loads
I see are 3.68 (0.0% wa) on the master node and 2.65 (3.4% wa) on the
node serving the .META. region.  All the machines are on a single
gigabit switch dedicated to the cluster.  The highest throughput between
nodes has been 21.4 MB/s Rx on the node hosting the .META. region.

There are 239 "Block blk_-xxx is not valid" errors, 522 "BlockInfo not
found in volumeMap" errors, and 208 "BlockAlreadyExistsException" found
in the hadoop logs over 12 hours of running the test.

I understand that I am loading the cluster -- that is the point of the test, but I don't think
that this should result in data loss.  Failed inserts at the client level I can handle, but loss
of data that was previously thought to be stored in hbase is a major issue.  Are there plans to
make hbase more resilient to load-based failures?
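
A minimal sketch of separating those two failure modes, assuming the 0.20-era Java client, a
hypothetical table named "loadtest", and a column family "cf": record the puts the client knows
failed, then read everything back after the load and count acknowledged rows that have gone
missing.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadBackCheck {
  private static final byte[] FAMILY = Bytes.toBytes("cf");     // assumed column family
  private static final byte[] QUALIFIER = Bytes.toBytes("q");

  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "loadtest"); // hypothetical table
    int rows = 100000;
    Set<Integer> failed = new HashSet<Integer>();

    // Load phase: remember which inserts the client *knows* failed.
    for (int i = 0; i < rows; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(FAMILY, QUALIFIER, Bytes.toBytes("v-" + i));
      try {
        table.put(put);
      } catch (IOException e) {
        failed.add(i);
      }
    }

    // Verify phase: rows that were acknowledged during the load but can no
    // longer be read back are silent losses, as opposed to failed inserts.
    int missing = 0;
    for (int i = 0; i < rows; i++) {
      if (failed.contains(i)) {
        continue;
      }
      if (table.get(new Get(Bytes.toBytes("row-" + i))).isEmpty()) {
        missing++;
      }
    }
    System.out.println(failed.size() + " failed inserts, "
        + missing + " acknowledged rows missing on read-back");
  }
}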

Regards,
elsif

Andrew Purtell wrote:
> The reason JG points to load as being a problem is that all signs point to it: this is usually
> the culprit behind DFS "no live block" errors -- the namenode is too busy and/or falling behind,
> or the datanodes are falling behind, or actually failing. Also, in the log snippets you provide,
> HBase is complaining about writes to DFS (for the WAL) taking in excess of 2 seconds. Also
> highly indicative of load, write load. Shortly after this, ZooKeeper sessions begin expiring,
> which is also usually indicative of overloading -- heartbeats miss their deadline.
>
> When I see these signs on my test clusters, I/O wait is generally in excess of 40%.
>
> If your total CPU load is really just a few % (user + system + iowait), then I'd suggest
> you look at the storage layer. Is there anything in the datanode logs that seems like it might
> be relevant?
>
> What about the network? Gigabit? Any potential sources of contention? Are you tracking
> network utilization metrics during the test?
>
> Also, you might consider using Ganglia to monitor and correlate system metrics and HBase
> and HDFS metrics during your testing, if you are not doing this already.
>
>    - Andy
>
>  


      