hbase-user mailing list archives

From "Geoff Hendrey" <ghend...@decarta.com>
Subject HBase always corrupted
Date Wed, 07 Apr 2010 16:55:07 GMT
Hi,
 
I am running an HBase instance in pseudo-distributed mode, on top of a
pseudo-distributed HDFS, on a single machine. I have a 10-node map/reduce
cluster that is using a TableMapper to drive a map/reduce job. In the
map phase, two Gets are executed against HBase. The map phase
generates two orders of magnitude more data than was pumped in, and in
the reduce phase we do some consolidation of the generated data, then
execute a Put into HBase with autocommit=false and the batch size set to
100,000 (I tried 1,000 and 10,000 as well and found 100,000 worked best).
I am using 32 reducers, and reduce seems to run 1000X slower than mapping.
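
For concreteness, here is roughly what that reduce-side write path looks
like with the HTable client API of that era (0.20.x). This is a minimal
sketch only; the table name "mytable" and the column "cf:q" are
placeholders I made up, not the real schema:

// Minimal sketch, assuming the HBase 0.20.x client API. Table name
// "mytable" and column "cf:q" are placeholders, not the real schema.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedHBaseWriter {
    private static final int BATCH_SIZE = 100000;     // the batch size that worked best

    private final HTable table;
    private final List<Put> buffer = new ArrayList<Put>(BATCH_SIZE);

    public BufferedHBaseWriter() throws IOException {
        table = new HTable(new HBaseConfiguration(), "mytable");
        table.setAutoFlush(false);                     // "autocommit=false": no flush per Put
    }

    // Called once per consolidated reduce record.
    public void write(byte[] rowKey, byte[] value) throws IOException {
        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), value);
        buffer.add(put);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    // Ships the buffered Puts to the region server in one batch.
    // Also called once from the reducer's cleanup() to drain the remainder.
    public void flush() throws IOException {
        table.put(buffer);
        table.flushCommits();
        buffer.clear();
    }
}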
 
Unfortunately, the job consistently crashes at around 85% reduce
completion, with HDFS-related errors from the HBase machine:
 
java.io.IOException: java.io.IOException: All datanodes 127.0.0.1:50010 are bad. Aborting...
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2525)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2078)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2241)
So I am clearly aware of the mismatch between the big map/reduce
cluster and the wimpy HBase installation, but why am I seeing
consistent crashes? Shouldn't the HBase cluster just be slower, not
unreliable?
Here is my main question: should I expect that running a "real" HBase
cluster will solve my problems? And does anyone have experience with a
map/reduce job that pumps several billion rows into HBase?
-geoff
