accumulo-user mailing list archives

From "Slater, David M." <David.Sla...@jhuapl.edu>
Subject Performance during node failure
Date Fri, 08 Nov 2013 19:53:15 GMT
Hi all,

I have an 8-node cluster (1 name node, 7 data nodes) running Accumulo 1.4.2, ZooKeeper 3.3.6,
and Hadoop 1.0.3, and I have it optimized for ingest performance. My question is
how to make performance degrade gracefully under node failure.

1) When nodes fail, I assume that Accumulo needs to migrate the affected tablets and
Hadoop needs to re-replicate the underlying data blocks. This seems to have a rather catastrophic
effect on ingest rates. Is there a way to migrate tablets more gradually (starting with the
more active ones) and to replicate data blocks more gradually, so as not to interfere with
ingestion as severely?

2) What happens to a BatchWriter when a tablet server it is attempting to write to fails?
Will I need to start catching MutationsRejectedException, will it block, or is there
some other failure mode?
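
To make question 2 concrete, here is a minimal sketch of the catch-and-retry handling I would otherwise wrap around each write. Everything in it (MutationSink, RejectedWriteException, writeWithRetry, the backoff values) is a hypothetical stand-in and not the Accumulo API; it only models "a write that may be rejected" and does not say how a real BatchWriter behaves when a tablet server dies.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins, NOT Accumulo classes: they only model a write that may be rejected.
class IngestRetrySketch {

    /** Hypothetical checked exception modelling a rejected write. */
    static class RejectedWriteException extends Exception {
        RejectedWriteException(String message) { super(message); }
    }

    /** Hypothetical sink modelling the part of a writer that accepts one record. */
    interface MutationSink {
        void write(String mutation) throws RejectedWriteException;
    }

    /**
     * Tries the write up to maxAttempts times, sleeping backoffMillis * attempt between tries.
     * Returns the number of attempts used; rethrows the last rejection if every attempt fails.
     */
    static int writeWithRetry(MutationSink sink, String mutation, int maxAttempts, long backoffMillis)
            throws RejectedWriteException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RejectedWriteException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                sink.write(mutation);
                return attempt;
            } catch (RejectedWriteException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt); // linear backoff before the next try
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Simulated sink that rejects the first two writes, as if a server were briefly unavailable.
        MutationSink flaky = m -> {
            if (calls.incrementAndGet() <= 2) {
                throw new RejectedWriteException("simulated failure");
            }
        };
        int attempts = writeWithRetry(flaky, "row1", 5, 10L);
        System.out.println("succeeded after " + attempts + " attempts");
    }
}
```

If the real writer already retries or blocks internally, a wrapper like this would be redundant, which is part of what I am trying to find out.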

3) I believe this is a separate issue from node failure, but I was seeing some very odd ZooKeeper
behavior involving a number of timeouts. I currently have ZooKeeper running on all 7 data
nodes, with the BatchWriters running on the name node. Basically, I was getting a number of
the following:
client session timed out ...
opening socket connection
socket connection established
session establishment complete
...
client session timed out ...
repeat

I would also occasionally get
session expired for /accumulo/fe7...
as well as
ZooKeeper.KeeperException$ConnectionLoss
Exception: KeeperErrorCode = ConnectionLoss
for /accumulo/f37.../tables/3b/state
at accumulo.core.zookeeper.ZooCache$2.run
accumulo.core.zookeeper.ZooCache.retry
accumulo.core.zookeeper.ZooCache.get
core.clientimpl.tables.getTableState
core.clientimpl.multiTableBatchWriter.getBatchWriter
myIngestorProcess.run

Does anyone know if this is an Accumulo problem, a Zookeeper problem, or something else (network
overly busy, etc.)?

Thanks,
David


