accumulo-user mailing list archives

From Josh Elser
Subject Re: ingest problems
Date Tue, 11 Feb 2014 16:28:49 GMT
On 2/11/14, 11:10 AM, Kesten Broughton wrote:
> Hi there,
> We have been experimenting with accumulo for about two months now.  Our
> biggest painpoint has been on ingest.
> Often we will have ingest process fail 2 or 3 times 3/4 of the way
> through an ingest and then on a final try it works, without any changes.

Funny, most times I hear that people consider Accumulo to handle ingest 
fairly well, but let's see what we can do to help.

We need a bit more information than what you provided here though: 
what's your "ingest process"? Are you using some other workflow library? 
Are you running MapReduce? Do you just have a Java class with a main 
method that uses a BatchWriter?

The fact that it "works sometimes" implies that the problem might be 
resource related.

> Once the ingest works, the cluster is usually stable for querying for
> weeks or months only requiring the occasional if there is a
> problem.
> Sometimes our ingest can be 24 hours long, and we need a stronger ingest
> story to be able to commit to accumulo.

You should be able to run ingest 24/7 with Accumulo without it falling 
over (I do regularly to stress-test it). The limitation should only be 
the disk-space you have available.

> Our cluster architecture has been:
> 3 hdfs datanodes overlaid with name node, secondary nn and accumulo
> master each collocated with a datanode, and a zookeeper server on each.
> We realize this is not optimal and are transitioning to separate
> hardware for zookeepers and name/secondary/accumulomaster nodes.
> However, the big concern is that sometimes a failed ingest will bork the
> whole cluster and we have to re-init accumulo with an accumulo init
> destroying all our data.
> We have experienced this on at least three different clusters of this
> description.

Can you be more specific than "bork the whole cluster"? Unless you're 
hitting a really nasty bug, there shouldn't be any way that a client 
writing data into Accumulo will destroy an instance.

> The most recent attempt was on a 65GB dataset.   The cluster had been up
> for over 24 hours.  The ingest test takes 40 mins and about 5 mins
> in, one of the datanodes failed.
> There were no error logs on the failed node, and the two other nodes had
> logs filled with zookeeper connection errors.  We were unable to recover
> the cluster and had to re-init.

Check both the log4j logs and the stdout/stderr redirection files for 
the datanode process. Typically, if you get an OutOfMemoryError (OOME), 
log4j gets torn down before that exception can be printed to the normal 
log files. "Silent" failures usually indicate either a lack of physical 
resources on the box (the node is over-subscribed) or insufficient 
resources given to the processes (-Xmx was too small for the process).

> I know a vague description of problems is difficult to respond to, and
> the next time we have an ingest failure, i will bring specifics forward.
>   But I’m writing to know if
> 1.  Ingest failures are a known fail point for accumulo, or if we are
> perhaps unlucky/mis-configured.

No -- something else is going on here.

> 2.  Are there any guidelines for capturing ingest failures / determining
> root causes when errors don’t show up in the logs

For any help request, be sure to gather the Accumulo, Hadoop and 
ZooKeeper versions, plus the OS and Java versions. Capturing the log 
files and the stdout/stderr files is important; beware that restarting 
the Accumulo process on that node will overwrite the stdout/stderr 
files, so copy them out of the way first.

> 3.  Are there any means of checkpointing a data ingest, so that if a
> failure were to occur at hour 23.5 we could roll back to hour 23 and
> continue.  Client code could checkpoint and restart at the last one, but
> if the underlying accumulo cluster can’t be recovered, that’s of no use.

You can do anything you want in your client ingest code :)

Assuming that you're using a BatchWriter, if you manually call flush() 
and it returns without an Exception, you can assume that all data 
written with that BatchWriter instance up to that point is "ingested". 
This can easily be extrapolated: if you're ingesting CSV files, call 
flush() every 1000 lines and record the line count somewhere, so that 
your ingest process can advance to the appropriate place in the CSV 
file and proceed from where it left off.
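
The flush-and-record idea above could be sketched roughly like this. It 
is only an illustration under some assumptions: the Sink interface is a 
hypothetical stand-in for the two BatchWriter calls involved (in real 
code, add() would build a Mutation and call addMutation(), and flush() 
would call BatchWriter.flush()), and the checkpoint is just a line count 
stored in a local file whose name and location are up to you.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CheckpointedIngest {

    /** Hypothetical stand-in for the BatchWriter calls used by the ingest loop. */
    interface Sink {
        void add(String csvLine) throws Exception; // real code: addMutation(...)
        void flush() throws Exception;             // real code: BatchWriter.flush()
    }

    /** Returns the number of lines already flushed, or 0 if no checkpoint exists. */
    static long readCheckpoint(Path checkpoint) throws IOException {
        if (!Files.exists(checkpoint)) {
            return 0L;
        }
        String s = new String(Files.readAllBytes(checkpoint), StandardCharsets.UTF_8).trim();
        return s.isEmpty() ? 0L : Long.parseLong(s);
    }

    /** Records how many lines are known to be flushed (simple overwrite, not atomic). */
    static void writeCheckpoint(Path checkpoint, long lines) throws IOException {
        Files.write(checkpoint, Long.toString(lines).getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Ingests the CSV, skipping lines covered by the checkpoint, and records
     * progress only after flush() has returned without throwing.
     * Returns the total number of lines read from the file.
     */
    static long ingest(BufferedReader csv, Sink sink, Path checkpoint, int flushEvery)
            throws Exception {
        long done = readCheckpoint(checkpoint);
        long lineNo = 0;
        String line;
        while ((line = csv.readLine()) != null) {
            lineNo++;
            if (lineNo <= done) {
                continue; // flushed during an earlier run
            }
            sink.add(line);
            if (lineNo % flushEvery == 0) {
                sink.flush();
                writeCheckpoint(checkpoint, lineNo);
            }
        }
        if (lineNo > done) {
            sink.flush(); // cover the final partial batch
            writeCheckpoint(checkpoint, lineNo);
        }
        return lineNo;
    }
}
```

If the process dies between a successful flush() and the checkpoint 
write, the next run re-sends up to one batch of lines, so this sketch 
assumes that replaying the same lines is harmless for your data.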

> thanks,
> kesten
