accumulo-user mailing list archives

From Sean Busbey <bus...@clouderagovt.com>
Subject Re: ingest problems
Date Tue, 11 Feb 2014 16:24:25 GMT
Hi Kesten!

Could you tell us:

1) Accumulo version

2) HDFS + ZooKeeper versions

3) are you using the BatchWriter API, or bulk ingest?

4) what does your table design look like?

5) what does your source data look like?

6) what kind of hardware is on these 3 nodes? Memory, disks, CPU cores.

7) could you post your config files (minus any passwords, usernames,
machine names, or instance secrets) in a gist or pastebin so that I can see
them?

8) could you describe a bit what the failure mode looks like? Does the
monitor come up? Does a table remain offline or with unrecovered tablets?

On Feb 11, 2014 10:11 AM, "Kesten Broughton" <kbroughton@21ct.com> wrote:

> Hi there,
>
> We have been experimenting with accumulo for about two months now.  Our
> biggest pain point has been ingest.
> Often the ingest process will fail 2 or 3 times, about 3/4 of the way
> through an ingest, and then on a final try it works without any changes.
>
> Once the ingest works, the cluster is usually stable for querying for
> weeks or months only requiring the occasional start-all.sh if there is a
> problem.
>
> Sometimes our ingest can be 24 hours long, and we need a stronger ingest
> story to be able to commit to accumulo.
> Our cluster architecture has been:
> 3 hdfs datanodes, with the name node, secondary name node and accumulo
> master each colocated with a datanode, and a zookeeper server on each.
> We realize this is not optimal and are transitioning to separate hardware
> for the zookeepers and the name/secondary/accumulo master nodes.
> However, the big concern is that sometimes a failed ingest will bork the
> whole cluster and we have to re-initialize accumulo with an accumulo init,
> which destroys all our data.
> We have experienced this on at least three different clusters of this
> description.
>
> The most recent attempt was on a 65GB dataset.   The cluster had been up
> for over 24 hours.  The ingest test takes 40 mins and about 5 mins in, one
> of the datanodes failed.
> There were no error logs on the failed node, and the two other nodes had
> logs filled with zookeeper connection errors.  We were unable to recover
> the cluster and had to re-init.
>
> I know a vague description of problems is difficult to respond to, and the
> next time we have an ingest failure, I will bring specifics forward.  But
> I’m writing to ask:
> 1.  Are ingest failures a known failure point for accumulo, or are we
> perhaps unlucky/misconfigured?
> 2.  Are there any guidelines for capturing ingest failures / determining
> root causes when errors don’t show up in the logs?
> 3.  Are there any means of checkpointing a data ingest, so that if a
> failure were to occur at hour 23.5 we could roll back to hour 23 and
> continue?  Client code could checkpoint and restart at the last one, but if
> the underlying accumulo cluster can’t be recovered, that’s of no use.
>
> thanks,
>
> kesten
>
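
The client-side checkpointing mentioned in question 3 could look roughly like the sketch below. It is only an illustration under stated assumptions: the names (load_checkpoint, save_checkpoint, ingest_with_checkpoints) and the JSON checkpoint file are hypothetical, and write_batch is a stand-in for whatever actually sends records to the cluster. It is not an Accumulo API.

```python
import json
import os
import tempfile
from typing import Callable, Sequence


def load_checkpoint(path: str) -> int:
    """Return how many records were already acknowledged, or 0 if no checkpoint exists."""
    try:
        with open(path) as f:
            return int(json.load(f)["records_done"])
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, records_done: int) -> None:
    """Persist progress atomically: write a temp file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"records_done": records_done}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def ingest_with_checkpoints(
    records: Sequence,
    write_batch: Callable[[Sequence], None],  # hypothetical writer; must raise on failure
    checkpoint_path: str,
    batch_size: int = 1000,
) -> int:
    """Write records in batches, resuming after the last checkpointed batch."""
    done = load_checkpoint(checkpoint_path)
    while done < len(records):
        batch = records[done:done + batch_size]
        write_batch(batch)  # an exception here leaves the checkpoint untouched
        done += len(batch)
        save_checkpoint(checkpoint_path, done)
    return done
```

Rerunning ingest_with_checkpoints with the same checkpoint file after a failure skips the batches already recorded. A crash between write_batch and save_checkpoint replays one batch, so the write needs to be safe to repeat. As noted in question 3, none of this helps if the cluster itself cannot be recovered.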
