accumulo-dev mailing list archives

From Eric Newton <>
Subject Re: Accumulo v1.4.1 - ran out of memory and lost data
Date Mon, 28 Jan 2013 13:53:38 GMT
What version of accumulo was this?

So, you have evidence (such as a message in a log) that the tablet server
ran out of memory?  Can you post that information?

The ingested data should have been captured in the write-ahead log, and
recovered when the server was restarted.  There should never be any data
loss.

You should be able to ingest like this without a problem.  It is a basic
test.  "Hold time" is the mechanism by which ingest is pushed back so that
the tserver can get the data written to disk.  You should not have to
manually back off.  Also, the tserver dynamically changes the point at
which it flushes data from memory, so you should see less and less hold
time as ingest continues.
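
The hold-time backpressure described above can be illustrated with a
hypothetical, self-contained sketch (this is not the real tserver code;
class and method names here are invented for illustration): writes land in
an in-memory buffer, and when the buffer is full the writer is "held" until
a flush drains it to disk.

```java
// Hypothetical sketch of hold-time backpressure (not actual Accumulo code).
public class HoldTimeSketch {
    private final long capacityBytes;
    private long inMemoryBytes = 0;
    private long totalHoldMillis = 0;

    HoldTimeSketch(long capacityBytes) { this.capacityBytes = capacityBytes; }

    // Accept a write; if memory is full, "hold" the client while a
    // minor-compaction-style flush frees space, and track the hold time.
    synchronized void write(long bytes) throws InterruptedException {
        while (inMemoryBytes + bytes > capacityBytes) {
            long start = System.nanoTime();
            flushToDisk();                                        // frees memory
            totalHoldMillis += (System.nanoTime() - start) / 1_000_000;
        }
        inMemoryBytes += bytes;
    }

    private void flushToDisk() throws InterruptedException {
        Thread.sleep(5);      // stand-in for the cost of writing a file to disk
        inMemoryBytes = 0;    // buffer drained
    }

    synchronized long heldMillis() { return totalHoldMillis; }
}
```

The point of the sketch is that hold time is a server-side throttle: the
client simply blocks inside `write`, which is why manual backoff should not
be needed.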

The garbage collector cannot run if the METADATA table is not online, or
has an inconsistent state.

You are probably seeing a lower number of tablets because not all the
tablets are online.  They are probably offline due to failed recoveries.

If you are running Accumulo 1.4, make sure you have stopped and restarted
all the loggers on the system.


On Mon, Jan 28, 2013 at 8:28 AM, David Medinets <> wrote:

> I had a plain Java program, single-threaded, that read an HDFS
> Sequence File with fairly small Sqoop records (probably under 200
> bytes each). As each record was read a Mutation was created, then
> written via Batch Writer to Accumulo. This program was as simple as it
> gets. Read a record, Write a mutation. The Row Id used YYYYMMDD (a
> date) so the ingest targeted one tablet. The ingest rate was over 150
> million entries for about 19 hours. Everything seemed fine. Over 3.5
> Billion entries were written. Then the nodes ran out of memory and
> Accumulo nodes went dead. 90% of the servers were lost. And data poofed
> out of existence. Only 800M entries are visible now.
> We restarted the data node processes and the cluster has been running
> garbage collection for over 2 days.
> I did not expect this simple approach to cause an issue. From looking
> at the log files, I think that at least two compactions were being run
> while still ingesting those 176 million entries per hour. The hold
> times started rising and eventually the system simply ran out of
> memory. I have no certainty about this explanation though.
> My current thinking is to re-initialize Accumulo and find some way to
> programmatically monitor the hold time, then add a delay to the
> ingest process whenever the hold time rises over 30 seconds. Does that
> sound feasible?
> I know there are other approaches to ingest and I might give up this
> method and use another. I was trying to get some kind of baseline for
> analysis reasons with this approach.
