accumulo-dev mailing list archives

From John Vines <>
Subject Re: Accumulo v1.4.1 - ran out of memory and lost data
Date Mon, 28 Jan 2013 14:32:04 GMT
And make sure the loggers didn't fill up their disk.

Sent from my phone, please pardon the typos and brevity.
On Jan 28, 2013 8:54 AM, "Eric Newton" <> wrote:

> What version of accumulo was this?
> So, you have evidence (such as a message in a log) that the tablet server
> ran out of memory?  Can you post that information?
> The ingested data should have been captured in the write-ahead log, and
> recovered when the server was restarted.  There should never be any data
> loss.
> You should be able to ingest like this without a problem.  It is a basic
> test.  "Hold time" is the mechanism by which ingest is pushed back so that
> the tserver can get the data written to disk.  You should not have to
> manually back off.  Also, the tserver dynamically changes the point at
> which it flushes data from memory, so you should see less and less hold
> time.
> The garbage collector cannot run if the METADATA table is not online, or
> has an inconsistent state.
> You are probably seeing a lower number of tablets because not all the
> tablets are online.  They are probably offline due to failed recoveries.
> If you are running Accumulo 1.4, make sure you have stopped and restarted
> all the loggers on the system.
> -Eric
> On Mon, Jan 28, 2013 at 8:28 AM, David Medinets <> wrote:
> > I had a plain Java program, single-threaded, that read an HDFS
> > Sequence File with fairly small Sqoop records (probably under 200
> > bytes each). As each record was read a Mutation was created, then
> > written via Batch Writer to Accumulo. This program was as simple as it
> > gets. Read a record, Write a mutation. The Row Id used YYYYMMDD (a
> > date) so the ingest targeted one tablet. The ingest rate was over 150
> > million entries per hour for about 19 hours. Everything seemed fine. Over
> > 3.5 billion entries were written. Then the nodes ran out of memory and
> > Accumulo nodes went dead. 90% of the server was lost. And data poofed
> > out of existence. Only 800M entries are visible now.
> >
> > We restarted the data node processes and the cluster has been running
> > garbage collection for over 2 days.
> >
> > I did not expect this simple approach to cause an issue. From looking
> > at the log files, I think that at least two compactions were running
> > while the system was still ingesting those 176 million entries per hour. The hold
> > times started rising and eventually the system simply ran out of
> > memory. I have no certainty about this explanation though.
> >
> > My current thinking is to re-initialize Accumulo and find some way to
> > programmatically monitor the hold time, then add a delay to the
> > ingest process whenever the hold time rises above 30 seconds. Does that
> > sound feasible?
> >
> > I know there are other approaches to ingest and I might give up this
> > method and use another. I was trying to get some kind of baseline for
> > analysis reasons with this approach.
> >
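The backoff idea raised in the original message (delay ingest whenever hold time rises above 30 seconds) could be sketched roughly as below. This is only an illustration under stated assumptions: `HoldTimeThrottle`, `delayFor`, and `pauseIfNeeded` are hypothetical names, not part of the Accumulo API, and the hold time is assumed to come from some caller-supplied `LongSupplier`, since the thread does not say how it would be read. Note also that Eric's reply says the tserver's own hold-time mechanism should make manual backoff unnecessary.

```java
import java.util.function.LongSupplier;

/** Hypothetical client-side throttle; NOT part of the Accumulo API. */
final class HoldTimeThrottle {
    // Assumption: the caller supplies some way to read the current hold time in ms.
    private final LongSupplier holdTimeMillis;
    private final long thresholdMillis;
    private final long maxDelayMillis;

    HoldTimeThrottle(LongSupplier holdTimeMillis, long thresholdMillis, long maxDelayMillis) {
        this.holdTimeMillis = holdTimeMillis;
        this.thresholdMillis = thresholdMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Zero at or below the threshold; otherwise the excess over it, capped at maxDelayMillis. */
    static long delayFor(long holdMillis, long thresholdMillis, long maxDelayMillis) {
        if (holdMillis <= thresholdMillis) {
            return 0L;
        }
        return Math.min(holdMillis - thresholdMillis, maxDelayMillis);
    }

    /** Call between writes; sleeps only when hold time exceeds the threshold. Returns the delay chosen. */
    long pauseIfNeeded() {
        long delay = delayFor(holdTimeMillis.getAsLong(), thresholdMillis, maxDelayMillis);
        if (delay > 0) {
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                // Preserve the interrupt so the ingest loop can decide to stop.
                Thread.currentThread().interrupt();
            }
        }
        return delay;
    }
}
```

In the single-threaded read-a-record, write-a-mutation loop described above, `pauseIfNeeded()` would be called once per record (or once per batch) before handing the mutation to the writer, with a threshold of 30000 ms to match the 30 seconds proposed.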
