Date: Mon, 28 Jan 2013 11:24:35 -0500
Subject: Re: Accumulo v1.4.1 - ran out of memory and lost data (RESOLVED - Data was restored)
From: David Medinets
To: dev@accumulo.apache.org, vines@apache.org

Accumulo fully recovered when I restarted the loggers. Very impressive.

On Mon, Jan 28, 2013 at 9:32 AM, John Vines wrote:
> And make sure the loggers didn't fill up their disk.
>
> Sent from my phone, please pardon the typos and brevity.
> On Jan 28, 2013 8:54 AM, "Eric Newton" wrote:
>
>> What version of accumulo was this?
>>
>> So, you have evidence (such as a message in a log) that the tablet server
>> ran out of memory? Can you post that information?
>>
>> The ingested data should have been captured in the write-ahead log, and
>> recovered when the server was restarted. There should never be any data
>> loss.
>>
>> You should be able to ingest like this without a problem. It is a basic
>> test. "Hold time" is the mechanism by which ingest is pushed back so that
>> the tserver can get the data written to disk. You should not have to
>> manually back off. Also, the tserver dynamically changes the point at
>> which it flushes data from memory, so you should see less and less hold
>> time.
>>
>> The garbage collector cannot run if the METADATA table is not online, or
>> has an inconsistent state.
>>
>> You are probably seeing a lower number of tablets because not all the
>> tablets are online. They are probably offline due to failed recoveries.
>>
>> If you are running Accumulo 1.4, make sure you have stopped and restarted
>> all the loggers on the system.
>>
>> -Eric
>>
>> On Mon, Jan 28, 2013 at 8:28 AM, David Medinets wrote:
>>
>> > I had a plain Java program, single-threaded, that read an HDFS
>> > Sequence File with fairly small Sqoop records (probably under 200
>> > bytes each). As each record was read, a Mutation was created, then
>> > written via BatchWriter to Accumulo. This program was as simple as it
>> > gets: read a record, write a mutation. The row id used YYYYMMDD (a
>> > date), so the ingest targeted one tablet. The ingest rate was over 150
>> > million entries per hour for about 19 hours. Everything seemed fine.
>> > Over 3.5 billion entries were written. Then the nodes ran out of
>> > memory and the Accumulo processes on them died. 90% of the servers
>> > were lost, and data poofed out of existence. Only 800 million entries
>> > are visible now.
>> >
>> > We restarted the data node processes, and the cluster has been running
>> > garbage collection for over 2 days.
>> >
>> > I did not expect this simple approach to cause an issue. From looking
>> > at the log files, I think that at least two compactions were running
>> > while we were still ingesting those 176 million entries per hour. The
>> > hold times started rising, and eventually the system simply ran out of
>> > memory. I have no certainty about this explanation, though.
>> >
>> > My current thinking is to re-initialize Accumulo and find some way to
>> > programmatically monitor the hold time, then add a delay to the
>> > ingest process whenever the hold time rises over 30 seconds. Does that
>> > sound feasible?
>> >
>> > I know there are other approaches to ingest, and I might give up this
>> > method and use another. I was trying to get some kind of baseline for
>> > analysis reasons with this approach.
>> >
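The ingest path described in the thread (read a record, build a Mutation, hand it to a BatchWriter) can be sketched against the Accumulo 1.4 client API roughly as below. The instance name, ZooKeeper host, credentials, table name, and cell values are placeholders, not details taken from the thread; a real run would also need the Accumulo and Hadoop client jars on the classpath.

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class SimpleIngest {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- substitute your cluster's own.
        Connector conn = new ZooKeeperInstance("instance", "zkhost:2181")
                .getConnector("user", "password".getBytes());

        // 1.4-style BatchWriter: 50 MB buffer, 60 s max latency, 4 threads.
        BatchWriter bw = conn.createBatchWriter("mytable", 50000000L, 60000L, 4);
        try {
            // One mutation per record; a YYYYMMDD row id funnels all
            // writes into a single tablet, as the original poster saw.
            Mutation m = new Mutation(new Text("20130128"));
            m.put(new Text("cf"), new Text("cq"), new Value("payload".getBytes()));
            bw.addMutation(m);
        } finally {
            bw.close(); // flushes any buffered mutations to the tservers
        }
    }
}
```

Because every mutation shares the date row id, all of the ingest load lands on the one tserver hosting that tablet, which is consistent with the rising hold times reported above.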
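The hold-time backoff proposed at the end of the thread could look something like the sketch below. Where the hold-time reading comes from is deliberately left abstract (a `LongSupplier`); pulling it from the master's monitoring stats is an assumption of this sketch, not something the thread specifies, and `IngestThrottle` is a hypothetical helper, not Accumulo API.

```java
import java.util.function.LongSupplier;

// Hypothetical helper: decides how long the ingest loop should sleep
// based on the most recent tserver hold-time reading.
public class IngestThrottle {
    private final LongSupplier holdTimeMillis; // source of readings (abstracted)
    private final long thresholdMillis;        // e.g. 30 s, per the proposal

    public IngestThrottle(LongSupplier holdTimeMillis, long thresholdMillis) {
        this.holdTimeMillis = holdTimeMillis;
        this.thresholdMillis = thresholdMillis;
    }

    // Pause proportionally to how far hold time exceeds the threshold,
    // capped at one minute so the loop never stalls indefinitely.
    public long pauseMillis() {
        long hold = holdTimeMillis.getAsLong();
        if (hold <= thresholdMillis) {
            return 0L;
        }
        return Math.min(hold - thresholdMillis, 60000L);
    }
}
```

The ingest loop would call `pauseMillis()` between batches and `Thread.sleep()` for the returned duration, so ingest slows down instead of piling more mutations onto a tserver that is already holding writes.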