Date: Mon, 28 Jan 2013 11:24:35 -0500
Subject: Re: Accumulo v1.4.1 - ran out of memory and lost data (RESOLVED - Data was restored)
From: David Medinets
To: dev@accumulo.apache.org, vines@apache.org

Accumulo fully recovered when I restarted the loggers. Very impressive.

On Mon, Jan 28, 2013 at 9:32 AM, John Vines wrote:
> And make sure the loggers didn't fill up their disk.
>
> Sent from my phone, please pardon the typos and brevity.
> On Jan 28, 2013 8:54 AM, "Eric Newton" wrote:
>
>> What version of accumulo was this?
>>
>> So, you have evidence (such as a message in a log) that the tablet server
>> ran out of memory? Can you post that information?
>>
>> The ingested data should have been captured in the write-ahead log, and
>> recovered when the server was restarted. There should never be any data
>> loss.
>>
>> You should be able to ingest like this without a problem. It is a basic
>> test. "Hold time" is the mechanism by which ingest is pushed back so that
>> the tserver can get the data written to disk. You should not have to
>> manually back off. Also, the tserver dynamically changes the point at
>> which it flushes data from memory, so you should see less and less hold
>> time.
>>
>> The garbage collector cannot run if the METADATA table is not online, or
>> has an inconsistent state.
>>
>> You are probably seeing a lower number of tablets because not all the
>> tablets are online. They are probably offline due to failed recoveries.
>>
>> If you are running Accumulo 1.4, make sure you have stopped and restarted
>> all the loggers on the system.
>>
>> -Eric
>>
>> On Mon, Jan 28, 2013 at 8:28 AM, David Medinets wrote:
>>
>> > I had a plain Java program, single-threaded, that read an HDFS
>> > Sequence File with fairly small Sqoop records (probably under 200
>> > bytes each). As each record was read, a Mutation was created, then
>> > written via BatchWriter to Accumulo. This program was as simple as it
>> > gets: read a record, write a mutation. The row id used YYYYMMDD (a
>> > date), so the ingest targeted one tablet. The ingest rate was over 150
>> > million entries per hour for about 19 hours. Everything seemed fine.
>> > Over 3.5 billion entries were written. Then the nodes ran out of
>> > memory and the Accumulo processes on them died. 90% of the servers
>> > were lost, and data poofed out of existence. Only 800 million entries
>> > are visible now.
>> >
>> > We restarted the data node processes, and the cluster has been running
>> > garbage collection for over 2 days.
>> >
>> > I did not expect this simple approach to cause an issue. From looking
>> > at the log files, I think that at least two compactions were running
>> > while we were still ingesting those 176 million entries per hour. The
>> > hold times started rising, and eventually the system simply ran out of
>> > memory. I have no certainty about this explanation, though.
>> >
>> > My current thinking is to re-initialize Accumulo and find some way to
>> > programmatically monitor the hold time, then add a delay to the
>> > ingest process whenever the hold time rises over 30 seconds. Does that
>> > sound feasible?
>> >
>> > I know there are other approaches to ingest, and I might give up this
>> > method and use another. I was trying to get some kind of baseline for
>> > analysis reasons with this approach.
>> >
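The ingest path described in the thread (read a record, build a Mutation, hand it to a BatchWriter) can be sketched against the Accumulo 1.4 client API roughly as below. The instance name, ZooKeeper host, credentials, table name, and cell values are placeholders, not details taken from the thread; a real run would also need the Accumulo and Hadoop client jars on the classpath.

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class SimpleIngest {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- substitute your cluster's own.
        Connector conn = new ZooKeeperInstance("instance", "zkhost:2181")
                .getConnector("user", "password".getBytes());

        // 1.4-style BatchWriter: 50 MB buffer, 60 s max latency, 4 threads.
        BatchWriter bw = conn.createBatchWriter("mytable", 50000000L, 60000L, 4);
        try {
            // One mutation per record; a YYYYMMDD row id funnels all
            // writes into a single tablet, as the original poster saw.
            Mutation m = new Mutation(new Text("20130128"));
            m.put(new Text("cf"), new Text("cq"), new Value("payload".getBytes()));
            bw.addMutation(m);
        } finally {
            bw.close(); // flushes any buffered mutations to the tservers
        }
    }
}
```

Because every mutation shares the date row id, all of the ingest load lands on the one tserver hosting that tablet, which is consistent with the rising hold times reported above.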
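The hold-time backoff proposed at the end of the thread could look something like the sketch below. Where the hold-time reading comes from is deliberately left abstract (a `LongSupplier`); pulling it from the master's monitoring stats is an assumption of this sketch, not something the thread specifies, and `IngestThrottle` is a hypothetical helper, not Accumulo API.

```java
import java.util.function.LongSupplier;

// Hypothetical helper: decides how long the ingest loop should sleep
// based on the most recent tserver hold-time reading.
public class IngestThrottle {
    private final LongSupplier holdTimeMillis; // source of readings (abstracted)
    private final long thresholdMillis;        // e.g. 30 s, per the proposal

    public IngestThrottle(LongSupplier holdTimeMillis, long thresholdMillis) {
        this.holdTimeMillis = holdTimeMillis;
        this.thresholdMillis = thresholdMillis;
    }

    // Pause proportionally to how far hold time exceeds the threshold,
    // capped at one minute so the loop never stalls indefinitely.
    public long pauseMillis() {
        long hold = holdTimeMillis.getAsLong();
        if (hold <= thresholdMillis) {
            return 0L;
        }
        return Math.min(hold - thresholdMillis, 60000L);
    }
}
```

The ingest loop would call `pauseMillis()` between batches and `Thread.sleep()` for the returned duration, so ingest slows down instead of piling more mutations onto a tserver that is already holding writes.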