From: Christopher
To: user@accumulo.apache.org, Accumulo Dev List
Reply-To: user@accumulo.apache.org
Date: Wed, 18 Feb 2015 18:42:53 -0500
Subject: Re: Values go to a wrong table during recovery.

Hi Denis,

This doesn't sound like a known bug to me. Your hypothesis is reasonable, since WALs use a surrogate ID, which maps to table ID/tablet information when read back. It is possible that recovery incorrectly interprets this mapping and replays data into the wrong table. Given the amount of testing we do, my instinct is that this is unlikely, but if we can confirm this bug, it would definitely be a critical one.

To rule out some scenarios, is it possible that your clients are writing to the wrong tables? Have you ever seen a failure affecting a table which does not exist (like what might happen if there's an off-by-one error in the WAL code)? Or affecting the metadata tables?

Can you reproduce this error reliably, or can you share the relevant ingest code which can reproduce this failure? Also, what kind of tablet server failures are you experiencing when this happens?

If you could file a bug report at https://issues.apache.org/browse/ACCUMULO with any details and/or attachments to help us address the issue, we would greatly appreciate it. This seems like something we'd want to fix pretty quickly.

Thanks!

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

On Wed, Feb 18, 2015 at 6:26 PM, Denis wrote:
> Hello.
>
> A few times I noticed that some tables have values they cannot have, and
> those entries have a timestamp close to a tabletserver failure time.
> (I mean wrong format: one table has msgpack values at least 10 bytes
> long and another table has 1-byte values, and after a failure I read
> one or two 1-byte values in the table where I expect to read msgpack.)
>
> I suspect that during the recovery process, when the WAL is being read,
> some entries are inserted into the wrong table.
>
> Maybe it is a known bug, as I am still using Accumulo 1.6.1.
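
For a bug report, it may help to list exactly which entries look misplaced. The following is only a rough sketch, assuming the entries of the msgpack table have already been exported from a scan as (row, timestamp, value) records. The names `Entry`, `suspicious_entries`, and `window_ms` are hypothetical and are not part of the Accumulo API. The 10-byte minimum comes from the description in the original report, and the one-minute window around the failure time is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List

# From the report: valid msgpack values in this table are at least 10 bytes long.
MIN_MSGPACK_LEN = 10


@dataclass(frozen=True)
class Entry:
    """One exported key/value pair (hypothetical record type, not an Accumulo class)."""
    row: str
    timestamp_ms: int
    value: bytes


def suspicious_entries(
    entries: Iterable[Entry],
    failure_time_ms: int,
    window_ms: int = 60_000,  # assumed tolerance around the failure time
) -> List[Entry]:
    """Return entries that are too short to be the expected msgpack values
    and whose timestamp lies within window_ms of the tablet server failure."""
    return [
        e
        for e in entries
        if len(e.value) < MIN_MSGPACK_LEN
        and abs(e.timestamp_ms - failure_time_ms) <= window_ms
    ]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        Entry("r1", 1_000_000, b"\x92" + b"x" * 12),  # long enough, looks valid
        Entry("r2", 1_000_500, b"\x01"),              # 1 byte, close to the failure
        Entry("r3", 5_000_000, b"\x01"),              # 1 byte, far from the failure
    ]
    for entry in suspicious_entries(sample, failure_time_ms=1_000_000):
        print(entry.row, entry.timestamp_ms, entry.value)
```

Attaching the rows and timestamps found this way, together with the tablet server failure times, would make it easier to tell a client-side write to the wrong table apart from a replay problem.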