Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@accumulo.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAP6MKsubiaGipfr0AANnp9vRsf229e1=wSfTQJ_WxEXJWpzppQ@mail.gmail.com>
References: 
 <CAP6MKsubiaGipfr0AANnp9vRsf229e1=wSfTQJ_WxEXJWpzppQ@mail.gmail.com>
Date: Wed, 18 Feb 2015 18:42:53 -0500
Message-ID: 
 <CAL5zq9Z3_sdPOLd+JknEM08b8d_Z+2N86inSes07dUfonoW71w@mail.gmail.com>
Subject: Re: Values go to a wrong table during recovery.
From: Christopher <ctubbsii@apache.org>
To: user@accumulo.apache.org, Accumulo Dev List <dev@accumulo.apache.org>
Content-Type: multipart/alternative; boundary=001a11c1672ccbadeb050f656035

--001a11c1672ccbadeb050f656035
Content-Type: text/plain; charset=UTF-8

Hi Denis,

This doesn't sound like a known bug to me. Your hypothesis is reasonable:
WAL entries carry a surrogate ID, which is mapped back to table ID/tablet
information when the log is read. It is possible that recovery interprets
this mapping incorrectly and replays data into the wrong table. Given the
amount of testing we do, my instinct is that this is unlikely, but if we
can confirm it, it would definitely be a critical bug.
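
To make the surrogate ID idea concrete, here is a rough Java sketch of a
replay loop. It is a toy model, not Accumulo's recovery code: the class,
the Entry type, and the two record kinds are names made up for
illustration. It shows one hypothetical way a mapping could go wrong (a
surrogate ID being rebound to a different tablet before a later mutation
is replayed); it is not a diagnosis of what happened on your cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of WAL replay. All names are hypothetical, not Accumulo classes.
class WalReplaySketch {

    enum Kind { DEFINE_TABLET, MUTATION }

    // One log record. A DEFINE_TABLET record binds a surrogate ID to a tablet;
    // a MUTATION record carries only the surrogate ID plus its data.
    static final class Entry {
        final Kind kind;
        final int surrogateId;
        final String payload; // tablet name for DEFINE_TABLET, value for MUTATION

        Entry(Kind kind, int surrogateId, String payload) {
            this.kind = kind;
            this.surrogateId = surrogateId;
            this.payload = payload;
        }
    }

    // Replays the log in order and returns tablet name -> recovered values.
    static Map<String, List<String>> replay(List<Entry> log) {
        Map<Integer, String> idToTablet = new HashMap<>();
        Map<String, List<String>> recovered = new HashMap<>();
        for (Entry e : log) {
            if (e.kind == Kind.DEFINE_TABLET) {
                // Later definitions silently overwrite earlier ones in this toy.
                idToTablet.put(e.surrogateId, e.payload);
            } else {
                String tablet = idToTablet.get(e.surrogateId);
                if (tablet == null) {
                    throw new IllegalStateException(
                        "mutation for undefined surrogate ID " + e.surrogateId);
                }
                recovered.computeIfAbsent(tablet, k -> new ArrayList<>())
                         .add(e.payload);
            }
        }
        return recovered;
    }

    public static void main(String[] args) {
        List<Entry> log = Arrays.asList(
            new Entry(Kind.DEFINE_TABLET, 1, "tableA"),
            new Entry(Kind.MUTATION, 1, "value-for-A"),
            new Entry(Kind.DEFINE_TABLET, 1, "tableB"),      // ID 1 rebound
            new Entry(Kind.MUTATION, 1, "late-value-for-A")); // written for tableA
        System.out.println(replay(log));
    }
}
```

With this input, the last mutation is filed under tableB even though the
writer meant it for tableA, which is the kind of symptom you describe.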

To rule out some scenarios, is it possible that your clients are writing to
the wrong tables? Have you ever seen a failure affecting a table which does
not exist (like what might happen if there's an off-by-one error in the WAL
code)? Or affecting the metadata tables?

Can you reproduce this error reliably, or can you share the relevant ingest
code which can reproduce this failure? Also, what kind of tablet server
failures are you experiencing when this happens?

If you could file a bug report at https://issues.apache.org/browse/ACCUMULO
with any details and/or attachments to help us address the issue, we would
greatly appreciate it. This seems like something we'd want to fix pretty
quickly.
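
To collect details for the report, it may help to list exactly which
entries look misplaced. Below is a minimal Java sketch of such a check,
based on your description that the msgpack values are at least 10 bytes
long. The class and method names are hypothetical, and the map of row to
value bytes is a stand-in for whatever you read back from the table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Accumulo.
class ValueFormatCheck {

    // Returns the keys whose values are shorter than minLength bytes.
    static List<String> findShortValues(Map<String, byte[]> entries, int minLength) {
        List<String> suspicious = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : entries.entrySet()) {
            if (e.getValue().length < minLength) {
                suspicious.add(e.getKey());
            }
        }
        return suspicious;
    }

    public static void main(String[] args) {
        // Stand-in data: one plausible msgpack-sized value, one 1-byte value.
        Map<String, byte[]> entries = new LinkedHashMap<>();
        entries.put("row1", new byte[12]);
        entries.put("row2", new byte[1]);
        System.out.println(findShortValues(entries, 10));
    }
}
```

Recording the timestamps of the flagged entries next to the tablet server
failure times would make the correlation you noticed easier to verify.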

Thanks!


--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

On Wed, Feb 18, 2015 at 6:26 PM, Denis <denis@camfex.cz> wrote:

> Hello.
>
> A few times I have noticed that some tables contain values they cannot
> have, and those entries have timestamps close to a tabletserver failure
> time. (I mean the wrong format: one table has msgpack values at least 10
> bytes long and another table has 1-byte values, and after a failure I
> read one or two 1-byte values in the table where I expect to read
> msgpack.)
>
> I suspect that during the recovery process, when the WAL is being read,
> some entries are inserted into the wrong table.
>
> Maybe it is a known bug, as I am still using Accumulo 1.6.1.
>


--001a11c1672ccbadeb050f656035--