accumulo-user mailing list archives

From Denis <de...@camfex.cz>
Subject Re: Values go to a wrong table during recovery.
Date Fri, 20 Feb 2015 20:33:17 GMT
> If you can provide a test client that has ever replicated the problem, please attach
> it to the ticket.

I have seen it 3 times within a month, so I do not know how
to reproduce it reliably.
Perhaps I should back up the walogs next time and look into them.

> Is this the exact same cluster or is it just the same code you were using?

Same code, another cluster.

> Did you have walogs laying around when you upgraded?

In the 1.4-cluster (and at first in the 1.6-cluster) I had walogs enabled
for data tables and disabled for index tables.

There was a bug in 1.4: if a tablet had an empty walog, there were some
startup issues (the tablet remained offline or something like this), and it
happened often with index tables (hmm, the same tables I have this
problem with).

So, in the 1.4-cluster I disabled walog and ran a full reindex periodically.
After running the 1.6-cluster for some time I enabled walogs for all tables, as
the new cluster has less reliable hardware, which reboots from time
to time.

> Did you upgrade through 1.5 or straight from 1.4 to 1.6?

From 1.4 to 1.6. But it was not an upgrade; it was a copy of .rf files to a
new cluster and then importdirectory.

On 2/20/15, John Vines <vines@apache.org> wrote:
> You said that you were operating this on 1.4. Is this the exact same
> cluster or is it just the same code you were using? Did you have walogs
> laying around when you upgraded? Did you upgrade through 1.5 or straight
> from 1.4 to 1.6?
>
> On Fri, Feb 20, 2015 at 1:46 PM, Keith Turner <keith@deenlo.com> wrote:
>
>> I updated ACCUMULO-3603 w/ details about an experiment I ran.
>>
>> On Wed, Feb 18, 2015 at 9:44 PM, Eric Newton <eric.newton@gmail.com>
>> wrote:
>>
>>> https://issues.apache.org/jira/browse/ACCUMULO-3603
>>>
>>> -Eric
>>>
>>>
>>> On Wed, Feb 18, 2015 at 7:12 PM, Denis <denis@camfex.cz> wrote:
>>>
>>>> On 2/18/15, Christopher <ctubbsii@apache.org> wrote:
>>>>
>>>> > To rule out some scenarios, is it possible that your clients are writing to
>>>> > the wrong tables?
>>>> That was the first idea, so I added assert()'s to the code of the
>>>> writers a few days ago. No assert was triggered, but some invalid values
>>>> appeared after a new tserver failure.
>>>>
>>>> > Have you ever seen a failure affecting a table which does
>>>> > not exist (like what might happen if there's an off-by-one error in the WAL
>>>> > code)? Or affecting the metadata tables?
>>>> No.
>>>> Also, no tables were created or deleted during the last two months.
>>>>
>>>> > Can you reproduce this error reliably, or can you share the relevant ingest
>>>> > code which can reproduce this failure?
>>>>
>>>> I will think about how to reproduce it.
>>>> What could be special about the code: inserts are performed to a few
>>>> (5..8) tables at once (one data table + a few index tables) but no
>>>> MultiTableBatchWriter is used. A few BatchWriters (one per table) are
>>>> created and flushed consecutively, in the same thread. For Accumulo
>>>> 1.4 it was a performance optimization; it worked faster than
>>>> MultiTableBatchWriter. Not sure if it is so for 1.6.1; this code was
>>>> not changed after migration to 1.6.1.
>>>> In all cases with invalid values the index tables were affected (one
>>>> of the index tables had values typical for another of the index
>>>> tables).
>>>>
>>>> > Also, what kind of tablet server failures are you experiencing when
>>>> > this happens?
>>>> Spontaneous power-offs. There is something wrong with the power units
>>>> so every 2-3 days one of the servers suddenly turns off and reboots.
>>>>
>>>
>>>
>>
>
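
The write pattern described in the thread (one writer per table, flushed one
after another in the same thread, with an assertion guarding the target table)
can be sketched roughly as follows. This is a minimal, self-contained Java
model under stated assumptions: `TableWriter` and `flushAll` are hypothetical
stand-ins invented for illustration and are not the Accumulo `BatchWriter`
API, and the table names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PerTableWriters {

    /** Hypothetical stand-in for a per-table writer; NOT the Accumulo BatchWriter API. */
    static final class TableWriter {
        private final String table;
        private final List<String> buffer = new ArrayList<>();
        private final List<String> flushed = new ArrayList<>();

        TableWriter(String table) {
            this.table = table;
        }

        /** Buffers an entry, rejecting anything intended for a different table. */
        void add(String intendedTable, String entry) {
            if (!table.equals(intendedTable)) {
                throw new IllegalStateException(
                        "entry for " + intendedTable + " sent to writer for " + table);
            }
            buffer.add(entry);
        }

        /** Moves everything buffered so far into the flushed list. */
        void flush() {
            flushed.addAll(buffer);
            buffer.clear();
        }

        List<String> flushed() {
            return flushed;
        }
    }

    /** Flushes each writer in turn, in the calling thread. */
    static void flushAll(List<TableWriter> writers) {
        for (TableWriter w : writers) {
            w.flush();
        }
    }

    public static void main(String[] args) {
        // Made-up table names: one data table plus one index table.
        TableWriter data = new TableWriter("data");
        TableWriter indexA = new TableWriter("index_a");

        data.add("data", "row1 -> payload");
        indexA.add("index_a", "term -> row1");

        flushAll(Arrays.asList(data, indexA));
        System.out.println(data.flushed());
        System.out.println(indexA.flushed());
    }
}
```

A guard like this only checks the client side. In the thread, no assert fired,
yet values typical of one index table still appeared in another after a
tserver failure.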
