accumulo-notifications mailing list archives

From "Keith Turner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3603) replay mutations to the wrong table?
Date Fri, 20 Feb 2015 18:46:11 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329335#comment-14329335
] 

Keith Turner commented on ACCUMULO-3603:
----------------------------------------

I ran the following experiment with continuous ingest writing to multiple tables. The experiment
was run on EC2 m1.large instances with 17 tserver nodes. I used Accumulo 1.6.1 for the experiment.

 # Created 10 continuous ingest tables
 # Started an ingest client for each table.  Each client ran on a separate node.
 # Added 15 splits to each table, resulting in 16 tablets per table.
 # Let tablets spread such that most tablet servers had a tablet from each table.
 # Started agitation, killing a tserver every few minutes.
 # Stopped ingest and agitation after 5 hours. There were 51 tserver processes killed and 11
datanode processes killed. Each table had 650 million key values on average. The smallest table
had 538 million and the largest had 666 million. The tables had all split to 32 tablets while
the test was running.
 # Ran CI verify M/R job for each table.  All verified successfully.
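
As a rough illustration of the split step above, here is one way 15 evenly spaced split points (giving 16 tablets) could be computed. This is a sketch only: it assumes rows are 16-digit zero-padded hex strings over a non-negative 63-bit range, which is an assumption about the continuous ingest row format and not something stated in this comment. The helper name {{even_splits}} is hypothetical.

```python
def even_splits(num_tablets, max_row=2**63 - 1, width=16):
    """Return num_tablets - 1 split points dividing [0, max_row] evenly.

    Assumption: rows are zero-padded lowercase hex strings of `width` digits.
    """
    if num_tablets < 1:
        raise ValueError("num_tablets must be at least 1")
    step = (max_row + 1) // num_tablets
    # Split i marks the end of tablet i, so there is one fewer split than tablets.
    return [format(i * step, "0{}x".format(width)) for i in range(1, num_tablets)]


# 15 splits -> 16 tablets per table, as in the experiment.
splits = even_splits(16)
```

Because the strings are fixed-width, they sort lexicographically in the same order as the numbers they encode, so the list is already in ascending order.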

For this experiment, every time a tserver was killed, its write ahead logs had tablets for
multiple tables. If a tablet had recovered data from another tablet, that could have caused
verification to fail (if the data was in the tablet's range).
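
To make the failure mode being tested for concrete, here is a toy model of a shared write ahead log. It is not Accumulo's recovery code, and all names in it are hypothetical. It only shows that a replay which matches on row range alone, without checking which table an entry was written for, would pull in a foreign entry whose row happens to fall inside the recovering tablet's range.

```python
from collections import namedtuple

# Toy WAL entry: the table it was written for, plus its row and value.
LogEntry = namedtuple("LogEntry", ["table", "row", "value"])
# A tablet covers rows in (prev_end_row, end_row] of exactly one table.
Tablet = namedtuple("Tablet", ["table", "prev_end_row", "end_row"])


def in_range(tablet, row):
    """True if row falls in the tablet's (prev_end_row, end_row] range."""
    return tablet.prev_end_row < row <= tablet.end_row


def recover(tablet, wal, check_table=True):
    """Replay WAL entries for a tablet; check_table=False models the suspected bug."""
    return [e for e in wal
            if in_range(tablet, e.row)
            and (not check_table or e.table == tablet.table)]


wal = [
    LogEntry("ci1", "0a", "x"),  # belongs to the recovering tablet
    LogEntry("ci2", "0b", "y"),  # other table, row inside the tablet's range
    LogEntry("ci2", "9f", "z"),  # other table, row outside the tablet's range
]
tablet = Tablet("ci1", "00", "10")

correct = recover(tablet, wal)                    # only the ci1 entry
faulty = recover(tablet, wal, check_table=False)  # also picks up ci2's "0b"
foreign = [e for e in faulty if e.table != tablet.table]
```

In this model the out-of-range entry never shows up in either replay, which matches the caveat above that misplaced data would only matter if it fell within the tablet's range.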

> replay mutations to the wrong table?
> ------------------------------------
>
>                 Key: ACCUMULO-3603
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3603
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.6.1
>            Reporter: Eric Newton
>            Assignee: Eric Newton
>            Priority: Critical
>             Fix For: 1.6.3
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> A user writes to the mailing list:
> {quote}
> Few times I noticed that some tables have values they cannot have, and
> those entries have timestamp close to a tabletserver failure time.
> (I mean wrong format, one table has msgpack values at least 10 bytes
> long and another table has 1-byte values and after a failure I read
> one or two 1-byte values in the table where I expect to read msgpack).
> I suspect that during recovery process, when WAL is being read, some
> entries are inserted to a wrong table.
> May be it is a know bug as I am still using Accumulo 1.6.1
> {quote}
> Consider adding multiple tables to the continuous ingest test to reproduce.
> Note that the random walk test certainly does this, and no failures have been observed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
