accumulo-notifications mailing list archives

From "Eric Newton (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-4031) consistency check failure
Date Fri, 16 Oct 2015 15:01:05 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960835#comment-14960835 ]

Eric Newton commented on ACCUMULO-4031:
---------------------------------------

Other details:

* just after the bulk import, there was an HA NameNode (NN) fail-over
* the tablet servers recorded unexpected EOFs during compactions
* DfsClient reported read errors

In general, the system was quite stressed. This is not unusual, though.

> consistency check failure
> -------------------------
>
>                 Key: ACCUMULO-4031
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4031
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.6.4
>         Environment: Very large production cluster
>            Reporter: Eric Newton
>
> Sorry for the lack of concrete details, but my logs are not online.
> This system does a lot of bulk ingest.  When it was shut down, a few tablets complained
> about inconsistency of their file list with what was in the metadata table.
> I tracked one down; the others appear to be similar.
> First, the inconsistency was an "extra" bulk import file in the metadata table, which
> was missing from the in-memory list.
> An attempt was made to bulk load the file into the tablet, but the bulk load failed.  It
> failed due to a constraint violation: the bulk transaction was no longer running.
> Except, really, it was. The constraint fired during the update of the tablets' metadata.
> The server of the metadata tablet was having a (brief) connection problem with ZooKeeper,
> which is where the bulk transaction status is stored.
> The importing tablet server saw the constraint violation, and didn't add the file to
> the in-memory list.  However, 2 minutes later, the bulk import was retried, and (consulting
> the metadata table), it claimed the file *was* imported already.
> So, in the intervening 2 minutes, somehow the update was made to the metadata tablet.
>  * No splitting of either tablet occurred during this event.
>  * Neither tablet was moved during this event.
>  * No recovery of the metadata table took place.
>  * The tablet server never reported the file imported.
> I reviewed the handling of constraints, and it looks correct (despite ACCUMULO-4029).
> With ACCUMULO-3327, the tablet server will not reject retries, because it does not re-consult
> the metadata table.
> I don't know how the mutations would be applied without the tablet server reporting the
> file as loaded.
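
The inconsistency described above (a file present in the metadata table but absent from the tablet's in-memory list) amounts to a set difference between two file lists. The following is a minimal sketch of such a comparison, not Accumulo's actual check: the function name `check_file_consistency` and the file paths are hypothetical and chosen only for illustration.

```python
def check_file_consistency(in_memory_files, metadata_files):
    """Compare a tablet's in-memory file list with its metadata entries.

    Returns a pair of sorted lists:
      (files only in the metadata table, files only in memory).
    Both lists are empty when the two views agree.
    """
    in_memory = set(in_memory_files)
    metadata = set(metadata_files)
    extra_in_metadata = sorted(metadata - in_memory)
    missing_from_metadata = sorted(in_memory - metadata)
    return extra_in_metadata, missing_from_metadata


# Hypothetical file names, for illustration only.
in_memory = ["/t-0001/F0000a.rf", "/t-0001/C0000b.rf"]
metadata = ["/t-0001/F0000a.rf", "/t-0001/C0000b.rf", "/b-0002/I0000c.rf"]

extra, missing = check_file_consistency(in_memory, metadata)
print("extra in metadata:", extra)
print("missing from metadata:", missing)
```

In the case reported here, the bulk import file would show up in the first list: recorded in the metadata table, but never added to the in-memory list because the tablet server saw the constraint violation.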



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
