accumulo-notifications mailing list archives

From "Christopher Tubbs (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-4542) Tablet left in bad state after bulk import timeout
Date Tue, 03 Jan 2017 22:53:58 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796461#comment-15796461 ]

Christopher Tubbs commented on ACCUMULO-4542:
---------------------------------------------

This seems really hard to reproduce. [~kturner] tells me he believes there is a final check
before the file is moved to the failures directory, and that it might do a copy instead of a
move if assignment has failed for some tablets but not others (in the case of the file
overlapping several tablets). If he's right, then it's possible there was a failure reading
the metadata table to confirm, and the system treated this failure to validate as a
false-positive failure to assign. I'm not sure there's a sane way to handle that case that
would be better than the result you saw.
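
To make the cases in that final check concrete, here is a minimal Java sketch. It is not
Accumulo's code: BulkFailureCheck, MetadataReader, tabletsReferencing, and Action are
hypothetical names made up for illustration. It separates "the metadata read failed" from
"no tablet references the file", and it assumes, as one possible policy only, that an
unreadable metadata table leads to a copy rather than a move so the original file stays put.

```java
import java.util.Set;

public class BulkFailureCheck {

    /** What to do with a file once assignment retries are exhausted. */
    public enum Action { LEAVE_IN_PLACE, COPY_TO_FAILURES, MOVE_TO_FAILURES }

    /** Hypothetical stand-in for a metadata table read; not an Accumulo API. */
    public interface MetadataReader {
        /** Returns the tablets that currently reference the file; throws if the read fails. */
        Set<String> tabletsReferencing(String file) throws Exception;
    }

    public static Action decide(MetadataReader reader, String file, Set<String> overlappingTablets) {
        Set<String> referencing;
        try {
            referencing = reader.tabletsReferencing(file);
        } catch (Exception e) {
            // The validation itself failed, so the assignment state is unknown, not "failed".
            // Assumption: copying is the conservative choice because the original is untouched.
            return Action.COPY_TO_FAILURES;
        }
        long assigned = overlappingTablets.stream().filter(referencing::contains).count();
        if (assigned == overlappingTablets.size()) {
            return Action.LEAVE_IN_PLACE;   // every overlapping tablet has the file
        }
        if (assigned > 0) {
            return Action.COPY_TO_FAILURES; // partial success: tablets that have it keep the original
        }
        return Action.MOVE_TO_FAILURES;     // no tablet references the file
    }
}
```

In the scenario from this issue, the first attempt succeeded, so a successful metadata read
would report the file as referenced and the sketch would not move it. Whether copying is
acceptable when the read fails is exactly the open question above.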

> Tablet left in bad state after bulk import timeout
> --------------------------------------------------
>
>                 Key: ACCUMULO-4542
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4542
>             Project: Accumulo
>          Issue Type: Bug
>    Affects Versions: 1.7.2
>            Reporter: John Vines
>
> On a cluster we saw a large number of network issues at one point. The cause still has not
> been pinpointed, but it did result in us seeing a lot of rpc exceptions and the like.
> While these network issues happened, a bulk import was kicked off for a single file.
> This single file was assigned to two tablets (which both happened to be on the same server).
> Unfortunately, in the 3 attempts bulk import made to assign this file to this tablet, there
> were 3 rpc exceptions due to a socket timeout. After the three failures the bulk import went
> ahead and moved this file to the failures directory and carried on.
> Unfortunately, this file was actually assigned to the tablet successfully on the first
> attempt. The following 2 attempts logged that the server had already been assigned this
> file. Shortly afterward a query came in (and then later major compactions), which then
> complained that the file could not be found because the bulk import had moved it to the
> failures directory.
> I think in this event we need some sort of final validation that the record didn't end up
> in the metadata table before we move it to the failures directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
