accumulo-notifications mailing list archives

From "Christopher Tubbs (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-4542) Tablet left in bad state after bulk import timeout
Date Tue, 03 Jan 2017 22:53:58 GMT


Christopher Tubbs commented on ACCUMULO-4542:

This seems really hard to reproduce. [~kturner] tells me he believes there is a final check
before the file is moved, and that the import might do a copy instead of a move if assignment
has failed for some tablets but not others (in the case of the file overlapping several
tablets). If he's right, then it's possible there was a failure reading the metadata table to
confirm, and the system treated this failure to validate as a false-positive failure to assign.
I'm not sure there's a sane way to handle that case, at least one that is better than the
result you saw.
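
To make the distinction above concrete, here is a minimal sketch of a final check that keeps "could not validate" separate from "confirmed not assigned". It is not Accumulo code: the `Check` enum, the `decide_failure_action` function, and the action names are all hypothetical, and leaving the file in place when validation fails is only one conservative option, assumed here for illustration.

```python
from enum import Enum


class Check(Enum):
    """Hypothetical result of looking up one tablet in the metadata table."""
    LOADED = "loaded"          # metadata shows the tablet references the file
    NOT_LOADED = "not_loaded"  # metadata shows no reference to the file
    UNKNOWN = "unknown"        # the metadata read itself failed


def decide_failure_action(checks):
    """Decide what to do with a bulk file after all assignment RPCs failed.

    `checks` maps a tablet id to the Check result for that tablet.
    Returns one of "move_to_failures", "copy_to_failures", "leave_in_place".
    """
    if not checks:
        raise ValueError("at least one tablet check is required")
    results = set(checks.values())
    if Check.UNKNOWN in results:
        # Validation failed: treating this as "not loaded" would be the
        # false-positive failure to assign, so do not touch the file.
        return "leave_in_place"
    if results == {Check.NOT_LOADED}:
        # No tablet references the file, so moving it is safe.
        return "move_to_failures"
    if results == {Check.LOADED}:
        # Every tablet already has the file; nothing actually failed.
        return "leave_in_place"
    # Loaded by some tablets but not others: copy, so the tablets that
    # did load the file can still find it.
    return "copy_to_failures"
```

For the case in this issue, a first attempt that actually succeeded would show up as `Check.LOADED`, and a metadata read lost to the same network problems would show up as `Check.UNKNOWN`; in both cases the sketch leaves the file where it is.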

> Tablet left in bad state after bulk import timeout
> --------------------------------------------------
>                 Key: ACCUMULO-4542
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>    Affects Versions: 1.7.2
>            Reporter: John Vines
> On a cluster we saw a large number of network issues at one point. The cause still has not
> been pinpointed, but it did result in us seeing a lot of rpc exceptions and the like.
> While these network issues happened, a bulk import was kicked off for a single file.
> This single file was assigned to two tablets (which both happened to be on the same server).
> Unfortunately, in the 3 attempts bulk import made to assign this file to this tablet, there
> were 3 rpc exceptions due to a socket timeout. After the three failures the bulk import went
> ahead and moved this file to the failures directory and carried on.
> Unfortunately, this file was actually assigned to the tablet successfully on the first
> attempt. The following 2 attempts logged that the server had already been assigned this
> file. Shortly afterward a query came in (and then later major compactions), which then
> complained that the file could not be found because the bulk import moved it to the failures.
> I think in this event we need some sort of final validation that the record didn't end up
> in the metadata table before we move it to the failures.

