accumulo-notifications mailing list archives

From "William Slacum (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-3727) FileNotFoundException on failed/data during recovery
Date Fri, 11 Mar 2016 18:27:52 GMT


William Slacum commented on ACCUMULO-3727:

I suspect that a failed recovery followed by a reassignment leaves the tserver in a state
where it doesn't check for the "failed" marker. The fact that the marker file's name is just
a magic string, not a named constant, strengthens that suspicion.

Deleting the failed file and then proceeding is pretty much the only option I found to get
going afterwards. I don't believe any data should be lost, since recovery shouldn't be
destructive to the WAL, but I don't believe there's a direct test for this case.
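
To make the idea concrete, here is a minimal sketch of what a guarded check could look like.
It is not the Accumulo code: it uses java.nio.file on a local directory as a stand-in for
HDFS, and the class RecoveryMarkers, the constant FAILED_MARKER, the three helper methods,
and the directory name "part-r-00000" are all hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: local-filesystem stand-in for the HDFS recovery directory.
final class RecoveryMarkers {
  // Hypothetical named constant replacing the "failed" string literal.
  static final String FAILED_MARKER = "failed";

  private RecoveryMarkers() {}

  /** True if an earlier sort attempt left the failure marker in recoveryDir. */
  static boolean hasFailedMarker(Path recoveryDir) {
    return Files.exists(recoveryDir.resolve(FAILED_MARKER));
  }

  /** Lists the children of recoveryDir, skipping the failure marker. */
  static List<Path> recoveredParts(Path recoveryDir) throws IOException {
    List<Path> parts = new ArrayList<>();
    try (DirectoryStream<Path> children = Files.newDirectoryStream(recoveryDir)) {
      for (Path child : children) {
        if (!child.getFileName().toString().equals(FAILED_MARKER)) {
          parts.add(child);
        }
      }
    }
    return parts;
  }

  /** Removes the marker so recovery can be retried; returns whether it existed. */
  static boolean clearFailedMarker(Path recoveryDir) throws IOException {
    return Files.deleteIfExists(recoveryDir.resolve(FAILED_MARKER));
  }

  public static void main(String[] args) throws IOException {
    // Build a throwaway directory holding a marker and one made-up data directory.
    Path dir = Files.createTempDirectory("recovery");
    Files.createFile(dir.resolve(FAILED_MARKER));
    Files.createDirectory(dir.resolve("part-r-00000"));

    System.out.println("marker present: " + hasFailedMarker(dir));
    System.out.println("data entries: " + recoveredParts(dir));
    System.out.println("marker cleared: " + clearFailedMarker(dir));
  }
}
```

The point of the sketch is only that the marker name lives in one place and that any code
listing the recovery directory filters it out, so the marker is never opened as if it held
sorted data. Clearing the marker corresponds to the manual workaround described above.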

> FileNotFoundException on failed/data during recovery
> ----------------------------------------------------
>                 Key: ACCUMULO-3727
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.5.2
>            Reporter: William Slacum
> Overnight there was a mass failure of Accumulo (most likely due to too many mappers
> for a job). After restarting Accumulo, one of the metadata tablets failed to load. There was
> a log message showing a `FileNotFoundException` on the file
> `hdfs:///accumulo/recovery/<log id>/failed/data`. Removing the `<log id>` directory from
> HDFS seemed to clear the jam and things came back (though potentially with data loss).
> I wanted to investigate why, somewhere in the plumbing of `TabletServer`, `TabletServerLogger`,
> and `SortedLogRecovery`, an attempt was made to use the `failed` file.
> I see in `SortedLogRecovery#sort` where the marker file gets created:
> {code}
> public void sort(String name, Path srcPath, String destPath) {
> ...
>       } catch (Throwable t) {
>         try {
>           // parent dir may not exist
>           fs.mkdirs(new Path(destPath));
>           fs.create(new Path(destPath, "failed")).close();
>         } catch (IOException e) {
>           log.error("Error creating failed flag file " + name, e);
>         }
>         log.error(t, t);
>       } finally {
> ...
> {code}
> I have not stepped out to figure out where/why the `failed` file gets included in the
> list of recovered data directories.

