accumulo-notifications mailing list archives

From "Bill Havanki (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-2466) Bulk randomwalk fails with bad key
Date Thu, 13 Mar 2014 17:55:43 GMT


Bill Havanki commented on ACCUMULO-2466:

There is a bulk import failure in the master shortly after the restart.

2014-03-12 14:10:17,330 [thrift.MasterClientService$Processor] ERROR: Internal error processing
java.lang.RuntimeException: Filesystem closed
Caused by: Filesystem closed
        at org.apache.hadoop.fs.FileSystem.create(
        at org.apache.accumulo.server.trace.TraceFileSystem.create(
        at org.apache.accumulo.server.fate.Fate$
        ... 2 more

The line in {{LoadFiles}} causing the problem:

FSDataOutputStream failFile = fs.create(new Path(errorDir, "failures.txt"), true);

Somehow the filesystem reference generated at the start of the action is closed before the
action is done. However, the exception is thrown after tablet servers are asked to do bulk
imports, and none of them indicate any trouble performing the bulk import, so I wonder why
that marker 18 didn't show up. I don't know enough about this mechanism to hazard a good guess,
but it could be that this error is not what caused the problem.
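One way a reference taken at the start of an action can end up closed is when the handle is a shared, cached instance and some other holder closes it. Hadoop's {{FileSystem.get}} does cache instances, but whether that is what happened in the master here is only an assumption. The sketch below is a plain-Java model of the hazard, not Hadoop or Accumulo code; {{CachedFs}} and the URI are hypothetical names used for illustration.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class SharedHandleDemo {

    /** Hypothetical stand-in for a cached filesystem client; not a Hadoop class. */
    static final class CachedFs {
        private static final Map<String, CachedFs> CACHE = new HashMap<>();
        private boolean open = true;

        /** Returns the single shared instance for a URI, as a caching factory would. */
        static synchronized CachedFs get(String uri) {
            return CACHE.computeIfAbsent(uri, u -> new CachedFs());
        }

        /** Closes the instance for every holder, since all of them share it. */
        synchronized void close() {
            open = false;
        }

        /** Fails once any holder of the shared instance has closed it. */
        synchronized String create(String path) throws IOException {
            if (!open) {
                throw new IOException("Filesystem closed");
            }
            return path;
        }
    }

    /** Runs the scenario and returns the failure message, or "ok" if create succeeds. */
    static String run() {
        CachedFs actionFs = CachedFs.get("hdfs://example"); // taken at the start of the action
        CachedFs otherFs = CachedFs.get("hdfs://example");  // same object, different holder
        otherFs.close();                                     // assumed: closed elsewhere, e.g. on restart
        try {
            actionFs.create("failures.txt");
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "Filesystem closed"
    }
}
```

In this model the action never closes its own reference, yet its later {{create}} call still fails, which matches the shape of the stack trace above without proving the cause.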

Still, it's my best lead. There are exactly two other {{waitForTableOperation}} failures right
after this one, but they fail due to interruption of some sort, before the tablet servers
are asked to import. Maybe the master is able to try again successfully for these.

> Bulk randomwalk fails with bad key
> ----------------------------------
>                 Key: ACCUMULO-2466
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master, test
>    Affects Versions: 1.4.4
>            Reporter: Bill Havanki
>              Labels: import, randomwalk, test
> Running bulk randomwalk against 1.4.5-SNAPSHOT, got this in verification:
> {noformat}
> Caused by: java.lang.Exception: Bad key at r00000 cf:000 [] 1394658887772 false 1
>         at org.apache.accumulo.server.test.randomwalk.bulk.Verify.visit(
> {noformat}
> Possible reasons:
> * ACCUMULO-2110, not backported to 1.4 or 1.5
> * master agitation
> I see in the logs three internal errors from imports that failed due to the masters being
> restarted. The failure timing is around 5 seconds after the masters restart. Example:
> {noformat}
> 12 14:10:17,580 [bulk.BulkMinusOne] ERROR: org.apache.accumulo.core.client.AccumuloException: Internal error processing waitForTableOperation
> org.apache.accumulo.core.client.AccumuloException: Internal error processing waitForTableOperation
>         at org.apache.accumulo.core.client.admin.TableOperationsImpl.doTableOperation(TableOperation
>         at org.apache.accumulo.core.client.admin.TableOperationsImpl.doTableOperation(TableOperation
>         at org.apache.accumulo.core.client.admin.TableOperationsImpl.importDirectory(TableOperations
>         at org.apache.accumulo.server.test.randomwalk.bulk.BulkPlusOne.bulkLoadLots(
> :99)
>         at org.apache.accumulo.server.test.randomwalk.bulk.BulkMinusOne.runLater(
> 9)
> ...
> Caused by: org.apache.thrift.TApplicationException: Internal error processing waitForTableOperation
> {noformat}
> Two BulkMinusOne and one BulkPlusOne failed, which may be why the offending row was at
> value 1.
> The {{TableOperationsImpl.waitForTableOperation}} method does not catch {{TApplicationException}},
> so the imports fail.
> I see lots of previous work on this sort of error in ACCUMULO-334 and ACCUMULO-2110.
> If anyone has troubleshooting tips I'd be happy to hear them.
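
The point in the description about {{waitForTableOperation}} not catching {{TApplicationException}} can be illustrated with a generic retry loop. This is a minimal sketch, not the Accumulo client code: {{TransientServerException}} is a hypothetical stand-in for the Thrift exception, {{callWithRetry}} is an invented helper, and whether retrying is the right fix for this issue is an open question.

```java
import java.util.concurrent.Callable;

public class RetryOnTransientError {

    /** Hypothetical stand-in for a server-side "internal error" reported to the client. */
    static final class TransientServerException extends Exception {
        TransientServerException(String message) {
            super(message);
        }
    }

    /**
     * Calls op until it succeeds or maxAttempts is reached, sleeping between tries.
     * Only TransientServerException is retried; any other exception propagates at once.
     */
    static <T> T callWithRetry(Callable<T> op, int maxAttempts, long sleepMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TransientServerException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (TransientServerException e) {
                last = e; // remember the failure and try again
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last; // every attempt failed with the transient error
    }

    /** Simulates an operation that fails twice (as if the server restarted) and then succeeds. */
    static String demo() {
        int[] calls = {0};
        try {
            String result = callWithRetry(() -> {
                if (++calls[0] < 3) {
                    throw new TransientServerException("Internal error processing waitForTableOperation");
                }
                return "done";
            }, 5, 10L);
            return result + " after " + calls[0] + " calls";
        } catch (Exception e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "done after 3 calls"
    }
}
```

A loop like this is only appropriate when the wrapped call is safe to repeat, for example waiting again on an operation that is already registered; blindly re-submitting an import could apply it twice.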

This message was sent by Atlassian JIRA
