accumulo-notifications mailing list archives

From "Eric Newton (JIRA)" <>
Subject [jira] [Resolved] (ACCUMULO-2333) "File does not exist" error during client ingest with agitation
Date Mon, 10 Feb 2014 17:57:19 GMT


Eric Newton resolved ACCUMULO-2333.

    Resolution: Duplicate
      Assignee: Eric Newton

Evidence: tserver_slave03 completed a minor compaction of tablet 2;\x00\x00\x02;\x00\x00\x01
at 2014-02-06 20:34:58,160, but that tablet had never been opened successfully. Two tablet
objects for the same extent were live in the tserver at the same time, which should never happen.
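The failure mode described above, two live tablet objects for one extent on the same tserver, can be guarded against with a registry keyed by extent. The following is a minimal illustrative sketch, not Accumulo's actual code; the class and method names (`TabletRegistry`, `open`, `close`) are hypothetical:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch, not Accumulo's real tserver code. It shows how a
// second tablet object for an already-open extent could be rejected.
public class TabletRegistry {
    // Keyed by the extent string, e.g. "2;\\x00\\x00\\x02;\\x00\\x00\\x01"
    private final ConcurrentMap<String, Object> openTablets = new ConcurrentHashMap<>();

    /** Returns true if the tablet was registered; false if the extent is already open. */
    public boolean open(String extent, Object tablet) {
        return openTablets.putIfAbsent(extent, tablet) == null;
    }

    /** Unregisters the extent, e.g. after a successful unload. */
    public void close(String extent) {
        openTablets.remove(extent);
    }
}
```

With such a guard, the second open attempt fails fast instead of leaving two tablet objects racing on the same files.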

> "File does not exist" error during client ingest with agitation
> ---------------------------------------------------------------
>                 Key: ACCUMULO-2333
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>            Reporter: Luke Brassard
>            Assignee: Eric Newton
>         Attachments: master.log, tserver_slave03.log, tserver_slave05.log
> While running the agitator during a client ingest test, we encountered a "File does not
exist" error that became stuck in the Table Problems section of the monitor page.
> Confirmed that the file in question had been compacted away previously.
> While it appears that no data was lost, it is strange that the error surfaced and then
seemed to right itself shortly thereafter (though the Table Problems section was never updated).
> Here is the stacktrace from the Monitor:
> {code}
> File does not exist: /accumulo/tables/2/t-00000dj/F0000dwj.rf
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$
> at org.apache.hadoop.ipc.RPC$
> at org.apache.hadoop.ipc.Server$Handler$
> at org.apache.hadoop.ipc.Server$Handler$
> at Method) at at
> at org.apache.hadoop.ipc.Server$
> {code}
> *UPDATE (from comments below):*
> On a cluster with 15 slaves, two of the participating tablet servers had logs referencing
the file.
> slave05 was killed by the agitator at 20:33 and restarted at 20:43, whereupon it immediately
compacted F0000dwj.rf. That file had been created by slave03 at 20:34, while slave05 was
offline. slave03, which appears to have previously been responsible for the file, then tried
to perform a MajC at 21:10, which caused the exceptions to appear in the monitor. The master
was also killed at 21:02 and revived at 21:05. It appears that the "missing" extent was never
unloaded and re-assigned before the failure.
> There were RuntimeExceptions reported by slave03 at about 20:34 as well, so there's a
chance that slave03's actions at that time did not complete cleanly.
> I'm attaching logs for the time and pertinent servers.
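The symptom in the timeline above is a MajC referencing a file another server had already compacted away. One defensive measure is to re-verify that candidate files still exist before starting the compaction. A minimal sketch, using `java.nio.file` for illustration (real Accumulo code would check HDFS via Hadoop's `FileSystem.exists()`; the class and method names here are hypothetical):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: production code would consult Hadoop's
// FileSystem against HDFS paths such as
// /accumulo/tables/2/t-00000dj/F0000dwj.rf.
public class CompactionFileCheck {
    /** Keeps only the candidate files that are still present. */
    public static List<Path> filterExisting(List<Path> candidates) {
        List<Path> present = new ArrayList<>();
        for (Path p : candidates) {
            if (Files.exists(p)) {
                present.add(p);
            }
        }
        return present;
    }
}
```

Such a check would not fix the underlying race (the stale tablet object holding an outdated file set), but it would keep the spurious "File does not exist" errors out of the monitor.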

This message was sent by Atlassian JIRA
