accumulo-notifications mailing list archives

From "Eric Newton (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (ACCUMULO-2333) "File does not exist" error during client ingest with agitation
Date Mon, 10 Feb 2014 17:57:19 GMT

     [ https://issues.apache.org/jira/browse/ACCUMULO-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Newton resolved ACCUMULO-2333.
-----------------------------------

    Resolution: Duplicate
      Assignee: Eric Newton

Evidence: tserver_slave03 completes a minor compaction of tablet 2;\x00\x00\x02;\x00\x00\x01
at 2014-02-06 20:34:58,160, but this tablet was never successfully opened. There are two
tablet objects for the same extent alive in the tserver at the same time, which is bad.
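The invariant violated here is that a tablet server must hold at most one live tablet object per extent. A minimal sketch of how such double-loading can be refused with an atomic check-and-insert (all names here are hypothetical, not Accumulo's actual classes):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: TabletRegistry and tryOpen are invented names,
// not part of Accumulo. The point is the invariant from the comment above:
// never hold two tablet objects for the same extent at the same time.
class TabletRegistry {
    private final Map<String, Object> openTablets = new ConcurrentHashMap<>();

    /** Returns true only if this call won the right to open the extent. */
    boolean tryOpen(String extent) {
        // putIfAbsent is atomic, so a concurrent load of the same extent
        // (e.g. a reassignment racing a failed open) is refused here
        // instead of producing a second tablet object.
        return openTablets.putIfAbsent(extent, new Object()) == null;
    }

    void close(String extent) {
        openTablets.remove(extent);
    }
}
```

With such a guard, the second open attempt for 2;\x00\x00\x02;\x00\x00\x01 would have failed fast rather than leaving two tablet objects racing to compact the same files.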


> "File does not exist" error during client ingest with agitation
> ---------------------------------------------------------------
>
>                 Key: ACCUMULO-2333
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2333
>             Project: Accumulo
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>            Reporter: Luke Brassard
>            Assignee: Eric Newton
>         Attachments: master.log, tserver_slave03.log, tserver_slave05.log
>
>
> While running the agitator during a client ingest test, we encountered a "File does not
exist" error that stuck in the Table Problems section of the monitor page.
> Confirmed that the file in question had been compacted away previously.
> While it appears that no data was lost, it is strange that the error surfaced and then
seemed to right itself shortly thereafter, without the Table Problems section ever being updated.
> Here is the stacktrace from the Monitor:
> {code}
> File does not exist: /accumulo/tables/2/t-00000dj/F0000dwj.rf
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1540)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1483)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1463)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1437)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
> {code}
> *UPDATE (from comments below):*
> On a cluster with 15 slaves, two of the participating tablet servers had logs referencing
the file.
> slave05 was killed by the agitator at 20:33 and restarted at 20:43, at which point it
immediately compacted F0000dwj.rf. That file had been created by slave03 at 20:34, while
slave05 was offline. slave03, which seems to have previously been responsible for the file,
then tried to perform a MajC at 21:10, which caused the exceptions to appear in the monitor.
The master was also killed at 21:02 and revived at 21:05. It appears that the "missing"
extent was never unloaded and reassigned before the failure.
> slave03 also reported RuntimeExceptions at about 20:34, so there is a chance that its
actions at that time did not complete cleanly.
> I'm attaching logs for the relevant time window and servers.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
