accumulo-notifications mailing list archives

From "Luke Brassard (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ACCUMULO-2333) "File does not exist" error during client ingest with agitation
Date Fri, 07 Feb 2014 23:37:19 GMT

     [ https://issues.apache.org/jira/browse/ACCUMULO-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luke Brassard updated ACCUMULO-2333:
------------------------------------

    Description: 
While running the agitator during a client ingest test, encountered a "File does not exist"
error that stuck in the Table Problems section of the monitor page. 

Confirmed that the file in question had been compacted away previously.

While it appears that no data was lost, it is strange that the error surfaced and then seemed
to right itself shortly thereafter (though the Table Problems section was not updated).

Here is the stacktrace from the Monitor:
{code}
File does not exist: /accumulo/tables/2/t-00000dj/F0000dwj.rf
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1540)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1483)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1463)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1437)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
{code}

_UPDATE (from comments below):_
On a cluster with 15 slaves, two of the participating tablet servers had logs referencing
the file.

slave05, one of the two, was killed by the agitator at 20:33 and restarted at 20:43, at which
point it immediately compacted F0000dwj.rf. That file had been created by slave03 at 20:34,
while slave05 was offline. slave03, which seems to have previously been responsible for the
file, then tried to perform a MajC at 21:10, which caused the exceptions to appear in the
monitor. It seems that the master was also killed at 21:02 and revived at 21:05. It appears
that the "missing" extent was never unloaded and re-assigned before the failure.
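
To make the ordering easier to check, here is a rough sketch that lists the events above by
time and confirms that the 20:43 compaction on slave05 precedes the 21:10 MajC attempt on
slave03. The class, method, and field names (AgitationTimeline, reportedEvents,
compactedBeforeMajC) are hypothetical and not part of Accumulo; the times and servers are only
those quoted in this description, not parsed from the attached logs.

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for reasoning about this report; not Accumulo code.
class AgitationTimeline {

    /** One reported event: wall-clock time, server, and what happened. */
    static final class Event {
        final LocalTime time;
        final String server;
        final String what;

        Event(int hour, int minute, String server, String what) {
            this.time = LocalTime.of(hour, minute);
            this.server = server;
            this.what = what;
        }
    }

    /** Events as stated in the description, returned sorted by time. */
    static List<Event> reportedEvents() {
        List<Event> events = new ArrayList<>();
        // Deliberately added out of order; the sort below establishes the sequence.
        events.add(new Event(21, 10, "slave03", "attempted MajC"));
        events.add(new Event(20, 43, "slave05", "restarted and compacted F0000dwj.rf"));
        events.add(new Event(20, 33, "slave05", "killed by agitator"));
        events.add(new Event(21, 5, "master", "revived"));
        events.add(new Event(20, 34, "slave03", "created F0000dwj.rf"));
        events.add(new Event(21, 2, "master", "killed"));
        events.sort(Comparator.comparing((Event e) -> e.time));
        return events;
    }

    /** Time of the first event on a server whose description contains the keyword. */
    static LocalTime timeOf(String server, String keyword) {
        for (Event e : reportedEvents()) {
            if (e.server.equals(server) && e.what.contains(keyword)) {
                return e.time;
            }
        }
        throw new IllegalArgumentException("no such event: " + server + " / " + keyword);
    }

    /** True if slave05 compacted the file before slave03 attempted its MajC. */
    static boolean compactedBeforeMajC() {
        return timeOf("slave05", "compacted").isBefore(timeOf("slave03", "MajC"));
    }

    public static void main(String[] args) {
        for (Event e : reportedEvents()) {
            System.out.println(e.time + "  " + e.server + "  " + e.what);
        }
        System.out.println("compacted before MajC: " + compactedBeforeMajC());
    }
}
```

If the ordering holds, the MajC on slave03 would have been working from a file list that no
longer matched what was in HDFS, which is consistent with the extent never having been
unloaded from slave03.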

There were RuntimeExceptions reported by slave03 at about 20:34 as well, so there's a chance
that slave03's actions at that time did not complete cleanly.

I'm attaching logs for the time and pertinent servers.

  was:
While running the agitator during a client ingest test, encountered a "File does not exist"
error that stuck in the Table Problems section of the monitor page. 

Confirmed that the file in question had been compacted away previously.

While it appears that no data was lost, it is strange that the error surfaced and then seemed
to right itself shortly thereafter (though the Table Problems section was not updated).

Here is the stacktrace from the Monitor:
{code}
File does not exist: /accumulo/tables/2/t-00000dj/F0000dwj.rf
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1540)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1483)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1463)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1437)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
{code}


> "File does not exist" error during client ingest with agitation
> ---------------------------------------------------------------
>
>                 Key: ACCUMULO-2333
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2333
>             Project: Accumulo
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>            Reporter: Luke Brassard
>         Attachments: master.log, tserver_slave03.log, tserver_slave05.log



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
