hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Khaja Hussain (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6312) Hive fails due to stale proxy in ClientServiceDelegate
Date Fri, 01 May 2015 15:56:07 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523360#comment-14523360
] 

Khaja Hussain commented on MAPREDUCE-6312:
------------------------------------------

We had very similar issue. And found out the /var directory on namenode was full and namenode
came down which caused our hive query to fail with this error. I have the details below.

Error Log from Hive:
**************************

2015-05-01 08:51:35,774 Stage-4 map = 100%,  reduce = 100%, Cumulative CPU 1818.78 sec
java.io.IOException: Could not find status of job:job_1422659026472_30162
        at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:294)
        at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:547)
        at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:426)
        at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1508)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1275)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1093)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:916)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:906)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:359)
        at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:456)
        at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:466)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:749)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Ended Job = job_1422659026472_30162 with exception 'java.io.IOException(Could not find status
of job:job_1422659026472_30162)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 3  Reduce: 5   Cumulative CPU: 5376.95 sec   HDFS Read: 4371724652 HDFS Write:
3100170286 SUCCESS


Log from NameNode:
*****************************

log4j:ERROR Failed to flush writer,
java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:282)
        at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
        at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
        at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276)
        at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
        at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
        at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:59)
        at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:324)
        at org.apache.log4j.RollingFileAppender.subAppend(RollingFileAppender.java:276)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
        at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
        at org.apache.log4j.Category.callAppenders(Category.java:206)
        at org.apache.log4j.Category.forcedLog(Category.java:391)
        at org.apache.log4j.Category.log(Category.java:856)
        at org.apache.commons.logging.impl.Log4JLogger.info(Log4JLogger.java:176)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2027)
log4j:ERROR Failed to flush writer,
java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:282)
        at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
        at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
        at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276)
        at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
        at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
        at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:59)
        at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:324)
        at org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:369)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
        at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
        at org.apache.log4j.Category.callAppenders(Category.java:206)
        at org.apache.log4j.Category.forcedLog(Category.java:391)
        at org.apache.log4j.Category.log(Category.java:856)
        at org.apache.commons.logging.impl.Log4JLogger.info(Log4JLogger.java:176)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$DefaultAuditLogger.logAuditMessage(FSNamesystem.java:7792)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$DefaultAuditLogger.logAuditEvent(FSNamesystem.java:7787)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.logAuditEvent(FSNamesystem.java:340)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.logAuditEvent(FSNamesystem.java:320)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.logAuditEvent(FSNamesystem.java:314)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3532)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileLinkInfo(NameNodeRpcServer.java:796)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileLinkInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:780)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
        at java.security.AccessController.doPrivileged(Native Method)

Logs from YARN from one of our datanodes:
*********************************************************

2015-05-01 08:47:39,234 ERROR logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:uploadLogsForContainer(115))
- Cannot create writer for app application_1422659026472_30160. Disabling log-aggregation
for this app.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException):
Cannot create file/app-logs/aiqa/logs/application_1422659026472_30160/lpwhp6.npd.com_45454.tmp.
Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually.
NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to
safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1207)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2235)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2190)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:526)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:354)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

        at org.apache.hadoop.ipc.Client.call(Client.java:1410)
        at org.apache.hadoop.ipc.Client.call(Client.java:1363)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at $Proxy75.create(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor58.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
        at $Proxy75.create(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:258)

> Hive fails due to stale proxy in ClientServiceDelegate
> ------------------------------------------------------
>
>                 Key: MAPREDUCE-6312
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6312
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.5.0
>            Reporter: Radim Kubacki
>
> ClientServiceDelegate initializes its realProxy field to AMProxy for a new or running
job. Later when the job finishes it will not update this proxy to query history server and
AM will not return valid data for this job.
> We found this while investigating https://issues.cloudera.org/browse/DISTRO-631 that
describes Hive failure because it uses loop like 
> {code}
>   progress(JobClient jc, RunningJob rj) { ...
>         while (!rj.isComplete() || (extraRounds > 0)) {
>             try {
>                 Thread.sleep(1000);
>             } catch (InterruptedException e) {
>             }
>             RunningJob newRj = jc.getJob(rj.getID());
>             if (newRj == null) {
>                 // under exceptional load, hadoop may not be able to look up status
>                 // of finished jobs (because it has purged them from memory). From
>                 // hive's perspective - it's equivalent to the job having failed.
>                 // So raise a meaningful exception
>                 throw new IOException("Could not find status of job:" + rj.getID());
>             } else {
>                 rj = newRj;
>             }
>         }
> {code}
> In this snippet JobClient.getJob will try to create RunningJob instance referring to
job file in /user/$USER/.staging even when job is finished and the file is moved to /user/history/done
(or /user/history/done_intermediate). 
> Note that Hive queries can succeed if there is a timing where HDFS performs actual file
delete with a delay.
> We can try to write a patch if there is an agreement that this should be fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message