crunch-user mailing list archives

From Everett Anderson <ever...@nuna.com>
Subject Re: LeaseExpiredExceptions and temp side effect files
Date Tue, 18 Aug 2015 22:18:08 GMT
Hi,

I verified that the pipeline succeeds on the same cc2.8xlarge hardware when
setting crunch.max.running.jobs to 1. At this point I generally feel that
the pipeline application logic itself is sound. It could be that this is
just taxing these machines too hard and we need to increase the number of
retries?

It reliably fails on this hardware when crunch.max.running.jobs is set to
its default.
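
For reference, here's roughly how we're setting it when forcing one job at
a time -- a minimal sketch of a driver rather than our actual pipeline
code, with placeholder class and path names:

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

public class SingleJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Only allow one MapReduce job of the Crunch plan to run at a time.
    conf.setInt("crunch.max.running.jobs", 1);

    Pipeline pipeline = new MRPipeline(SingleJobDriver.class, conf);
    // Placeholder read/write; our real pipeline has many more stages.
    PCollection<String> lines = pipeline.readTextFile("/input/placeholder");
    pipeline.writeTextFile(lines, "/output/placeholder");
    pipeline.done();
  }
}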

Can you explain a little about what the /tmp/crunch-XXXXXXX files are and
how Crunch uses side effect files? Do you know if HDFS would clean up those
directories from underneath Crunch?
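
To make sure we mean the same thing by "side effect files": I mean files a
task writes under its attempt's temporary work directory via
getWorkOutputPath, roughly like this sketch against the plain MapReduce API
(not Crunch internals; the reducer and file names here are made up):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SideEffectReducer extends Reducer<Text, Text, NullWritable, NullWritable> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // The task attempt's temporary work directory, e.g.
    // <output>/_temporary/1/_temporary/attempt_.../ -- promoted by the
    // output committer only if the attempt succeeds.
    Path workDir = FileOutputFormat.getWorkOutputPath(context);
    Path sideFile = new Path(workDir, "side-effect-" + key.toString()); // made-up name
    FileSystem fs = sideFile.getFileSystem(context.getConfiguration());
    try (FSDataOutputStream out = fs.create(sideFile)) {
      out.writeUTF("placeholder payload for " + key);
    }
  }
}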

There are usually 4 failed applications, all failing in their reduces. The
failures seem to be of one of the following three kinds -- (1) no lease on
a <side effect file>, (2) file not found for a </tmp/crunch-XXXXXXX> file,
(3) SocketTimeoutException.

Examples:

[1] No lease exception

Error: org.apache.crunch.CrunchRuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003: File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have any open files.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003: File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have any open files.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
	at com.sun.proxy.$Proxy13.complete(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
	at com.sun.proxy.$Proxy13.complete(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
	at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
	at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
	at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
	at org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
	at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180)
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
	... 9 more


[2] File does not exist

2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: java.io.FileNotFoundException: File does not exist:
/tmp/crunch-4694113/p470/REDUCE
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
	... 9 more

[3] SocketTimeoutException

Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)

On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <everett@nuna.com> wrote:

>
>
> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jwills@cloudera.com> wrote:
>
>> Hey Everett,
>>
>> Initial thought-- there are lots of reasons for lease expired exceptions,
>> and they're usually more symptomatic of other problems in the pipeline. Are
>> you sure none of the jobs in the Crunch pipeline on the non-SSD instances
>> are failing for some other reason? I'd be surprised if no other errors
>> showed up in the app master, although there are reports of some weirdness
>> around LeaseExpireds when writing to S3-- but you're not doing that here,
>> right?
>>
>
> We're reading from and writing to HDFS here. (We've copied the input from
> S3 to HDFS in another step.)
>
> There are a few exceptions in the logs. Most seem related to missing temp
> files.
>
> Let me see if I can reproduce it with crunch.max.running.jobs set to 1 to
> try to narrow down the originating failure.
>
>
>
>
>>
>> J
>>
>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <everett@nuna.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I recently started trying to run our Crunch pipeline on more data and
>>> have been trying out different AWS instance types in anticipation of our
>>> storage and compute needs.
>>>
>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with the
>>> CRUNCH-553 <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>
>>> Our pipeline finishes fine in these cluster configurations:
>>>
>>>    - 50 c3.4xlarge Core, 0 Task
>>>    - 10 c3.8xlarge Core, 0 Task
>>>    - 25 c3.8xlarge Core, 0 Task
>>>
>>> However, it always fails on the same data when using 10 cc2.8xlarge Core
>>> instances.
>>>
>>> The biggest obvious hardware difference is that the cc2.8xlarges use
>>> hard disks instead of SSDs.
>>>
>>> While it's a little hard to track down the exact originating failure, I
>>> think it's from errors like:
>>>
>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>> org.apache.crunch.CrunchRuntimeException:
>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>> No lease on
>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>> File does not exist. Holder
>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>>> any open files.
>>>
>>> Those paths look like these side effect files
>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>> .
>>>
>>> Would Crunch have generated applications that depend on side effect
>>> paths as input across MapReduce applications, and could something in
>>> HDFS be cleaning up those paths, unaware of the higher-level
>>> dependencies? AWS configures Hadoop differently for each instance type,
>>> and might have more aggressive cleanup settings on HDs, though this is
>>> a very uninformed hypothesis.
>>>
>>> A sample full log is attached.
>>>
>>> Thanks for any guidance!
>>>
>>> - Everett
>>>
>>>
>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>

-- 
*DISCLAIMER:* The contents of this email, including any attachments, may 
contain information that is confidential, proprietary in nature, protected 
health information (PHI), or otherwise protected by law from disclosure, 
and is solely for the use of the intended recipient(s). If you are not the 
intended recipient, you are hereby notified that any use, disclosure or 
copying of this email, including any attachments, is unauthorized and 
strictly prohibited. If you have received this email in error, please 
notify the sender of this email. Please delete this and all copies of this 
email from your system. Any opinions either expressed or implied in this 
email and all attachments, are those of its author only, and do not 
necessarily reflect those of Nuna Health, Inc.
