Subject: Re: LeaseExpiredExceptions and temp side effect files
From: Everett Anderson
To: user@crunch.apache.org
Cc: Jeff Quinn
Date: Fri, 21 Aug 2015 13:03:38 -0700

Hey,

Jeff graciously agreed to try it out.
I'm afraid we're still getting failures on that instance type. With 0.11 plus the patches, the cluster also ended up in a state where no new applications could be submitted afterwards.

The errors when running the pipeline seem to be similarly HDFS related. It's quite odd.

Examples when using 0.11 + the patches:

2015-08-20 23:17:50,455 WARN [Thread-38] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001" - Aborting...

2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167 (inode 83784): File does not exist. [Lease.  Holder: DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1, pendingcreates: 24]
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
        at org.apache.hadoop.ipc.Client.call(Client.java:1468)
        at org.apache.hadoop.ipc.Client.call(Client.java:1399)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
        at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
        at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)

2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167" - Aborting...
2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Abandoning BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
2015-08-20 23:34:59,278 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.55.1.103:50010
2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Unable to create new block.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001" - Aborting...
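(Side note: since the "Bad connect ack" / "Excluding datanode" errors suggest datanodes dropping out of write pipelines under load, one thing we may try is loosening the HDFS client's pipeline-recovery settings. A hedged sketch only -- property names are from Hadoop 2.x's hdfs-default.xml as I understand it, and the values below are illustrative guesses, not recommendations:)

```xml
<!-- Sketch: client-side HDFS settings sometimes tuned when datanodes
     drop out of write pipelines under heavy load. Illustrative values. -->
<property>
  <name>dfs.client.block.write.retries</name>
  <value>10</value> <!-- the stock default is 3 -->
</property>
<property>
  <!-- governs whether a failed datanode in the pipeline gets replaced -->
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>ALWAYS</value>
</property>
```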
2015-08-20 23:34:59,279 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.crunch.CrunchRuntimeException: java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
        at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
        at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
Caused by: java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)

On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills wrote:

> Curious how this went. :)
>
> On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson wrote:
>
>> Sure, let me give it a try. I'm going to take 0.11 and patch it with
>>
>> https://issues.apache.org/jira/browse/CRUNCH-553
>> https://issues.apache.org/jira/browse/CRUNCH-517
>>
>> as we also rely on 517.
>>
>> On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills wrote:
>>
>>> (In particular, I'm wondering if something in CRUNCH-481 is related to
>>> this problem.)
>>>
>>> On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills wrote:
>>>
>>>> Hey Everett,
>>>>
>>>> Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the 553
>>>> patch?
>>>> Is that easy to do?
>>>>
>>>> J
>>>>
>>>> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I verified that the pipeline succeeds on the same cc2.8xlarge hardware
>>>>> when setting crunch.max.running.jobs to 1. I generally feel like the
>>>>> pipeline application logic itself is sound, at this point. It could be
>>>>> that this is just taxing these machines too hard and we need to
>>>>> increase the number of retries?
>>>>>
>>>>> It reliably fails on this hardware when crunch.max.running.jobs is set
>>>>> to its default.
>>>>>
>>>>> Can you explain a little what the /tmp/crunch-XXXXXXX files are, as
>>>>> well as how Crunch uses side effect files? Do you know if HDFS would
>>>>> clean up those directories from underneath Crunch?
>>>>>
>>>>> There are usually 4 failed applications, failing due to reduces. The
>>>>> failures seem to be one of the following three kinds: (1) No lease on
>>>>> ..., (2) File not found, (3) SocketTimeoutException.
>>>>>
>>>>> Examples:
>>>>>
>>>>> [1] No lease exception
>>>>>
>>>>> Error: org.apache.crunch.CrunchRuntimeException:
>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>> No lease on
>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>> File does not exist. Holder
>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not
>>>>> have any open files.
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>> at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>> at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>> No lease on
>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>> File does not exist. Holder
>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not
>>>>> have any open files. at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>>>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source)
>>>>> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>>>> at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>>>> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>>>> at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>>>> at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>>>> at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>>>> at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>>>> at org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>>>> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180)
>>>>> at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>>>> ... 9 more
>>>>>
>>>>> [2] File does not exist
>>>>>
>>>>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>>>> at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>>> at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>>>> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>>> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>>> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>>> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>>> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>>
>>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>> at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>>> at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>>> at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>>> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>>> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>>> at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>>> at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>>> at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>>> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>>> at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>>> at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>>> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>>> at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>>> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>>> at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>>> at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>>> ... 9 more
>>>>>
>>>>> [3] SocketTimeoutException
>>>>>
>>>>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>>>> at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>> at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>> Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>>>> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>>>> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>>>>> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>>>>> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>>>>> at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>> at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>> at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
>>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
>>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
>>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
>>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
>>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>>>>>
>>>>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson wrote:
>>>>>
>>>>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills wrote:
>>>>>>
>>>>>>> Hey Everett,
>>>>>>>
>>>>>>> Initial thought-- there are lots of reasons for lease expired
>>>>>>> exceptions, and they're usually more symptomatic of other problems in
>>>>>>> the pipeline. Are you sure none of the jobs in the Crunch pipeline on
>>>>>>> the non-SSD instances are failing for some other reason? I'd be
>>>>>>> surprised if no other errors showed up in the app master, although
>>>>>>> there are reports of some weirdness around LeaseExpireds when writing
>>>>>>> to S3-- but you're not doing that here, right?
>>>>>>
>>>>>> We're reading from and writing to HDFS, here.
>>>>>> (We've copied the input from S3 to HDFS in another step.)
>>>>>>
>>>>>> There are a few exceptions in the logs. Most seem related to missing
>>>>>> temp files.
>>>>>>
>>>>>> Let me see if I can reproduce it with crunch.max.running.jobs set to
>>>>>> 1 to try to narrow down the originating failure.
>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I recently started trying to run our Crunch pipeline on more data
>>>>>>>> and have been trying out different AWS instance types in
>>>>>>>> anticipation of our storage and compute needs.
>>>>>>>>
>>>>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched
>>>>>>>> with the CRUNCH-553 fix).
>>>>>>>>
>>>>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>>>>
>>>>>>>> - 50 c3.4xlarge Core, 0 Task
>>>>>>>> - 10 c3.8xlarge Core, 0 Task
>>>>>>>> - 25 c3.8xlarge Core, 0 Task
>>>>>>>>
>>>>>>>> However, it always fails on the same data when using 10 cc2.8xlarge
>>>>>>>> Core instances.
>>>>>>>>
>>>>>>>> The biggest obvious hardware difference is that the cc2.8xlarges
>>>>>>>> use hard disks instead of SSDs.
>>>>>>>>
>>>>>>>> While it's a little hard to track down the exact originating
>>>>>>>> failure, I think it's from errors like:
>>>>>>>>
>>>>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>> No lease on
>>>>>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>>>>> File does not exist. Holder
>>>>>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not
>>>>>>>> have any open files.
>>>>>>>>
>>>>>>>> Those paths look like these side effect files.
>>>>>>>>
>>>>>>>> Would Crunch have generated applications that depend on side effect
>>>>>>>> paths as input across MapReduce applications, and something in HDFS
>>>>>>>> is cleaning up those paths, unaware of the higher-level
>>>>>>>> dependencies? AWS configures Hadoop differently for each instance
>>>>>>>> type, and might have more aggressive cleanup settings on HDs, though
>>>>>>>> this is a very uninformed hypothesis.
>>>>>>>>
>>>>>>>> A sample full log is attached.
>>>>>>>>
>>>>>>>> Thanks for any guidance!
>>>>>>>>
>>>>>>>> - Everett
>>>>>>>>
>>>>>>>> *DISCLAIMER:* The contents of this email, including any
>>>>>>>> attachments, may contain information that is confidential,
>>>>>>>> proprietary in nature, protected health information (PHI), or
>>>>>>>> otherwise protected by law from disclosure, and is solely for the
>>>>>>>> use of the intended recipient(s). If you are not the intended
>>>>>>>> recipient, you are hereby notified that any use, disclosure or
>>>>>>>> copying of this email, including any attachments, is unauthorized
>>>>>>>> and strictly prohibited. If you have received this email in error,
>>>>>>>> please notify the sender of this email. Please delete this and all
>>>>>>>> copies of this email from your system. Any opinions either
>>>>>>>> expressed or implied in this email and all attachments, are those
>>>>>>>> of its author only, and do not necessarily reflect those of Nuna
>>>>>>>> Health, Inc.
>>>>>>>
>>>>>>> --
>>>>>>> Director of Data Science
>>>>>>> Cloudera
>>>>>>> Twitter: @josh_wills
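P.S. For the archives: the only configuration that has reliably completed for us on the HD-backed instances is capping Crunch at one running MapReduce job. Assuming the pipeline's main class goes through ToolRunner so generic -D options land in the job Configuration (the jar and class names below are placeholders, not our real ones), that looks something like:

```shell
# Run the pipeline with Crunch's job parallelism capped at 1, so concurrent
# MapReduce jobs (and their /tmp/crunch-* side effect files) don't overlap.
# "pipeline.jar" and "com.example.Pipeline" are hypothetical placeholders.
hadoop jar pipeline.jar com.example.Pipeline \
  -D crunch.max.running.jobs=1 \
  hdfs:///input hdfs:///output
```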
Hey,

Jeff graciously agreed to try it out.

I'm afraid we're still getting failures on that instance type, and with 0.11 plus the patches, the cluster ended up in a state where no new applications could be submitted afterwards.

The errors when running the pipeline seem to be similarly HDFS-related. It's quite odd.

Examples when using 0.11 + the patches:


2015-08-20 23:17:50,455 WARN [Thread-38] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001" - Aborting...

2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167 (inode 83784): File does not exist. [Lease.  Holder: DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1, pendingcreates: 24]
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

	at org.apache.hadoop.ipc.Client.call(Client.java:1468)
	at org.apache.hadoop.ipc.Client.call(Client.java:1399)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
	at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
	at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167" - Aborting...

2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Abandoning BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
2015-08-20 23:34:59,278 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.55.1.103:50010
2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Unable to create new block.
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001" - Aborting...

2015-08-20 23:34:59,279 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.crunch.CrunchRuntimeException: java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
Caused by: java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)









On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills <jwills@cloudera.com> wrote:
Curious how this went. :)

On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson <everett@nuna.com> wrote:
Sure, let me give it a try. I'm going to take 0.11 and patch it with


as we also rely on 517.



On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills <jwills@cloudera.com> wrote:
(In particular, I'm wondering if something in CRUNCH-481 is related to this problem.)
On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <jwills@cloudera.com> wrote:
Hey Everett,

Shot in the dark -- would you mind trying it w/0.11.0-hadoop2 w/the 553 patch? Is that easy to do?

J
On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <everett@nuna.com> wrote:
Hi,

I verified that the pipeline succeeds on the same cc2.8xlarge hardware when setting crunch.max.running.jobs to 1. I generally feel like the pipeline application logic itself is sound at this point. Could it be that this is just taxing these machines too hard and we need to increase the number of retries?

It reliably fails on this hardware when crunch.max.running.jobs is set to its default.
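For reference, pinning the parallelism is just a client-side property; a run along these lines is one way to do it. This is only a sketch: the property name is from this thread, but the jar, main class, and paths are hypothetical placeholders, and passing `-D` this way assumes the driver goes through ToolRunner/GenericOptionsParser.

```shell
# Hypothetical launch with Crunch's job-parallelism property pinned to 1,
# so only one MapReduce job of the pipeline runs at a time.
# Jar name, main class, and HDFS paths are placeholders.
hadoop jar our-pipeline.jar com.example.OurPipeline \
  -D crunch.max.running.jobs=1 \
  hdfs:///input hdfs:///output
```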
Can you explain a little what the /tmp/crunch-XXXXXXX files are, as well as how Crunch uses side effect files? Do you know if HDFS would clean up those directories from underneath Crunch?

There are usually 4 failed applications, failing in the reduces. The failures seem to be one of the following three kinds -- (1) No lease on <side effect file>, (2) File does not exist under /tmp/crunch-XXXXXXX, (3) SocketTimeoutException.

Examples:

[1] No lease exception

Error: org.apache.crunch.CrunchRuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003: File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have any open files.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003: File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have any open files.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
	at com.sun.proxy.$Proxy13.complete(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
	at com.sun.proxy.$Proxy13.complete(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
	at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
	at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
	at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
	at org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
	at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180)
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
	... 9 more


[2] File does not exist

2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
	... 9 more
[3] SocketTimeoutException
Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
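For what it's worth, the three signatures above are easy to tally from an aggregated log dump. A rough triage sketch (for illustration it writes a tiny sample log first; point LOG at a real YARN task-log dump instead -- the file name is hypothetical):

```shell
# Count each of the three failure signatures from this thread in a log dump.
LOG=task-logs.txt
# Sample data only, so the snippet is self-contained:
cat > "$LOG" <<'EOF'
namenode.LeaseExpiredException): No lease on /tmp/crunch-4694113/p662/out7-r-00003
Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
Caused by: java.net.SocketTimeoutException: 70000 millis timeout
EOF
# grep -c prints the number of matching lines per signature
for sig in 'LeaseExpiredException' 'File does not exist' 'SocketTimeoutException'; do
  printf '%s: %s\n' "$sig" "$(grep -c "$sig" "$LOG")"
done
```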












On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <everett@nuna.com> wrote:


On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jwills@cloudera.com> wrote:
Hey Everett,

Initial thought -- there are lots of reasons for lease expired exceptions, and they're usually more symptomatic of other problems in the pipeline. Are you sure none of the jobs in the Crunch pipeline on the non-SSD instances are failing for some other reason? I'd be surprised if no other errors showed up in the app master, although there are reports of some weirdness around LeaseExpireds when writing to S3 -- but you're not doing that here, right?

We're reading from and writing to HDFS here. (We've copied the input from S3 to HDFS in another step.)
There are a few exceptions in the logs. Most seem related to missing temp files.

Let me see if I can reproduce it with crunch.max.running.jobs set to 1 to try to narrow down the originating failure.



J

On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <everett@nuna.com> wrote:
Hi,
I recently started trying to run our Crunch pipeline on more data and have been trying out different AWS instance types in anticipation of our storage and compute needs.

I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with the CRUNCH-553 fix).

Our pipeline finishes fine in these cluster configurations:
  • 50 c3.4xlarge Core, 0 Task
  • 10 c3.8xlarge Core, 0 Task
  • 25 c3.8xlarge Core, 0 Task
However, it always fails on the same data when using 10 cc2.8xlarge Core instances.

The biggest obvious hardware difference is that the cc2.8xlarges use hard disks instead of SSDs.
While it's a little hard to track down the exact originating failure, I think it's from errors like:

2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1439499407003_0028_r_000153_1 - exited : org.apache.crunch.CrunchRuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153: File does not exist. Holder DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have any open files.

Those paths look like these side effect files.

Would Crunch have generated applications that depend on side effect paths as input across MapReduce applications, and something in HDFS is cleaning up those paths, unaware of the higher-level dependencies? AWS configures Hadoop differently for each instance type, and might have more aggressive cleanup settings on HDs, though this is a very uninformed hypothesis.
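One low-effort way to probe that hypothesis would be to dump cleanup-related HDFS settings on both cluster types and diff them. This is only a sketch: `fs.trash.interval` is one standard Hadoop key that could plausibly differ; whether EMR actually tunes it per instance type is an open question here.

```shell
# Compare a cleanup-related HDFS setting between the SSD and HD clusters.
# fs.trash.interval is a standard Hadoop key; run this on each cluster
# and diff the output (other candidate keys would need the same check).
hdfs getconf -confKey fs.trash.interval
```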

A sample full log is attached.

Thanks for any guidance!

- Everett


DISCLAIMER: The contents of this email, including any attachments, may contain information that is confidential, proprietary in nature, protected health information (PHI), or otherwise protected by law from disclosure, and is solely for the use of the intended recipient(s). If you are not the intended recipient, you are hereby notified that any use, disclosure or copying of this email, including any attachments, is unauthorized and strictly prohibited. If you have received this email in error, please notify the sender of this email. Please delete this and all copies of this email from your system. Any opinions either expressed or implied in this email and all attachments, are those of its author only, and do not necessarily reflect those of Nuna Health, Inc.


--
Director of Data Science














