From: Josh Wills
Date: Tue, 29 Sep 2015 16:46:13 -0400
Subject: Re: LeaseExpiredExceptions and temp side effect files
To: user@crunch.apache.org
Cc: Jeff Quinn, Rahul Gupta-Iwasaki

Yeah, that makes sense to me -- not totally trivial to do, but it should be
possible.

J

On Tue, Sep 29, 2015 at 4:42 PM, Everett Anderson wrote:

> Hey,
>
> We have some leads. Increasing the datanode memory seems to help the
> immediate issue.
>
> However, we need a solution to our buildup of temporary outputs. We're
> exploring segmenting our pipeline with run()/cleanup() calls.
>
> I'm curious, though --
>
> Do you think it'd be possible for us to make a Crunch modification to
> optionally and actively clean up temporary outputs? It seems like the
> planner would know what those are.
>
> A temporary output would be any PCollection that isn't referenced
> outside of Crunch (or perhaps any that isn't explicitly marked as cached).
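[Editor's sketch] The run()/cleanup() segmentation mentioned above might look roughly like the following. This is a minimal illustration, not code from the thread: the class and DoFn names and paths are hypothetical, `Stage1Fn`/`Stage2Fn` are assumed to be `DoFn<String, String>` implementations defined elsewhere, and `cleanup(boolean force)` is assumed to behave as in 0.11-era MRPipeline, deleting temp outputs that are no longer referenced.

```java
// Hypothetical sketch: segment a Crunch pipeline with run()/cleanup() so
// temporary outputs under /tmp/crunch-XXXXXXX are reclaimed between stages
// instead of accumulating until done(). Requires the Crunch and Hadoop jars.
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

public class SegmentedPipeline {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(SegmentedPipeline.class, new Configuration());

    // Stage 1: write an explicit (non-temporary) checkpoint and run it now.
    PCollection<String> stage1 = pipeline.readTextFile(args[0])
        .parallelDo(new Stage1Fn(), Writables.strings());
    pipeline.writeTextFile(stage1, args[1]);
    pipeline.run();

    // Ask Crunch to delete temp side-effect files no longer referenced by
    // any remaining target (force = false leaves live intermediates alone).
    pipeline.cleanup(false);

    // Stage 2: read back from the checkpoint on HDFS and finish.
    PCollection<String> stage2 = pipeline.readTextFile(args[1])
        .parallelDo(new Stage2Fn(), Writables.strings());
    pipeline.writeTextFile(stage2, args[2]);
    pipeline.done(); // final run plus full temp-directory cleanup
  }
}
```

The design point is that each `run()` forces materialization at a checkpoint the user owns, so the planner no longer needs the temp files behind it.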
>
>
> On Thu, Sep 24, 2015 at 5:46 PM, Josh Wills wrote:
>
>> Hrm. If you never call Pipeline.done, you should never clean up the
>> temporary files for the job...
>>
>> On Thu, Sep 24, 2015 at 5:44 PM, Everett Anderson wrote:
>>
>>> While we tried to take comfort in the fact that we'd seen this only on
>>> HD-based cc2.8xlarges, I'm afraid we're now seeing it when processing
>>> larger amounts of data on SSD-based c3.8xlarges.
>>>
>>> My two hypotheses are
>>>
>>> 1) Somehow these temp files are getting cleaned up before they're
>>> accessed for the last time. Perhaps either something in HDFS or Hadoop
>>> cleans up these temp directories, or perhaps there's a bug in Crunch's
>>> planner.
>>>
>>> 2) HDFS has chosen 3 machines to replicate data to, but it is performing
>>> a very lopsided replication. While the cluster overall looks like it has
>>> HDFS capacity, perhaps a small subset of the machines is actually at
>>> capacity. Things seem to fail in obscure ways when running out of disk.
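[Editor's sketch] Hypothesis (2) is directly checkable: a cluster-wide capacity number can hide one nearly-full datanode. A minimal sketch of dumping per-datanode usage via the HDFS client API follows -- this is the same information `hdfs dfsadmin -report` prints; `getDataNodeStats()` is assumed available on `DistributedFileSystem` for the Hadoop version in use, and the program needs the Hadoop jars plus cluster config on its classpath.

```java
// Print per-datanode DFS usage to spot lopsided replication: a cluster can
// look healthy in aggregate while one node is at capacity and failing writes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DatanodeUsage {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS etc. from core-site.xml / hdfs-site.xml.
    FileSystem fs = FileSystem.get(new Configuration());
    DistributedFileSystem dfs = (DistributedFileSystem) fs;
    for (DatanodeInfo dn : dfs.getDataNodeStats()) {
      System.out.printf("%-30s used=%5.1f%% remaining=%,d MB%n",
          dn.getHostName(),
          dn.getDfsUsedPercent(),
          dn.getRemaining() / (1024L * 1024));
    }
  }
}
```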
>>>
>>>
>>> 2015-09-24 23:28:58,850 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>   at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-2031291770/p567/REDUCE
>>>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>
>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>   at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>   at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>   at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>   at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>   at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>   at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>   at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>   at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>   at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>   at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>   at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>   at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>   ... 9 more
>>> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /tmp/crunch-2031291770/p567/REDUCE
>>>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>
>>>   at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>>>   at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>   at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>   at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
>>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:219)
>>>   at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1145)
>>>   ... 22 more
>>>
>>>
>>> On Fri, Aug 21, 2015 at 3:52 PM, Jeff Quinn wrote:
>>>
>>>> Also worth noting, we inspected the hadoop configuration defaults that
>>>> the AWS EMR service populates for the two different instance types; for
>>>> mapred-site.xml, core-site.xml, and hdfs-site.xml, all settings were
>>>> identical, with the exception of slight differences in the JVM memory
>>>> allotted. We further investigated the max number of file descriptors for
>>>> each instance type via ulimit, and saw no differences there either.
>>>>
>>>> So I'm not sure what the main difference is between these two clusters
>>>> that would cause these very different outcomes, other than cc2.8xlarge
>>>> having spinning disks and c3.8xlarge having SSDs.
>>>>
>>>> On Fri, Aug 21, 2015 at 1:03 PM, Everett Anderson wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> Jeff graciously agreed to try it out.
>>>>>
>>>>> I'm afraid we're still getting failures on that instance type with
>>>>> 0.11 plus the patches, and the cluster ended up in a state where no new
>>>>> applications could be submitted afterwards.
>>>>>
>>>>> The errors when running the pipeline seem to be similarly HDFS
>>>>> related. It's quite odd.
>>>>>
>>>>> Examples when using 0.11 + the patches:
>>>>>
>>>>>
>>>>> 2015-08-20 23:17:50,455 WARN [Thread-38] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001" - Aborting...
>>>>>
>>>>>
>>>>> 2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167 (inode 83784): File does not exist. [Lease. Holder: DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1, pendingcreates: 24]
>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
>>>>>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
>>>>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
>>>>>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
>>>>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>>>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>>>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>>>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>>>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>>>>>
>>>>>   at org.apache.hadoop.ipc.Client.call(Client.java:1468)
>>>>>   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>>>>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
>>>>>   at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
>>>>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
>>>>>   at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>>>>>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>>>>>   at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>> 2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167" - Aborting...
>>>>>
>>>>>
>>>>> 2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>>>>> java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>> 2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Abandoning BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
>>>>> 2015-08-20 23:34:59,278 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.55.1.103:50010
>>>>> 2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>>>> java.io.IOException: Unable to create new block.
>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>> 2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001" - Aborting...
>>>>> 2015-08-20 23:34:59,279 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.crunch.CrunchRuntimeException: java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
>>>>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>   at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>>>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
>>>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>>>>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
>>>>> Caused by: java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills wrote:
>>>>>
>>>>>> Curious how this went. :)
>>>>>>
>>>>>> On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson wrote:
>>>>>>
>>>>>>> Sure, let me give it a try. I'm going to take 0.11 and patch it with
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/CRUNCH-553
>>>>>>> https://issues.apache.org/jira/browse/CRUNCH-517
>>>>>>>
>>>>>>> as we also rely on 517.
>>>>>>>
>>>>>>> On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills wrote:
>>>>>>>
>>>>>>>> (In particular, I'm wondering if something in CRUNCH-481 is related
>>>>>>>> to this problem.)
>>>>>>>>
>>>>>>>> On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills wrote:
>>>>>>>>
>>>>>>>>> Hey Everett,
>>>>>>>>>
>>>>>>>>> Shot in the dark -- would you mind trying it w/ 0.11.0-hadoop2 w/
>>>>>>>>> the 553 patch? Is that easy to do?
>>>>>>>>>
>>>>>>>>> J
>>>>>>>>>
>>>>>>>>> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <everett@nuna.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I verified that the pipeline succeeds on the same cc2.8xlarge
>>>>>>>>>> hardware when setting crunch.max.running.jobs to 1. I generally
>>>>>>>>>> feel that the pipeline application logic itself is sound at this
>>>>>>>>>> point. It could be that this is just taxing these machines too
>>>>>>>>>> hard and we need to increase the number of retries?
>>>>>>>>>>
>>>>>>>>>> It reliably fails on this hardware when crunch.max.running.jobs is
>>>>>>>>>> set to its default.
>>>>>>>>>>
>>>>>>>>>> Can you explain a little what the /tmp/crunch-XXXXXXX files are,
>>>>>>>>>> as well as how Crunch uses side effect files? Do you know if HDFS
>>>>>>>>>> would clean up those directories from underneath Crunch?
>>>>>>>>>>
>>>>>>>>>> There are usually 4 failed applications, failing due to reduces.
>>>>>>>>>> The failures seem to be one of the following three kinds -- (1)
>>>>>>>>>> no lease on the file, (2) file not found, (3)
>>>>>>>>>> SocketTimeoutException.
>>>>>>>>>>
>>>>>>>>>> Examples:
>>>>>>>>>>
>>>>>>>>>> [1] No lease exception
>>>>>>>>>>
>>>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003: File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have any open files.
>>>>>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>>>>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>>>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>>>>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>>>>>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>>>>>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>>>>>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>>>>>>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>>>>>>   at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>>>>>>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>>>>>>>>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>>>>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>>>>>>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>>>>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>>>>>>> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003: File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have any open files.
>>>>>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>>>>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>>>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>>>>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>>>>>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>>>>>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>>>>>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>>>>>>>   at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>>>>>>>>>>   at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>>>>>>>>>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>>>>>>>>   at com.sun.proxy.$Proxy13.complete(Unknown Source)
>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>>>>>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>>>>>>>>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>>>>>>>>   at com.sun.proxy.$Proxy13.complete(Unknown Source)
>>>>>>>>>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>>>>>>>>>   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>>>>>>>>>   at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>>>>>>>>>   at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>>>>>>>>>   at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>>>>>>>>>   at org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>>>>>>>>>   at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180)
>>>>>>>>>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>>>>>>>>>   ... 9 more
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [2] File does not exist
>>>>>>>>>>
>>>>>>>>>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>>>>>>>>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>>>>>>>>   at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>>>>>>>>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>>>>>>>>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>>>>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>>>>>>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>>>>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>>>>>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>>>>>>>>>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>>>>>>>>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>>>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>>>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>>>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>>>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>>>>>>>>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>>>>>>>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>>>>>>>>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>>>>>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>>>>>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>>>>>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>>>>>>>
>>>>>>>>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>>>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>>>>>>>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>>>>>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>>>>>>>   at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>>>>>>>>   at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>>>>>>>>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>>>>>>>>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>>>>>>>>   at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>>>>>>>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>>>>>>>>   ... 9 more
>>>>>>>>>>
>>>>>>>>>> [3] SocketTimeoutException
>>>>>>>>>>
>>>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>>>>>>>>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>>>>>>   at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>>>>>>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>>>>>>>>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>>>>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>>>>>>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>>>>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>>>>>>> Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>>>>>>>>>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>>>>>>>>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>>>>>>>>>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>>>>>>>>>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>>>>>>>>>>   at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>>>>>>>   at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>>>>>>>   at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
>>>>>>>>>>   at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStre= am.java:491) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson < >>>>>>>>>> everett@nuna.com> wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills >>>>>>>>>> > wrote: >>>>>>>>>>> >>>>>>>>>>>> Hey Everett, >>>>>>>>>>>> >>>>>>>>>>>> Initial thought-- there are lots of reasons for lease expired >>>>>>>>>>>> exceptions, and their usually more symptomatic of other proble= ms in the >>>>>>>>>>>> pipeline. Are you sure none of the jobs in the Crunch pipeline= on the >>>>>>>>>>>> non-SSD instances are failing for some other reason? I'd be su= rprised if no >>>>>>>>>>>> other errors showed up in the app master, although there are r= eports of >>>>>>>>>>>> some weirdness around LeaseExpireds when writing to S3-- but y= ou're not >>>>>>>>>>>> doing that here, right? >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> We're reading from and writing to HDFS, here. (We've copied in >>>>>>>>>>> input from S3 to HDFS in another step.) >>>>>>>>>>> >>>>>>>>>>> There are a few exceptions in the logs. Most seem related to >>>>>>>>>>> missing temp files. >>>>>>>>>>> >>>>>>>>>>> Let me see if I can reproduce it with crunch.max.running.jobs >>>>>>>>>>> set to 1 to try to narrow down the originating failure. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> J >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson < >>>>>>>>>>>> everett@nuna.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> I recently started trying to run our Crunch pipeline on more >>>>>>>>>>>>> data and have been trying out different AWS instance types in= anticipation >>>>>>>>>>>>> of our storage and compute needs. 
I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with the CRUNCH-553 fix).

Our pipeline finishes fine in these cluster configurations:

- 50 c3.4xlarge Core, 0 Task
- 10 c3.8xlarge Core, 0 Task
- 25 c3.8xlarge Core, 0 Task

However, it always fails on the same data when using 10 cc2.8xlarge Core instances.

The biggest obvious hardware difference is that the cc2.8xlarges use hard disks instead of SSDs.

While it's a little hard to track down the exact originating failure, I think it's from errors like:

2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1439499407003_0028_r_000153_1 - exited : org.apache.crunch.CrunchRuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153: File does not exist. Holder DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have any open files.

Those paths look like these side effect files.

Would Crunch have generated applications that depend on side effect paths as input across MapReduce applications and something in HDFS is cleaning up those paths, unaware of the higher level dependencies?
AWS configures Hadoop differently for each instance type, and might have more aggressive cleanup settings on HDs, though this is a very uninformed hypothesis.

A sample full log is attached.

Thanks for any guidance!

- Everett

*DISCLAIMER:* The contents of this email, including any attachments, may contain information that is confidential, proprietary in nature, protected health information (PHI), or otherwise protected by law from disclosure, and is solely for the use of the intended recipient(s). If you are not the intended recipient, you are hereby notified that any use, disclosure or copying of this email, including any attachments, is unauthorized and strictly prohibited. If you have received this email in error, please notify the sender of this email. Please delete this and all copies of this email from your system. Any opinions either expressed or implied in this email and all attachments, are those of its author only, and do not necessarily reflect those of Nuna Health, Inc.

--
Director of Data Science
Cloudera
Twitter: @josh_wills
Yeah, that makes sense to me-- not totally trivial to do, but it should be possible.

J

On Tue, Sep 29, 2015 at 4:42 PM, Everett Anderson <everett@nuna.com> wrote:
Hey,

We have some leads. Increasing the datanode memory seems to help the immediate issue.

However, we need a solution to our buildup of temporary outputs. We're exploring segmenting our pipeline with run()/cleanup() calls.

I'm curious, though --
Do you think it'd be possible for us to make a Crunch modification to optionally actively clean up temporary outputs? It seems like the planner would know what those are.

A temporary output would be any PCollection that isn't referenced outside of Crunch (or perhaps ones that aren't explicitly marked as cached).


On Thu, Sep 24, 2015 at 5:46 PM, Josh Wills <jwills@cloudera.com> wrote:
Hrm. If you never call Pipeline.done, you should never clean up the temporary files for the job...
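To make the run()/done() distinction concrete, here is a minimal sketch of a segmented pipeline. This is a hypothetical illustration, not code from the thread; the class name and argument handling are made up, and it assumes the standard Crunch MRPipeline API (run() executes the jobs planned so far and leaves /tmp/crunch-* intact, while done() also deletes those temporary directories):

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;

public class SegmentedPipeline {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(SegmentedPipeline.class);

    PCollection<String> lines = pipeline.readTextFile(args[0]);
    // ... first stage of processing, writing intermediate outputs ...

    pipeline.run();   // runs the jobs planned so far; temp files stay alive

    // ... later stages that may still read earlier temp outputs ...

    pipeline.done();  // runs any remaining jobs AND removes /tmp/crunch-* dirs
  }
}
```

The point Josh is making above is the flip side: if done() is never called, Crunch itself should never be the thing deleting those temp directories mid-run.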

On Thu, Sep 24, 2015 at 5:44 PM, Everett Anderson <everett@nuna.com> wrote:
While we tried to take comfort in the fact that we'd only seen this on HD-based cc2.8xlarges, I'm afraid we're now seeing it when processing larger amounts of data on SSD-based c3.4xlarges.

My two hypotheses are
1) Somehow these temp files are getting cleaned up before they're accessed for the last time. Perhaps either something in HDFS or Hadoop cleans up these temp directories, or perhaps there's a bug in Crunch's planner.

2) HDFS has chosen 3 machines to replicate data to, but it is performing a very lopsided replication. While the cluster overall looks like it has HDFS capacity, perhaps a small subset of the machines is actually at capacity. Things seem to fail in obscure ways when running out of disk.
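One quick way to test the lopsided-replication hypothesis (not something done in the thread) is to compare per-datanode usage with the standard HDFS admin report; the grep pattern here is just one way to trim its output and assumes the usual report field names:

```shell
# Compare disk usage across datanodes to see whether a small subset
# of machines is near capacity while the cluster as a whole looks fine.
# Run on a node with the hdfs CLI and access to the NameNode.
hdfs dfsadmin -report | grep -E 'Name:|DFS Used%|DFS Remaining'
```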


2015-09-24 23:28:58,850 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-2031291770/p567/REDUCE
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
	... 9 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /tmp/crunch-2031291770/p567/REDUCE
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:219)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1145)
	... 22 more
On Fri, Aug 21, 2015 at 3:52 PM, Jeff Quinn <jeff@nuna.com> wrote:
Also worth noting, we inspected the hadoop configuration defaults that the AWS EMR service populates for the two different instance types; for mapred-site.xml, core-site.xml, and hdfs-site.xml all settings were identical, with the exception of slight differences in JVM memory allotted. We further investigated the max number of file descriptors for each instance type via ulimit, and saw no differences there either.

So not sure what the main difference is between these two clusters that would cause these very different outcomes, other than cc2.8xlarge having spinning disks and c3.8xlarge having SSDs.

On Fri, Aug 21, 2015 at 1:03 PM, Everett Anderson <everett@nuna.com> wrote:
Hey,

Jeff graciously agreed to try it out.

I'm afraid we're still getting failures on that instance type, though with 0.11 with the patches, the cluster ended up in a state that no new applications could be submitted afterwards.

The errors when running the pipeline seem to be similarly HDFS related. It's quite odd.
Examples when using 0.11 + the patches:


2015-08-20 23:17:50,455 WARN [Thread-38] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001" - Aborting...


2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167 (inode 83784): File does not exist. [Lease. Holder: DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1, pendingcreates: 24]
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

	at org.apache.hadoop.ipc.Client.call(Client.java:1468)
	at org.apache.hadoop.ipc.Client.call(Client.java:1399)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
	at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
	at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167" - Aborting...



2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Abandoning BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
2015-08-20 23:34:59,278 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.55.1.103:50010
2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Unable to create new block.
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001" - Aborting...
2015-08-20 23:34:59,279 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.crunch.CrunchRuntimeException: java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
Caused by: java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)







On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills <jwills@cloudera.com> wrote:
Curious how this went. :)

On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson <everett@nuna.com> wrote:
Sure, let me give it a try. I'm going to take 0.11 and patch it with


as we also rely on 517.



On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills <jwills@cloudera.com> wrote:
(In particular, I'm wondering if something in CRUNCH-481 is related to this problem.)
On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <jwills@cloudera.com> wrote:
Hey Everett,

Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the 553 patch? Is that easy to do?

J
On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <everett@nuna.com> wrote:
Hi,

I verified that the pipeline succeeds on the same cc2.8xlarge hardware when setting crunch.max.running.jobs to 1. I generally feel like the pipeline application logic itself is sound at this point. It could be that this is just taxing these machines too hard and we need to increase the number of retries?

It reliably fails on this hardware when crunch.max.running.jobs is set to its default.
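For reference, here is a minimal sketch of pinning crunch.max.running.jobs to 1 in code to serialize the MapReduce jobs while debugging. The class name is hypothetical, and this assumes the standard MRPipeline constructor that accepts a Hadoop Configuration:

```java
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

public class SerialJobsPipeline {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Limit Crunch to one concurrent MapReduce job so the first
    // failing job is easy to identify in the app master logs.
    conf.setInt("crunch.max.running.jobs", 1);
    Pipeline pipeline = new MRPipeline(SerialJobsPipeline.class, conf);
    // ... build and run the pipeline ...
  }
}
```

If the driver goes through ToolRunner/GenericOptionsParser, the same property can be passed on the command line as -Dcrunch.max.running.jobs=1 instead.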
Can you explain a little what the /tmp/crunch-XXXXXXX files are, as well as how Crunch uses side effect files? Do you know if HDFS would clean up those directories from underneath Crunch?

There are usually 4 failed applications, failing due to reduces. The failures seem to be one of the following three kinds -- (1) No lease on <side effect file>, (2) File not found </tmp/crunch-XXXXXXX> file, (3) SocketTimeoutException.

Examples:

[1] No lease exception

Error: org.apache.crunch.CrunchRuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003: File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have any open files.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003: File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have any open files.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
	at com.sun.proxy.$Proxy13.complete(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
	at com.sun.proxy.$Proxy13.complete(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
	at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
	at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
	at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
	at org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
	at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180)
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
	... 9 more


[2] File does not exist

2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
	... 9 more
[3] SocketTimeoutException
Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)












On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <everett@nuna.com> wrote:


On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jwills@cloudera.com> wrote:
Hey Everett,

Initial thought-- there are lots of reasons for lease expired exceptions, and they're usually more symptomatic of other problems in the pipeline. Are you sure none of the jobs in the Crunch pipeline on the non-SSD instances are failing for some other reason? I'd be surprised if no other errors showed up in the app master, although there are reports of some weirdness around LeaseExpireds when writing to S3-- but you're not doing that here, right?

We're reading from and writing to HDFS, here. (We've copied in input from S3 to HDFS in another step.)

There are a few exceptions in the logs. Most seem related to missing temp files.

Let me see if I can reproduce it with crunch.max.running.jobs set to 1 to try to narrow down the originating failure.
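(For anyone following along, a hedged sketch of what that run might look like -- the jar name and main class below are placeholders, and it assumes the pipeline driver parses generic Hadoop options, e.g. via ToolRunner, so -D settings reach its Configuration:

```shell
# Serialize Crunch's MapReduce jobs so the first real failure surfaces
# clearly instead of being interleaved with concurrent jobs.
# "pipeline.jar" and "com.example.MyPipeline" are illustrative names.
hadoop jar pipeline.jar com.example.MyPipeline \
    -D crunch.max.running.jobs=1 \
    hdfs:///input hdfs:///output
```
)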



J

On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <everett@nuna.com> wrote:
Hi,

I recently started trying to run our Crunch pipeline on more data and have been trying out different AWS instance types in anticipation of our storage and compute needs.

I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with the CRUNCH-553 fix).

Our pipeline finishes fine in these cluster configurations:
  • 50 c3.4xlarge Core, 0 Task
  • 10 c3.8xlarge Core, 0 Task
  • 25 c3.8xlarge Core, 0 Task
However, it always fails on the same data when using 10 cc2.8xlarge Core instances.

The biggest obvious hardware difference is that the cc2.8xlarges use hard disks instead of SSDs.
While it's a little hard to track down the exact originating failure, I think it's from errors like:

2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1439499407003_0028_r_000153_1 - exited : org.apache.crunch.CrunchRuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153: File does not exist. Holder DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have any open files.

Those paths look like these side effect files.

Would Crunch have generated applications that depend on side effect paths as input across MapReduce applications, with something in HDFS cleaning up those paths, unaware of the higher-level dependencies? AWS configures Hadoop differently for each instance type, and might have more aggressive cleanup settings on HDs, though this is a very uninformed hypothesis.
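(One rough way to test that hypothesis would be to watch the pipeline's temp directory while it runs and see whether anything under it disappears before the final job completes -- a sketch, with the crunch-* path taken from the error above as an illustrative example:

```shell
# Recursively list the Crunch temp outputs; rerun periodically and
# diff the listings to spot files vanishing mid-pipeline.
hadoop fs -ls -R /tmp/crunch-970849245 | head -n 50

# Track total size over time; a sudden drop before the pipeline
# finishes would point at premature cleanup.
hadoop fs -du -s -h /tmp/crunch-970849245
```
)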

A sample full log is attached.

Thanks for any guidance!

- Everett


DISCLAIMER: The contents of this email, including any attachments, may contain information that is confidential, proprietary in nature, protected health information (PHI), or otherwise protected by law from disclosure, and is solely for the use of the intended recipient(s). If you are not the intended recipient, you are hereby notified that any use, disclosure or copying of this email, including any attachments, is unauthorized and strictly prohibited. If you have received this email in error, please notify the sender of this email. Please delete this and all copies of this email from your system. Any opinions either expressed or implied in this email and all attachments, are those of its author only, and do not necessarily reflect those of Nuna Health, Inc.


--
Director of Data Science






















