Subject: Re: LeaseExpiredExceptions and temp side effect files
From: Everett Anderson
To: user@crunch.apache.org
Cc: Jeff Quinn
Date: Fri, 21 Aug 2015 13:03:38 -0700

Hey,

Jeff graciously agreed to try it out.
I'm afraid we're still getting failures on that instance type. With 0.11 plus the patches, the cluster also ended up in a state where no new applications could be submitted afterwards.

The errors when running the pipeline seem to be similarly HDFS related. It's quite odd.

Examples when using 0.11 + the patches:

2015-08-20 23:17:50,455 WARN [Thread-38] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001" - Aborting...

2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167 (inode 83784): File does not exist. [Lease.  Holder: DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1, pendingcreates: 24]
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
        at org.apache.hadoop.ipc.Client.call(Client.java:1468)
        at org.apache.hadoop.ipc.Client.call(Client.java:1399)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
        at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
        at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)

2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167" - Aborting...
2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Abandoning BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
2015-08-20 23:34:59,278 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.55.1.103:50010
2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Unable to create new block.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001" - Aborting...
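(Side note: since the "Bad connect ack" / "Excluding datanode" errors suggest datanodes dropping out of write pipelines under load, one thing we may try is loosening the HDFS client's pipeline-recovery settings. A hedged sketch only -- property names are from Hadoop 2.x's hdfs-default.xml as I understand it, and the values below are illustrative guesses, not recommendations:)

```xml
<!-- Sketch: client-side HDFS settings sometimes tuned when datanodes
     drop out of write pipelines under heavy load. Illustrative values. -->
<property>
  <name>dfs.client.block.write.retries</name>
  <value>10</value> <!-- the stock default is 3 -->
</property>
<property>
  <!-- governs whether a failed datanode in the pipeline gets replaced -->
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>ALWAYS</value>
</property>
```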
2015-08-20 23:34:59,279 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.crunch.CrunchRuntimeException: java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
        at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
        at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
Caused by: java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)

On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills wrote:

> Curious how this went. :)
>
> On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson wrote:
>
>> Sure, let me give it a try. I'm going to take 0.11 and patch it with
>>
>> https://issues.apache.org/jira/browse/CRUNCH-553
>> https://issues.apache.org/jira/browse/CRUNCH-517
>>
>> as we also rely on 517.
>>
>> On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills wrote:
>>
>>> (In particular, I'm wondering if something in CRUNCH-481 is related to
>>> this problem.)
>>>
>>> On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills wrote:
>>>
>>>> Hey Everett,
>>>>
>>>> Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the 553
>>>> patch?
>>>> Is that easy to do?
>>>>
>>>> J
>>>>
>>>> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I verified that the pipeline succeeds on the same cc2.8xlarge hardware
>>>>> when setting crunch.max.running.jobs to 1. I generally feel like the
>>>>> pipeline application logic itself is sound, at this point. It could be
>>>>> that this is just taxing these machines too hard and we need to
>>>>> increase the number of retries?
>>>>>
>>>>> It reliably fails on this hardware when crunch.max.running.jobs is set
>>>>> to its default.
>>>>>
>>>>> Can you explain a little what the /tmp/crunch-XXXXXXX files are, as
>>>>> well as how Crunch uses side effect files? Do you know if HDFS would
>>>>> clean up those directories from underneath Crunch?
>>>>>
>>>>> There are usually 4 failed applications, failing due to reduces. The
>>>>> failures seem to be one of the following three kinds: (1) No lease on
>>>>> ..., (2) File not found, (3) SocketTimeoutException.
>>>>>
>>>>> Examples:
>>>>>
>>>>> [1] No lease exception
>>>>>
>>>>> Error: org.apache.crunch.CrunchRuntimeException:
>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>> No lease on
>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>> File does not exist. Holder
>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not
>>>>> have any open files.
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>> at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>> at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>> No lease on
>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>> File does not exist. Holder
>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not
>>>>> have any open files. at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>>>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source)
>>>>> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>>>> at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>>>> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>>>> at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>>>> at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>>>> at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>>>> at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>>>> at org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>>>> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180)
>>>>> at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>>>> ... 9 more
>>>>>
>>>>> [2] File does not exist
>>>>>
>>>>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>>>> at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>>> at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>>>> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>>> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>>> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>>> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>>> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>>
>>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>> at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>>> at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>>> at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>>> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>>> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>>> at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>>> at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>>> at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>>> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>>> at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>>> at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>>> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>>> at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>>> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>>> at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>>> at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>>> ... 9 more
>>>>>
>>>>> [3] SocketTimeoutException
>>>>>
>>>>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>>>> at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>> at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>> Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>>>> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>>>> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>>>>> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>>>>> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>>>>> at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>> at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>> at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
>>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
>>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
>>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
>>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
>>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>>>>>
>>>>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson wrote:
>>>>>
>>>>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills wrote:
>>>>>>
>>>>>>> Hey Everett,
>>>>>>>
>>>>>>> Initial thought-- there are lots of reasons for lease expired
>>>>>>> exceptions, and they're usually more symptomatic of other problems in
>>>>>>> the pipeline. Are you sure none of the jobs in the Crunch pipeline on
>>>>>>> the non-SSD instances are failing for some other reason? I'd be
>>>>>>> surprised if no other errors showed up in the app master, although
>>>>>>> there are reports of some weirdness around LeaseExpireds when writing
>>>>>>> to S3-- but you're not doing that here, right?
>>>>>>
>>>>>> We're reading from and writing to HDFS, here.
>>>>>> (We've copied the input from S3 to HDFS in another step.)
>>>>>>
>>>>>> There are a few exceptions in the logs. Most seem related to missing
>>>>>> temp files.
>>>>>>
>>>>>> Let me see if I can reproduce it with crunch.max.running.jobs set to
>>>>>> 1 to try to narrow down the originating failure.
>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I recently started trying to run our Crunch pipeline on more data
>>>>>>>> and have been trying out different AWS instance types in
>>>>>>>> anticipation of our storage and compute needs.
>>>>>>>>
>>>>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched
>>>>>>>> with the CRUNCH-553 fix).
>>>>>>>>
>>>>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>>>>
>>>>>>>> - 50 c3.4xlarge Core, 0 Task
>>>>>>>> - 10 c3.8xlarge Core, 0 Task
>>>>>>>> - 25 c3.8xlarge Core, 0 Task
>>>>>>>>
>>>>>>>> However, it always fails on the same data when using 10 cc2.8xlarge
>>>>>>>> Core instances.
>>>>>>>>
>>>>>>>> The biggest obvious hardware difference is that the cc2.8xlarges
>>>>>>>> use hard disks instead of SSDs.
>>>>>>>>
>>>>>>>> While it's a little hard to track down the exact originating
>>>>>>>> failure, I think it's from errors like:
>>>>>>>>
>>>>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>> No lease on
>>>>>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>>>>> File does not exist. Holder
>>>>>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not
>>>>>>>> have any open files.
>>>>>>>>
>>>>>>>> Those paths look like these side effect files.
>>>>>>>>
>>>>>>>> Would Crunch have generated applications that depend on side effect
>>>>>>>> paths as input across MapReduce applications, and something in HDFS
>>>>>>>> is cleaning up those paths, unaware of the higher-level
>>>>>>>> dependencies? AWS configures Hadoop differently for each instance
>>>>>>>> type, and might have more aggressive cleanup settings on HDs, though
>>>>>>>> this is a very uninformed hypothesis.
>>>>>>>>
>>>>>>>> A sample full log is attached.
>>>>>>>>
>>>>>>>> Thanks for any guidance!
>>>>>>>>
>>>>>>>> - Everett
>>>>>>>>
>>>>>>>> *DISCLAIMER:* The contents of this email, including any
>>>>>>>> attachments, may contain information that is confidential,
>>>>>>>> proprietary in nature, protected health information (PHI), or
>>>>>>>> otherwise protected by law from disclosure, and is solely for the
>>>>>>>> use of the intended recipient(s). If you are not the intended
>>>>>>>> recipient, you are hereby notified that any use, disclosure or
>>>>>>>> copying of this email, including any attachments, is unauthorized
>>>>>>>> and strictly prohibited. If you have received this email in error,
>>>>>>>> please notify the sender of this email. Please delete this and all
>>>>>>>> copies of this email from your system. Any opinions either
>>>>>>>> expressed or implied in this email and all attachments, are those
>>>>>>>> of its author only, and do not necessarily reflect those of Nuna
>>>>>>>> Health, Inc.
>>>>>>>
>>>>>>> --
>>>>>>> Director of Data Science
>>>>>>> Cloudera
>>>>>>> Twitter: @josh_wills
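P.S. For the archives: the only configuration that has reliably completed for us on the HD-backed instances is capping Crunch at one running MapReduce job. Assuming the pipeline's main class goes through ToolRunner so generic -D options land in the job Configuration (the jar and class names below are placeholders, not our real ones), that looks something like:

```shell
# Run the pipeline with Crunch's job parallelism capped at 1, so concurrent
# MapReduce jobs (and their /tmp/crunch-* side effect files) don't overlap.
# "pipeline.jar" and "com.example.Pipeline" are hypothetical placeholders.
hadoop jar pipeline.jar com.example.Pipeline \
  -D crunch.max.running.jobs=1 \
  hdfs:///input hdfs:///output
```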
Hey,

Jeff graciously agreed to try it out.

I'm afraid we're still getting failures on that instance type, and with 0.11 plus the patches, the cluster ended up in a state where no new applications could be submitted afterwards.

The errors when running the pipeline seem to be similarly HDFS-related. It's quite odd.

Examples when using 0.11 + the patches:


2015-08-20 23:17:50,455 WARN [Thread-38] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001" - Aborting...

2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167 (inode 83784): File does not exist. [Lease.  Holder: DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1, pendingcreates: 24]
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

	at org.apache.hadoop.ipc.Client.call(Client.java:1468)
	at org.apache.hadoop.ipc.Client.call(Client.java:1399)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
	at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
	at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167" - Aborting...

2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Abandoning BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
2015-08-20 23:34:59,278 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.55.1.103:50010
2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Unable to create new block.
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001" - Aborting...

2015-08-20 23:34:59,279 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.crunch.CrunchRuntimeException: java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
Caused by: java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)









On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills <jwills@cloudera.com> wrote:
Curious how this went. :)

On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson <everett@nuna.com> wrote:
Sure, let me give it a try. I'm going to take 0.11 and patch it with


as we also rely on 517.



On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills <jwills@cloudera.com> wrote:
(In particular, I'm wondering if something in CRUNCH-481 is related to this problem.)
On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <jwills@cloudera.com> wrote:
Hey Everett,

Shot in the dark -- would you mind trying it w/0.11.0-hadoop2 w/the 553 patch? Is that easy to do?

J
On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <everett@nuna.com> wrote:
Hi,

I verified that the pipeline succeeds on the same cc2.8xlarge hardware when setting crunch.max.running.jobs to 1. I generally feel like the pipeline application logic itself is sound at this point. Could it be that this is just taxing these machines too hard and we need to increase the number of retries?

It reliably fails on this hardware when crunch.max.running.jobs is set to its default.
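For reference, pinning the parallelism is just a client-side property; a run along these lines is one way to do it. This is only a sketch: the property name is from this thread, but the jar, main class, and paths are hypothetical placeholders, and passing `-D` this way assumes the driver goes through ToolRunner/GenericOptionsParser.

```shell
# Hypothetical launch with Crunch's job-parallelism property pinned to 1,
# so only one MapReduce job of the pipeline runs at a time.
# Jar name, main class, and HDFS paths are placeholders.
hadoop jar our-pipeline.jar com.example.OurPipeline \
  -D crunch.max.running.jobs=1 \
  hdfs:///input hdfs:///output
```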
Can you explain a little what the /tmp/crunch-XXXXXXX files are, as well as how Crunch uses side effect files? Do you know if HDFS would clean up those directories from underneath Crunch?

There are usually 4 failed applications, failing in the reduces. The failures seem to be one of the following three kinds -- (1) No lease on <side effect file>, (2) File does not exist under /tmp/crunch-XXXXXXX, (3) SocketTimeoutException.

Examples:

[1] No lease exception

Error: org.apache.crunch.CrunchRuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003: File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have any open files.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003: File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have any open files.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
	at com.sun.proxy.$Proxy13.complete(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
	at com.sun.proxy.$Proxy13.complete(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
	at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
	at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
	at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
	at org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
	at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180)
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
	... 9 more


[2] File does not exist

2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
	... 9 more
[3] SocketTimeoutException
Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
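For what it's worth, the three signatures above are easy to tally from an aggregated log dump. A rough triage sketch (for illustration it writes a tiny sample log first; point LOG at a real YARN task-log dump instead -- the file name is hypothetical):

```shell
# Count each of the three failure signatures from this thread in a log dump.
LOG=task-logs.txt
# Sample data only, so the snippet is self-contained:
cat > "$LOG" <<'EOF'
namenode.LeaseExpiredException): No lease on /tmp/crunch-4694113/p662/out7-r-00003
Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
Caused by: java.net.SocketTimeoutException: 70000 millis timeout
EOF
# grep -c prints the number of matching lines per signature
for sig in 'LeaseExpiredException' 'File does not exist' 'SocketTimeoutException'; do
  printf '%s: %s\n' "$sig" "$(grep -c "$sig" "$LOG")"
done
```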












On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <everett@nuna.com> wrote:


On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jwills@cloudera.com> wrote:
Hey Everett,

Initial thought -- there are lots of reasons for lease expired exceptions, and they're usually more symptomatic of other problems in the pipeline. Are you sure none of the jobs in the Crunch pipeline on the non-SSD instances are failing for some other reason? I'd be surprised if no other errors showed up in the app master, although there are reports of some weirdness around LeaseExpireds when writing to S3 -- but you're not doing that here, right?

We're reading from and writing to HDFS here. (We've copied the input from S3 to HDFS in another step.)
There are a few exceptions in the logs. Most seem related to missing temp files.

Let me see if I can reproduce it with crunch.max.running.jobs set to 1 to try to narrow down the originating failure.



J

On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <everett@nuna.com> wrote:
Hi,
I recently started trying to run our Crunch pipeline on more data and have been trying out different AWS instance types in anticipation of our storage and compute needs.

I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with the CRUNCH-553 fix).

Our pipeline finishes fine in these cluster configurations:
  • 50 c3.4xlarge Core, 0 Task
  • 10 c3.8xlarge Core, 0 Task
  • 25 c3.8xlarge Core, 0 Task
However, it always fails on the same data when using 10 cc2.8xlarge Core instances.

The biggest obvious hardware difference is that the cc2.8xlarges use hard disks instead of SSDs.
While it's a little hard to track down the exact originating failure, I think it's from errors like:

2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1439499407003_0028_r_000153_1 - exited : org.apache.crunch.CrunchRuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153: File does not exist. Holder DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have any open files.

Those paths look like these side effect files.

Would Crunch have generated applications that depend on side effect paths as input across MapReduce applications, and something in HDFS is cleaning up those paths, unaware of the higher-level dependencies? AWS configures Hadoop differently for each instance type, and might have more aggressive cleanup settings on HDs, though this is a very uninformed hypothesis.
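One low-effort way to probe that hypothesis would be to dump cleanup-related HDFS settings on both cluster types and diff them. This is only a sketch: `fs.trash.interval` is one standard Hadoop key that could plausibly differ; whether EMR actually tunes it per instance type is an open question here.

```shell
# Compare a cleanup-related HDFS setting between the SSD and HD clusters.
# fs.trash.interval is a standard Hadoop key; run this on each cluster
# and diff the output (other candidate keys would need the same check).
hdfs getconf -confKey fs.trash.interval
```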

A sample full log is attached.

Thanks for any guidance!

- Everett


DISCLAIMER: The contents of this email, including any attachments, may contain information that is confidential, proprietary in nature, protected health information (PHI), or otherwise protected by law from disclosure, and is solely for the use of the intended recipient(s). If you are not the intended recipient, you are hereby notified that any use, disclosure or copying of this email, including any attachments, is unauthorized and strictly prohibited. If you have received this email in error, please notify the sender of this email. Please delete this and all copies of this email from your system. Any opinions either expressed or implied in this email and all attachments, are those of its author only, and do not necessarily reflect those of Nuna Health, Inc.


--
Director of Data Science














