From: Dale Richardson
To: "dev@spark.apache.org"
Subject: One corrupt gzip in a directory of 100s
Date: Sun, 29 Mar 2015 20:47:14 +0000

Recently I had an incident reported to me where somebody was analysing a directory of gzipped log files and was struggling to load them into Spark because one of the files was corrupted. Calling sc.textFile('hdfs:///logs/*.gz') caused an IOException on the particular executor that was reading that file, which caused the entire job to be cancelled after the retry count was exceeded, without any way of catching and recovering from the error. While normally I think it is entirely appropriate to stop execution if something is wrong with your input, sometimes it is useful to analyse what you can get (as long as you are aware that input has been skipped) and treat corrupt files as acceptable losses.

To cater for this particular case I've added SPARK-6593 (PR at https://github.com/apache/spark/pull/5250). It adds an option (spark.hadoop.ignoreInputErrors) that logs exceptions raised by the Hadoop input format but continues on with the next task.

Ideally in this case you would want to report the corrupt file paths back to the master so they could be dealt with in a particular way (e.g. moved to a separate directory), but that would require a public API change/addition. I was pondering an addition to Spark's Hadoop API that could report processing status back to the master via an optional accumulator that collects filepath/Option(exception message) tuples, so the user has some idea of which files are being processed and which files are being skipped.

Regards,
Dale.
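
To make the status-reporting idea concrete, here is a rough sketch in plain Python, outside Spark. It is not the code in the PR and not an existing Spark API: the function name read_gzip_logs, the FileStatus alias and the plain list standing in for the accumulator are all made up for illustration. It reads a set of gzip files, skips any file that fails to decompress, and records a (filepath, error message or None) tuple per file.

```python
import gzip
import zlib
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical status record: (file path, None if the file was read fully,
# otherwise the error message). This mirrors the filepath/Option(message) idea.
FileStatus = Tuple[str, Optional[str]]


def read_gzip_logs(paths: Iterable[str], statuses: List[FileStatus]) -> Iterator[str]:
    """Yield lines from each gzip file, skipping files that fail to decompress.

    Lines read before corruption is detected are still yielded, so a file that
    is corrupt part-way through contributes partial output. The statuses list
    is filled in as the generator is consumed.
    """
    for path in paths:
        try:
            with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    yield line.rstrip("\n")
        except (OSError, EOFError, zlib.error) as exc:
            # Bad header, truncated stream, or corrupt compressed data:
            # record the failure and move on to the next file.
            statuses.append((path, f"{type(exc).__name__}: {exc}"))
            continue
        statuses.append((path, None))
```

After the generator has been consumed, the entries in statuses whose second element is not None are the files that were skipped, which is the list you would want to move to a separate directory. In Spark the list would have to be replaced by something that travels back to the driver, such as the optional accumulator suggested above.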