Date: Mon, 14 Nov 2016 09:53:58 +0000 (UTC)
From: "Sean Owen (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-18414) sc.textFile doesn't seem to use LzoTextInputFormat when hadoop-lzo is installed

    [ https://issues.apache.org/jira/browse/SPARK-18414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15663327#comment-15663327 ]

Sean Owen commented on SPARK-18414:
-----------------------------------

I suppose it depends on how common this is.
Core Hadoop and therefore Spark already support common compression codecs out of the box, and I think Spark would just inherit Hadoop's support unless there was a big reason to add something further. In this case, if it requires GPL code, Spark can't ship it directly anyway. You can, however, add it to your app if you like, and that seems like it might be sufficient given the use case now.

> sc.textFile doesn't seem to use LzoTextInputFormat when hadoop-lzo is installed
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18414
>                 URL: https://issues.apache.org/jira/browse/SPARK-18414
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.1
>            Reporter: Renan Vicente Gomes da Silva
>            Priority: Minor
>              Labels: hadoop-lzo
>
> When reading LZO files using sc.textFile, it misses a few files from time to time.
> Sample:
>       val Data = sc.textFile(Files)
>       listFiles += Data.count()
> Here Files is an HDFS directory containing LZO files. If this is executed, for example, 1000 times, it gets different results a few times.
> If you instead use newAPIHadoopFile to force it to use com.hadoop.mapreduce.LzoTextInputFormat, it works perfectly and shows the same results in all executions.
> Sample:
>       val Data = sc.newAPIHadoopFile(Files,
>         classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>         classOf[org.apache.hadoop.io.LongWritable],
>         classOf[org.apache.hadoop.io.Text]).map(_._2.toString)
>       listFiles += Data.count()
> Looking at the Spark code, it looks like it uses TextInputFormat by default and does not use com.hadoop.mapreduce.LzoTextInputFormat when hadoop-lzo is installed.
> https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/SparkContext.scala#L795-L801
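
For anyone wanting to apply the application-side workaround discussed above, here is a minimal sketch that wraps the reporter's newAPIHadoopFile call in a reusable helper. It assumes the hadoop-lzo jar (which provides com.hadoop.mapreduce.LzoTextInputFormat) is already on the application's classpath. The names LzoText, lzoTextFile and countsAgree are hypothetical, chosen only for illustration, and are not part of Spark or hadoop-lzo.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
// Provided by hadoop-lzo, not by Spark; assumed to be on the application classpath.
import com.hadoop.mapreduce.LzoTextInputFormat

// Hypothetical helper object, for illustration only.
object LzoText {

  /** Reads the LZO text files under `path` as lines, forcing LzoTextInputFormat. */
  def lzoTextFile(sc: SparkContext, path: String): RDD[String] =
    sc.newAPIHadoopFile(
        path,
        classOf[LzoTextInputFormat],
        classOf[LongWritable],
        classOf[Text])
      // Hadoop record readers may reuse Writable instances,
      // so convert each value to an immutable String right away.
      .map { case (_, line) => line.toString }

  /** True when all repeated counts are identical (or the sequence is empty). */
  def countsAgree(counts: Seq[Long]): Boolean =
    counts.distinct.size <= 1
}
```

With a SparkContext `sc` and the `Files` directory from the report, a repeated-count check along the lines of the reporter's experiment could then be written as `LzoText.countsAgree((1 to 10).map(_ => LzoText.lzoTextFile(sc, Files).count()))`.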