Mailing-List: contact issues-help@spark.apache.org; run by ezmlm
Precedence: bulk
Date: Mon, 14 Nov 2016 09:53:58 +0000 (UTC)
From: "Sean Owen (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.13020142.1478882176000.267410.1479117238343@Atlassian.JIRA>
In-Reply-To: <JIRA.13020142.1478882176000@Atlassian.JIRA>
References: <JIRA.13020142.1478882176000@Atlassian.JIRA> <JIRA.13020142.1478882176082@arcas>
Subject: [jira] [Commented] (SPARK-18414) sc.textFile doesn't seem to use
 LzoTextInputFormat when hadoop-lzo is installed
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Mon, 14 Nov 2016 09:54:00 -0000


    [ https://issues.apache.org/jira/browse/SPARK-18414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15663327#comment-15663327 ] 

Sean Owen commented on SPARK-18414:
-----------------------------------

I suppose it depends on how common this is. Core Hadoop, and therefore Spark, already supports the common compression codecs out of the box, and I think Spark would just inherit Hadoop's support unless there were a strong reason to add something further. In this case, if it requires GPL code, Spark can't ship it directly anyway. You can, however, add it to your own app if you like, and that seems like it might be sufficient for this use case.
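
As a rough illustration of the "add it to your app" route, the sketch below wraps both read paths from the issue description so their counts can be compared over repeated runs. It assumes the hadoop-lzo jar is on the application classpath. The object name LzoCountCheck and its helper names are hypothetical, not part of Spark or hadoop-lzo.

```scala
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.SparkContext

// Hypothetical helper object, for illustration only.
object LzoCountCheck {

  // Count lines through the default sc.textFile path.
  def countWithTextFile(sc: SparkContext, path: String): Long =
    sc.textFile(path).count()

  // Count lines while explicitly requesting hadoop-lzo's LzoTextInputFormat.
  def countWithLzoFormat(sc: SparkContext, path: String): Long =
    sc.newAPIHadoopFile(
        path,
        classOf[LzoTextInputFormat],
        classOf[LongWritable],
        classOf[Text])
      .map(_._2.toString) // keep only the line text, drop the byte offset key
      .count()

  // Run a count several times and collect the distinct results.
  // A stable read path yields a set with exactly one element.
  def distinctCounts(runs: Int)(count: () => Long): Set[Long] =
    (1 to runs).map(_ => count()).toSet
}
```

With an existing SparkContext sc and a directory path files (both supplied by the application), a call such as LzoCountCheck.distinctCounts(1000)(() => LzoCountCheck.countWithTextFile(sc, files)) would show whether the count varies between runs. The same call with countWithLzoFormat gives the comparison.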

> sc.textFile doesn't seem to use LzoTextInputFormat when hadoop-lzo is installed
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18414
>                 URL: https://issues.apache.org/jira/browse/SPARK-18414
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.1
>            Reporter: Renan Vicente Gomes da Silva
>            Priority: Minor
>              Labels: hadoop-lzo
>
> When reading LZO files using sc.textFile, it misses a few files from time to time.
> Sample:
>       val Data = sc.textFile(Files)
>       listFiles += Data.count()
> Here Files is an HDFS directory containing LZO files. If this is executed, for example, 1000 times, it returns different results a few times.
> If you instead use newAPIHadoopFile to force it to use com.hadoop.mapreduce.LzoTextInputFormat, it works perfectly and returns the same result in every execution.
> Sample:
>       val Data = sc.newAPIHadoopFile(Files,
>         classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>         classOf[org.apache.hadoop.io.LongWritable],
>         classOf[org.apache.hadoop.io.Text]).map(_._2.toString)
>       listFiles += Data.count()
> Looking at the Spark code, it appears to use TextInputFormat by default and does not switch to com.hadoop.mapreduce.LzoTextInputFormat when hadoop-lzo is installed.
> https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/SparkContext.scala#L795-L801

