spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-18414) sc.textFile doesn't seem to use LzoTextInputFormat when hadoop-lzo is installed
Date Mon, 14 Nov 2016 09:53:58 GMT

    [ https://issues.apache.org/jira/browse/SPARK-18414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15663327#comment-15663327 ]

Sean Owen commented on SPARK-18414:
-----------------------------------

I suppose it depends on how common this is. Core Hadoop, and therefore Spark, already supports
common compression codecs out of the box, and I think Spark would just inherit Hadoop's support
unless there was a compelling reason to add something further. In this case, if it requires GPL
code, Spark can't ship it directly anyway. You can, however, add it to your own application, and
that seems like it might be sufficient for this use case.
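
For reference, a minimal sketch of what "add it to your app" can look like, assuming the GPL
hadoop-lzo jar (and its native library) is available to the driver and executors, e.g. via
--jars or spark.executor.extraClassPath; the input path and app name below are placeholders:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.{SparkConf, SparkContext}

    object LzoReadExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("lzo-read"))

        // Ask for the LZO-aware input format explicitly instead of relying on
        // sc.textFile, which always uses TextInputFormat.
        val lines = sc.newAPIHadoopFile(
            "hdfs:///data/events/*.lzo",                       // placeholder path
            classOf[com.hadoop.mapreduce.LzoTextInputFormat],  // from the hadoop-lzo jar
            classOf[LongWritable],
            classOf[Text])
          .map { case (_, text) => text.toString }

        println(s"record count: ${lines.count()}")
        sc.stop()
      }
    }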

> sc.textFile doesn't seem to use LzoTextInputFormat when hadoop-lzo is installed
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18414
>                 URL: https://issues.apache.org/jira/browse/SPARK-18414
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.1
>            Reporter: Renan Vicente Gomes da Silva
>            Priority: Minor
>              Labels: hadoop-lzo
>
> When reading LZO files using sc.textFile, it misses a few files from time to time.
> Sample:
>       val Data = sc.textFile(Files)
>       listFiles += Data.count()
> Here Files is an HDFS directory containing LZO files. If the job is executed, say, 1000 times,
> it returns different counts in a few of the runs.
> If you instead use newAPIHadoopFile to force com.hadoop.mapreduce.LzoTextInputFormat, it works
> perfectly and returns the same count in every execution.
> Sample:
>       val Data = sc.newAPIHadoopFile(Files,
>         classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>         classOf[org.apache.hadoop.io.LongWritable],
>         classOf[org.apache.hadoop.io.Text]).map(_._2.toString)
>       listFiles += Data.count()
> Looking at the Spark code, sc.textFile uses TextInputFormat by default and does not pick up
> com.hadoop.mapreduce.LzoTextInputFormat even when hadoop-lzo is installed.
> https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/SparkContext.scala#L795-L801
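
(For context, the lines referenced above fix the input format: sc.textFile is effectively
equivalent to the hadoopFile call sketched below, with TextInputFormat hardcoded, so hadoop-lzo's
LzoTextInputFormat is never selected automatically. This is a paraphrased, runnable sketch, not a
verbatim excerpt of SparkContext; the path is a placeholder.)

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object TextFileEquivalent {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("textfile-equivalent"))

        // What sc.textFile(path) boils down to: the input format is fixed to
        // TextInputFormat, regardless of whether hadoop-lzo is on the classpath.
        val lines = sc
          .hadoopFile("hdfs:///data/events", classOf[TextInputFormat],
            classOf[LongWritable], classOf[Text])
          .map(_._2.toString)

        println(lines.count())
        sc.stop()
      }
    }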



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
