spark-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-22240) S3 CSV number of partitions incorrectly computed
Date Tue, 24 Oct 2017 18:23:01 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217398#comment-16217398 ]

Steve Loughran commented on SPARK-22240:
----------------------------------------

No, Spark 2.2 doesn't fix this. I have to explicitly define the schema as the list of headers -> String, and then reader setup time drops to 6s.
{code}
2017-10-24 19:19:31,593 [ScalaTest-main-running-S3ACommitBulkDataSuite] DEBUG fs.FsUrlStreamHandlerFactory (FsUrlStreamHandlerFactory.java:createURLStreamHandler(107)) - Unknown protocol jar, delegating to default implementation
2017-10-24 19:19:36,402 [ScalaTest-main-running-S3ACommitBulkDataSuite] INFO  commit.S3ACommitBulkDataSuite (Logging.scala:logInfo(54)) - Duration of set up initial .csv load = 6,195,721,969 nS
{code}
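
As an illustration (not taken from the test suite itself), this is roughly what "define the schema as the list of headers -> String" can look like; the column names and the path are placeholders:
{code}
// Illustrative sketch only: declare every column as String up front so the
// reader does not have to sample the data to infer types.
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val headers = Seq("PARTY_KEY", "ROW_START_DATE")   // placeholder column names
val allStrings = StructType(headers.map(name => StructField(name, StringType, nullable = true)))

val df = spark.read
  .format("csv")
  .option("header", "true")
  .schema(allStrings)          // explicit schema, so no inference pass over the file
  .load("s3a://<s3_path>")
{code}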

I'm not sure whether this is related to the original bug, though it is potentially part of the issue. What I'm seeing is that either schema inference always takes place, or reading the first line of a .gz file is enough to force reading the entire .gz source file.
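
A minimal, hypothetical way to check that from a spark-shell session is to time reader set-up with and without the explicit schema from the sketch above (the timing helper and the path are placeholders of mine, not from the suite):
{code}
// Hypothetical comparison of reader set-up time; `allStrings` is the schema
// from the previous sketch and <s3_path> is a placeholder.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

// No schema supplied: Spark has to touch the source to work out the columns
// (and, if inference is enabled, their types).
val withoutSchema = timed("load without schema") {
  spark.read.format("csv").option("header", "true").load("s3a://<s3_path>")
}

// Explicit schema supplied: set-up should not need to scan the data.
val withSchema = timed("load with schema") {
  spark.read.format("csv").option("header", "true").schema(allStrings).load("s3a://<s3_path>")
}
{code}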

> S3 CSV number of partitions incorrectly computed
> ------------------------------------------------
>
>                 Key: SPARK-22240
>                 URL: https://issues.apache.org/jira/browse/SPARK-22240
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>         Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
>            Reporter: Arthur Baudry
>
> Reading CSV out of S3 using the S3A protocol does not compute the number of partitions correctly in Spark 2.2.0.
> With Spark 2.2.0 I get only one partition when loading a 14GB file:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 1
> {code}
> While in Spark 2.0.2 I had:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 115
> {code}
> This introduces obvious performance issues in Spark 2.2.0. Maybe there is a property that should be set to have the number of partitions computed correctly.
> I'm aware that .option("multiLine", "true") is not supported in Spark 2.0.2, but that is not relevant here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

