spark-issues mailing list archives

From "Yin Huai (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-10143) Parquet changed the behavior of calculating splits
Date Fri, 21 Aug 2015 18:19:45 GMT

    [ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707194#comment-14707194
] 

Yin Huai commented on SPARK-10143:
----------------------------------

Oh, I meant that the current value of the configuration is a much better heuristic for determining
the number of mappers than the default HDFS block size when the HDFS block size is small.

> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
>                 Key: SPARK-10143
>                 URL: https://issues.apache.org/jira/browse/SPARK-10143
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Priority: Critical
>
> When Parquet's task-side metadata is enabled (it is enabled by default, and it needs to
> be enabled to deal with tables with many files), Parquet delegates the work of calculating
> initial splits to FileInputFormat (see https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
> If the filesystem's block size is smaller than the row group size and users do not set a min split
> size, the initial split list will contain many dummy splits, which result in
> empty tasks (because the range between a split's starting point and ending point does not cover the
> starting point of any row group).
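
The effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Parquet's or Hadoop's actual code: the split size formula follows FileInputFormat's max(minSize, min(maxSize, blockSize)) rule, while the file length, block size, row group size, and all function names are hypothetical values chosen for illustration. It also assumes row groups start at exact multiples of the row group size and ignores the slack FileInputFormat allows on the last split.

```python
MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Same shape as FileInputFormat's rule: max(minSize, min(maxSize, blockSize)).
    return max(min_size, min(max_size, block_size))


def plan_splits(file_len, split_size):
    # Simplified planning: fixed-size [start, end) ranges over the file.
    return [(start, min(start + split_size, file_len))
            for start in range(0, file_len, split_size)]


def count_empty_splits(splits, row_group_starts):
    # A split does real work only if some row group starts inside it.
    return sum(
        1 for start, end in splits
        if not any(start <= rg < end for rg in row_group_starts)
    )


# Hypothetical layout: a 512 MB file with 128 MB row groups on 32 MB blocks.
file_len = 512 * MB
row_group_size = 128 * MB
row_group_starts = list(range(0, file_len, row_group_size))

# No min split size set: the split size falls back to the small block size.
default_splits = plan_splits(file_len, compute_split_size(32 * MB))
default_empty = count_empty_splits(default_splits, row_group_starts)

# Min split size raised to the row group size.
tuned_splits = plan_splits(
    file_len, compute_split_size(32 * MB, min_size=row_group_size))
tuned_empty = count_empty_splits(tuned_splits, row_group_starts)

print(len(default_splits), default_empty)  # 16 12
print(len(tuned_splits), tuned_empty)      # 4 0
```

In this toy layout, 12 of the 16 default splits cover no row group start and would become empty tasks, while raising the min split size to the row group size leaves 4 splits, each with work to do.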



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

