spark-issues mailing list archives

From "Yin Huai (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-10143) Parquet changed the behavior of calculating splits
Date Fri, 21 Aug 2015 21:31:45 GMT

     [ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai reassigned SPARK-10143:
--------------------------------

    Assignee: Yin Huai

> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
>                 Key: SPARK-10143
>                 URL: https://issues.apache.org/jira/browse/SPARK-10143
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Critical
>             Fix For: 1.5.0
>
>
> When Parquet's task-side metadata is enabled (it is enabled by default, and it needs to be
> enabled to handle tables with many files), Parquet delegates the calculation of the initial
> splits to FileInputFormat (see https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
> If the filesystem's block size is smaller than the row group size and users do not set a
> min split size, the initial split list will contain lots of dummy splits, which turn into
> empty tasks (because the range between the starting point and ending point of such a split
> does not cover the starting point of any row group).
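
For illustration only, here is a minimal Scala sketch of how the mismatch produces empty
tasks. FileInputFormat computes its split size as max(minSize, min(maxSize, blockSize)), so
with a small block size and the default min split size the splits end up smaller than a row
group, and every split whose range contains no row-group start does no work. The sizes below
(64 MB blocks, 128 MB row groups, a 4-row-group file) are assumed for the example, not taken
from the issue:

    // Sketch of FileInputFormat-style split computation with assumed sizes.
    object SplitSketch {
      def main(args: Array[String]): Unit = {
        val blockSize    = 64L * 1024 * 1024   // filesystem block size (assumed 64 MB)
        val rowGroupSize = 128L * 1024 * 1024  // Parquet row group size (assumed 128 MB)
        val minSize      = 1L                  // user did not set a min split size
        val maxSize      = Long.MaxValue

        // splitSize = max(minSize, min(maxSize, blockSize)), as in FileInputFormat.
        val splitSize = math.max(minSize, math.min(maxSize, blockSize))
        val fileSize  = 4 * rowGroupSize       // a file with 4 row groups

        // Row groups start at multiples of rowGroupSize; a split only produces work
        // if some row group starts within [splitStart, splitStart + splitSize).
        val splits = 0L until fileSize by splitSize
        val empty  = splits.count { start =>
          val nextRowGroupStart = ((start + rowGroupSize - 1) / rowGroupSize) * rowGroupSize
          nextRowGroupStart >= start + splitSize
        }
        println(s"splits = ${splits.size}, empty splits (=> empty tasks) = $empty")
      }
    }

With these assumed sizes, half of the splits cover no row-group start and become empty tasks.
Raising the FileInputFormat min split size (e.g. mapreduce.input.fileinputformat.split.minsize)
to at least the row group size would avoid the dummy splits; that is a workaround sketch, not
necessarily the fix applied for this issue.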



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

