spark-issues mailing list archives

From "Kay Ousterhout (JIRA)" <>
Subject [jira] [Closed] (SPARK-5906) Input read size incorrect for Parquet files
Date Thu, 19 Feb 2015 07:35:11 GMT


Kay Ousterhout closed SPARK-5906.
    Resolution: Won't Fix

Ah I see -- then I'm closing this, because I'm using Hadoop 2.0, which is why this was an
issue (so it's totally unrelated to Parquet -- it just didn't surface with other file formats
where Spark reads the whole file to do a count).  Thanks for the help Sandy!

> Input read size incorrect for Parquet files
> -------------------------------------------
>                 Key: SPARK-5906
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Web UI
>    Affects Versions: 1.2.1
>            Reporter: Kay Ousterhout
>            Priority: Minor
> When SparkSQL reads input data from Parquet, there are many cases where it doesn't need
> to read the whole file (e.g., to do a count(*), it only needs to read metadata).  Spark reports
> the input size as the entire file size, even though SparkSQL didn't read nearly that much.
> [~sandyr] do you know why this is? I'm seeing this when Spark uses the new Hadoop API,
> so the bytes read come from Hadoop's statistics data.
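The behavior described above -- a count answered from file metadata while the metric charges the whole file -- can be illustrated with a toy sketch. This is not Spark or Parquet code; it is a minimal stand-in format (payload, JSON footer, 4-byte footer length) invented here to show why bytes actually read can be far smaller than the file size a naive metric would report:

```python
import json
import os
import struct
import tempfile

FOOTER_LEN_BYTES = 4  # last 4 bytes of the file hold the footer length

def write_toy_file(path, rows):
    # Data payload followed by a footer that records the row count,
    # loosely mimicking how Parquet keeps counts in its file footer.
    payload = ("\n".join(rows) + "\n").encode()
    footer = json.dumps({"num_rows": len(rows)}).encode()
    with open(path, "wb") as f:
        f.write(payload)
        f.write(footer)
        f.write(struct.pack("<I", len(footer)))

def count_via_footer(path):
    """Answer a count by reading only the footer; return (rows, bytes read)."""
    with open(path, "rb") as f:
        f.seek(-FOOTER_LEN_BYTES, os.SEEK_END)
        (footer_len,) = struct.unpack("<I", f.read(FOOTER_LEN_BYTES))
        f.seek(-(FOOTER_LEN_BYTES + footer_len), os.SEEK_END)
        footer = json.loads(f.read(footer_len))
    return footer["num_rows"], FOOTER_LEN_BYTES + footer_len

path = os.path.join(tempfile.mkdtemp(), "toy.col")
write_toy_file(path, ["row-%d" % i for i in range(1000)])
count, bytes_read = count_via_footer(path)
file_size = os.path.getsize(path)
# The count needs only the footer; reporting file_size as "input read"
# would over-count, which is the discrepancy reported in this issue.
print(count, bytes_read, file_size)
```

In Spark's case the over-count came from the metric source, not the reader: the bytes-read figure was taken from Hadoop's FileSystem statistics, which on the Hadoop version in use attributed the full file size to the task.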

This message was sent by Atlassian JIRA

