spark-issues mailing list archives

From "Shea Parkes (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-19116) LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file
Date Sat, 05 Aug 2017 00:59:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16115176#comment-16115176 ]

Shea Parkes commented on SPARK-19116:
-------------------------------------

Apologies for not responding earlier.  I'm struggling to recall all the nuances of my original
issue here, but your response does seem to be an acceptable answer.

It's not intuitive to me, and we've pretty much gone over to explicit broadcast joining in
all of our production work, but if this is as intended, then I'm fine closing this down...

> LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file
> -----------------------------------------------------------------
>
>                 Key: SPARK-19116
>                 URL: https://issues.apache.org/jira/browse/SPARK-19116
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.1, 2.0.2
>         Environment: Python 3.5.x
> Windows 10
>            Reporter: Shea Parkes
>
> We're having some modestly severe issues with broadcast join inference, and I've been
> chasing them through the join heuristics in the catalyst engine.  I've made it as far as I
> can, and I've hit upon something that does not make any sense to me.
> I thought that loading from parquet would be a RelationPlan, which would just use the
> sum of default sizeInBytes for each column times the number of rows.  But this trivial example
> shows that I am not correct:
> {code}
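> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
> # Assumption: the original snippet relied on an existing `session`; create one so it runs standalone
> session = SparkSession.builder.getOrCreate()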
> df_range = session.range(100).select(F.col('id').cast('integer'))
> df_range.write.parquet('c:/scratch/hundred_integers.parquet')
> df_parquet = session.read.parquet('c:/scratch/hundred_integers.parquet')
> df_parquet.explain(True)
> # Expected sizeInBytes
> integer_default_sizeinbytes = 4
> print(df_parquet.count() * integer_default_sizeinbytes)  # = 400
> # Inferred sizeInBytes
> print(df_parquet._jdf.logicalPlan().statistics().sizeInBytes())  # = 2318
> # For posterity (Didn't really expect this to match anything above)
> print(df_range._jdf.logicalPlan().statistics().sizeInBytes())  # = 600
> {code}
> And here's the results of explain(True) on df_parquet:
> {code}
> == Parsed Logical Plan ==
> Relation[id#794] parquet
> == Analyzed Logical Plan ==
> id: int
> Relation[id#794] parquet
> == Optimized Logical Plan ==
> Relation[id#794] parquet
> == Physical Plan ==
> *BatchedScan parquet [id#794] Format: ParquetFormat, InputPaths: file:/c:/scratch/hundred_integers.parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
> {code}
> So basically, I don't understand how the size of the parquet file is being estimated.
> I don't expect it to be extremely accurate, but empirically it's so inaccurate that we're
> having to mess with autoBroadcastJoinThreshold way too much.  (It's not always too high like
> the example above; it's often way too low.)
> Without deeper understanding, I'm considering a result of 2318 instead of 400 to be a
> bug.  My apologies if I'm missing something obvious.
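
For reference, the arithmetic behind the expectation in the quoted description can be written out in plain Python. This is only a sketch of the reporter's mental model (rows times the sum of per-column default sizes), not of how Spark actually derives sizeInBytes for a parquet relation. The helper names are hypothetical, the threshold comparison is a simplified assumption about the broadcast heuristic, and the 10 MB value is the documented default of spark.sql.autoBroadcastJoinThreshold. The figures 100, 4, and 2318 come from the report.

```python
# Hypothetical helpers illustrating the reporter's expectation; not Spark code.

INTEGER_DEFAULT_SIZE = 4  # bytes per integer value, as assumed in the report

# Documented default of spark.sql.autoBroadcastJoinThreshold (10 MB).
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024


def expected_size_in_bytes(row_count, column_default_sizes):
    """Row count times the sum of the per-column default sizes."""
    return row_count * sum(column_default_sizes)


def would_auto_broadcast(size_in_bytes, threshold=DEFAULT_BROADCAST_THRESHOLD):
    """Simplified assumption: a side is broadcast when its estimated size
    is non-negative and does not exceed the threshold (-1 disables it)."""
    return 0 <= size_in_bytes <= threshold


expected = expected_size_in_bytes(100, [INTEGER_DEFAULT_SIZE])
reported = 2318  # value printed by statistics().sizeInBytes() in the report

print(expected)  # 400
print(would_auto_broadcast(expected), would_auto_broadcast(reported))
```

Under this simplified model both 400 and 2318 sit far below the default threshold, so the discrepancy would only change the join strategy for tables whose estimate lands near the configured limit.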



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


