spark-issues mailing list archives

From "Shea Parkes (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-19116) LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file
Date Sat, 07 Jan 2017 14:35:58 GMT
Shea Parkes created SPARK-19116:
-----------------------------------

             Summary: LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file
                 Key: SPARK-19116
                 URL: https://issues.apache.org/jira/browse/SPARK-19116
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 2.0.2, 2.0.1
         Environment: Python 3.5.x
Windows 10
            Reporter: Shea Parkes


We're having some moderately severe issues with broadcast join inference, and I've been chasing them through the join heuristics in the Catalyst engine.  I've made it as far as I can, and I've hit something that doesn't make sense to me.

I thought that loading from parquet would produce a RelationPlan whose statistics are just the sum of the default sizeInBytes for each column times the number of rows, but this trivial example shows that I'm not correct:

{code}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

session = SparkSession.builder.getOrCreate()

df_range = session.range(100).select(F.col('id').cast('integer'))
df_range.write.parquet('c:/scratch/hundred_integers.parquet')

df_parquet = session.read.parquet('c:/scratch/hundred_integers.parquet')
df_parquet.explain(True)

# Expected sizeInBytes: 100 rows * 4 bytes per integer column
integer_default_sizeinbytes = 4
print(df_parquet.count() * integer_default_sizeinbytes)  # = 400

# Inferred sizeInBytes from the logical plan's statistics
print(df_parquet._jdf.logicalPlan().statistics().sizeInBytes())  # = 2318

# For posterity (didn't really expect this to match anything above)
print(df_range._jdf.logicalPlan().statistics().sizeInBytes())  # = 600
{code}

And here are the results of explain(True) on df_parquet:
{code}
== Parsed Logical Plan ==
Relation[id#794] parquet

== Analyzed Logical Plan ==
id: int
Relation[id#794] parquet

== Optimized Logical Plan ==
Relation[id#794] parquet

== Physical Plan ==
*BatchedScan parquet [id#794] Format: ParquetFormat, InputPaths: file:/c:/scratch/hundred_integers.parquet,
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
{code}
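
To help narrow down where the 2318 comes from, here's a quick diagnostic sketch (an assumption on my part, reusing the same output path as above): sum the raw bytes of the files Spark wrote and compare that against the reported estimate.

{code}
import os

# Sketch: total on-disk size of everything Spark wrote for this
# dataset (part files, _SUCCESS marker, parquet footers/metadata).
parquet_dir = 'c:/scratch/hundred_integers.parquet'
on_disk_bytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _dirs, names in os.walk(parquet_dir)
    for name in names
)
print(on_disk_bytes)  # compare against the 2318 reported above
{code}

If the estimate tracks the bytes on disk rather than the column defaults, that could explain why it swings both ways: compression pulls it below the in-memory size, while file metadata inflates tiny files like this one.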

So basically, I don't understand how the size of the parquet file is being estimated.  I don't expect it to be extremely accurate, but empirically it's inaccurate enough that we're having to mess with autoBroadcastJoinThreshold far too much.  (It's not always too high, as in the example above; it's often way too low.)

Without a deeper understanding, I'm considering a result of 2318 instead of 400 to be a bug.  My apologies if I'm missing something obvious.
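
For completeness, the knobs we're leaning on in the meantime look roughly like this (a sketch, not a fix; df_large and df_small are placeholders for our real tables):

{code}
from pyspark.sql.functions import broadcast

# Raise (or lower) the global threshold, in bytes, to compensate
# for the skewed estimates:
session.conf.set('spark.sql.autoBroadcastJoinThreshold', 50 * 1024 * 1024)

# Or sidestep the size estimate for a known-small side of a join
# with an explicit hint:
# joined = df_large.join(broadcast(df_small), 'id')
{code}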


