spark-dev mailing list archives

From: Steve Loughran <>
Subject: Re: Performance of loading parquet files into case classes in Spark
Date: Tue, 30 Aug 2016 18:16:33 GMT

On 29 Aug 2016, at 20:58, Julien Dumazert <...> wrote:

Hi Maciek,

I followed your recommendation and benchmarked DataFrame aggregations on Datasets. Here is
what I got:

// Dataset with RDD-style code
// 34.223s
df.as[A].map(_.fieldToSum).reduce(_ + _)

// Dataset with map and DataFrame sum
// 35.372s
df.as[A].map(_.fieldToSum).agg(sum("value")).collect().head.getAs[Long](0)
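
For context, a minimal sketch of the setup these snippets assume; the case class A and the
field fieldToSum come from the code above, while the session wiring and the Parquet path are
placeholders:

import org.apache.spark.sql.SparkSession

// Case class matching the Parquet schema; only the summed field is shown.
case class A(fieldToSum: Long)

val spark = SparkSession.builder().appName("parquet-agg-bench").getOrCreate()
import spark.implicits._   // encoders for case classes and primitives

// DataFrame the benchmarked snippets operate on; the path is a placeholder.
val df = spark.read.parquet("/path/to/data.parquet")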

Not much of a difference. It seems that as soon as you access the data RDD-style, you force
the full decoding of each row into a case class, which is super costly.

I find this behavior quite normal: as soon as you let the user pass a black-box function,
anything can happen, so you have to load the whole object. On the other hand, when using
SQL-style functions only, everything is "white box", so Spark understands what you want to do
and can optimize.
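
As a sketch of that contrast (names reused from the snippets above): the map-based form
hands Spark an opaque closure, so every row has to be decoded into an A before the lambda can
run, while an aggregation expressed purely on columns lets Catalyst see that only one field
is needed:

import org.apache.spark.sql.functions.sum

// Opaque closure: every row is deserialized into a full A before summing.
val viaClosure = df.as[A].map(_.fieldToSum).reduce(_ + _)

// Column expression: Spark sees only fieldToSum is needed and can skip
// decoding the rest of the row entirely.
val viaColumns = df.agg(sum("fieldToSum")).collect().head.getAs[Long](0)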

SQL and the DataFrame code where you are asking for a specific field can be handled by the
file format itself, optimising the operation. If you ask for only one column of Parquet or
ORC data, then only that column's data should be loaded. And because these formats store
columns together, you save all the IO needed to read the discarded columns. Add even more
selectiveness (such as ranges on values), and you can even get "predicate pushdown", where
blocks of the file are skipped if the input format reader can determine that none of the
values there can match the predicate's conditions.
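
A sketch of what that looks like from the API side (the path, column name and threshold are
placeholders); the physical plan from explain() shows the pruned read schema and the pushed
filters on the Parquet scan:

import org.apache.spark.sql.functions.col

// Column pruning: only fieldToSum is read from the Parquet files.
// Predicate pushdown: the filter is handed to the Parquet reader, which can
// skip whole row groups whose min/max statistics rule out matching values.
val pruned = spark.read.parquet("/path/to/data.parquet")
  .select("fieldToSum")
  .filter(col("fieldToSum") > 100)

pruned.explain()   // look for ReadSchema and PushedFilters in the scan node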

You should be able to get away with something like"field")... to select just the
fields you want first, then stay in code rather than SQL.
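
Concretely, something along these lines (a sketch, assuming fieldToSum is a Long as in the
setup above): project the single column first so the scan only reads it, then go back to
typed code for the reduction.

// Requires `import spark.implicits._` for $ and the Long encoder.
// Only fieldToSum is read from Parquet; the typed reduce then runs on a
// Dataset[Long] instead of fully decoded case classes.
val total = df.select($"fieldToSum".as[Long]).reduce(_ + _)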

Anyway, experiment: it's always more accurate than the opinions of others, especially when
applied to your own datasets.
