spark-issues mailing list archives

From "MIchael Davies (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-5309) Reduce Binary/String conversion overhead when reading/writing Parquet files
Date Mon, 19 Jan 2015 18:15:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282790#comment-14282790 ]

MIchael Davies edited comment on SPARK-5309 at 1/19/15 6:15 PM:
----------------------------------------------------------------

Looking at the Parquet code, it looks like hooks are already in place to support reducing
conversion overhead by taking advantage of dictionaries.

In particular, PrimitiveConverter has the methods hasDictionarySupport and addValueFromDictionary
for this purpose. These are not used by CatalystPrimitiveConverter.
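
For illustration, here is a rough sketch (not the eventual PR; the converter name and the
update callback are hypothetical, and the package names are those of the parquet-mr 1.6 line
used by Spark 1.2) of how a Catalyst string converter could use these hooks so that each
dictionary entry is decoded to a String only once per row group:

// Hypothetical sketch: a primitive converter that opts in to Parquet's
// dictionary support so each dictionary entry is decoded to a String once,
// instead of once per occurrence of the value.
import parquet.column.Dictionary
import parquet.io.api.{Binary, PrimitiveConverter}

class DictionaryAwareStringConverter(update: String => Unit) extends PrimitiveConverter {

  // Dictionary entries decoded up front, indexed by dictionary id.
  private var expandedDictionary: Array[String] = _

  // Tells Parquet this converter can consume dictionary ids directly.
  override def hasDictionarySupport: Boolean = true

  // Called once per dictionary-encoded row group.
  override def setDictionary(dictionary: Dictionary): Unit = {
    expandedDictionary = Array.tabulate(dictionary.getMaxId + 1) { id =>
      dictionary.decodeToBinary(id).toStringUsingUTF8
    }
  }

  // For dictionary-encoded pages Parquet passes the id, not the Binary,
  // so the String is looked up rather than re-decoded.
  override def addValueFromDictionary(dictionaryId: Int): Unit =
    update(expandedDictionary(dictionaryId))

  // Fallback for pages that are not dictionary encoded.
  override def addBinary(value: Binary): Unit =
    update(value.toStringUsingUTF8)
}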

I will get a PR together covering this, as the query performance savings can be substantial.



was (Author: michael davies):
Looking at the Parquet code, it looks like hooks are already in place to support this.

In particular, PrimitiveConverter has the methods hasDictionarySupport and addValueFromDictionary
for this purpose.

These are not used by CatalystPrimitiveConverter.


> Reduce Binary/String conversion overhead when reading/writing Parquet files
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-5309
>                 URL: https://issues.apache.org/jira/browse/SPARK-5309
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: MIchael Davies
>            Priority: Minor
>
> Converting between Parquet Binary and Java Strings can form a significant proportion of query times.
> For columns which have repeated String values (which is common), the same Binary will be converted repeatedly.
> A simple change to cache the last converted String per column was shown to reduce query times by 25% when grouping on a data set of 66M rows on a column with many repeated Strings.
> A possible optimisation would be to hand responsibility for Binary encoding/decoding over to Parquet so that it could ensure that this is done only once per Binary value.
> The next step is to look at the Parquet code and to discuss with that project, which I will do.
> More details are available on this discussion:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html
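
For reference, a minimal sketch of the per-column last-value cache described in the issue above,
assuming one converter instance per column as in Spark's Parquet support; the class name and
update callback are hypothetical, and this is not the code behind the 25% figure:

// Hypothetical sketch of the "cache the last converted String per column" change.
// Each column has its own converter instance, so remembering only the previous
// value already avoids many conversions when the same value repeats.
import parquet.io.api.{Binary, PrimitiveConverter}

class CachedStringConverter(update: String => Unit) extends PrimitiveConverter {

  // Last Binary seen for this column and its decoded String.
  private var lastBinary: Binary = _
  private var lastString: String = _

  override def addBinary(value: Binary): Unit = {
    if (lastBinary == null || !lastBinary.equals(value)) {
      // Only decode the UTF-8 bytes when the value differs from the previous one.
      lastBinary = value
      lastString = value.toStringUsingUTF8
    }
    update(lastString)
  }
}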



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

