spark-issues mailing list archives

From "Dongjoon Hyun (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-12417) Orc bloom filter options are not propagated during file write in spark
Date Mon, 10 Sep 2018 18:24:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-12417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609638#comment-16609638 ]

Dongjoon Hyun edited comment on SPARK-12417 at 9/10/18 6:23 PM:
----------------------------------------------------------------

This has been fixed since Spark 2.0.0.
{code}
scala> spark.version
res0: String = 2.0.0

scala> Seq((1,2)).toDF("a", "b").write.option("orc.bloom.filter.columns", "*").orc("/tmp/orc200")

$ hive --orcfiledump /tmp/orc200/part-r-00007-d36ca145-1e23-4d3a-ba99-09506e4ed8cc.snappy.orc
...
Stripes:
  Stripe: offset: 3 data: 12 rows: 1 tail: 92 index: 1390
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 0 section BLOOM_FILTER start: 14 length 426
    Stream: column 1 section ROW_INDEX start: 440 length 24
    Stream: column 1 section BLOOM_FILTER start: 464 length 456
    Stream: column 2 section ROW_INDEX start: 920 length 24
    Stream: column 2 section BLOOM_FILTER start: 944 length 449
    Stream: column 1 section DATA start: 1393 length 6
    Stream: column 2 section DATA start: 1399 length 6
...
{code}
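
For reference, a minimal spark-shell sketch along the same lines (assuming Spark 2.x or later, an illustrative output path /tmp/orc_bloom, and that the standard ORC orc.bloom.filter.fpp property is propagated the same way as orc.bloom.filter.columns above):
{code}
// Sketch only: assumes a spark-shell session where `spark` is predefined.
// The output path /tmp/orc_bloom is illustrative.
import spark.implicits._

Seq((1, 2), (3, 4)).toDF("a", "b")
  .write
  .option("orc.bloom.filter.columns", "a,b")  // build bloom filters for columns a and b
  .option("orc.bloom.filter.fpp", "0.05")     // assumed: ORC false-positive-probability property
  .mode("overwrite")
  .orc("/tmp/orc_bloom")
{code}
The resulting files can be inspected with hive --orcfiledump, as in the dump above, to confirm that BLOOM_FILTER streams were written for the selected columns.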


> Orc bloom filter options are not propagated during file write in spark
> ----------------------------------------------------------------------
>
>                 Key: SPARK-12417
>                 URL: https://issues.apache.org/jira/browse/SPARK-12417
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Rajesh Balamohan
>            Assignee: Apache Spark
>            Priority: Minor
>             Fix For: 2.0.0
>
>         Attachments: SPARK-12417.1.patch
>
>
> ORC bloom filter is supported by the version of Hive used in Spark 1.5.2. However, when trying to create an ORC file with the bloom filter option, Spark does not make use of it.
> E.g., the following ORC write does not create the bloom filter even though the options are specified.
> {noformat}
>     Map<String, String> orcOption = new HashMap<String, String>();
>     orcOption.put("orc.bloom.filter.columns", "*");
>     hiveContext.sql("select * from accounts where effective_date='2015-12-30'").write().
>         format("orc").options(orcOption).save("/tmp/accounts");
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

