spark-issues mailing list archives

From "Michael Armbrust (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-4963) SchemaRDD.sample may return wrong results
Date Tue, 30 Dec 2014 18:34:13 GMT

    [ https://issues.apache.org/jira/browse/SPARK-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261331#comment-14261331 ]

Michael Armbrust commented on SPARK-4963:
-----------------------------------------

Row mutability is an internal optimization, and we always copy at the boundaries where we expose
data to the user.  We should not remove it from the Parquet or Hive table scans, because it
greatly improves performance.
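
To illustrate the point about copying at boundaries, here is a minimal sketch in plain Scala. It is not Spark code: `MutableRow`, `scan`, and `RowReuseSketch` are hypothetical names made up for this example. It assumes a scan that reuses one row object for every record, and shows what a consumer sees when it keeps references with and without copying.

```scala
object RowReuseSketch {
  // Hypothetical stand-in for a mutable row; not Spark's actual row class.
  final class MutableRow(var key: Int) {
    def copy(): MutableRow = new MutableRow(key)
  }

  // A scan that reuses a single row object for every record it produces,
  // which avoids allocating one object per record.
  def scan(keys: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    keys.iterator.map { k => shared.key = k; shared }
  }

  val keys = Seq(1, 2, 3)

  // Keeping references without copying: every element is the same object,
  // so all of them show the last value written.
  val aliased: Vector[Int] = scan(keys).toVector.map(_.key)

  // Copying at the boundary: each element is an independent snapshot.
  val copied: Vector[Int] = scan(keys).map(_.copy()).toVector.map(_.key)

  def main(args: Array[String]): Unit = {
    println(aliased) // Vector(3, 3, 3)
    println(copied)  // Vector(1, 2, 3)
  }
}
```

Code that consumes each row immediately is unaffected by the reuse. Only code that holds on to rows across calls to `next()` needs the copy.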

> SchemaRDD.sample may return wrong results
> -----------------------------------------
>
>                 Key: SPARK-4963
>                 URL: https://issues.apache.org/jira/browse/SPARK-4963
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Cheng Lian
>            Assignee: Yanbo Liang
>
> This {{sbt/sbt hive/console}} session can easily reproduce this issue:
> {code}
> sql("SELECT * FROM src WHERE key % 2 = 0").
>   sample(withReplacement = false, fraction = 0.05).
>   registerTempTable("sampled")
> println(table("sampled").queryExecution)
> val query = sql("SELECT * FROM sampled WHERE key % 2 = 1")
> println(query.queryExecution)
> // Should print `true'
> println((1 to 10).map(_ => query.collect().isEmpty).reduce(_ && _))
> {code}
> Notice that when fraction is less than 0.4, {{GapSamplingIterator}} is used to do the
> sampling. My guess is that it has something to do with the underlying mutable row objects
> used in {{HiveTableScan}}, but I haven't figured out the root cause.
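
For readers unfamiliar with gap sampling, the general technique can be sketched as follows. This is an illustration only, not Spark's {{GapSamplingIterator}}, whose implementation may differ. `GapSamplingSketch` and `gapSample` are hypothetical names. The sketch draws the number of elements to skip before each accepted element from a geometric distribution, instead of testing every element against a random number. It makes no claim about the root cause of this issue.

```scala
import scala.util.Random

object GapSamplingSketch {
  // Returns each element with probability `fraction`. Rather than drawing one
  // random number per element, it draws the size of the gap to skip before the
  // next accepted element.
  def gapSample[T](data: Iterator[T], fraction: Double, rng: Random): Iterator[T] = {
    require(fraction > 0.0 && fraction < 1.0, "fraction must be in (0, 1)")
    val lnQ = math.log1p(-fraction) // ln(1 - fraction), always negative here

    new Iterator[T] {
      // Consume and discard a geometrically distributed number of elements.
      private def skipGap(): Unit = {
        // Guard against log(0) if the generator returns exactly 0.0.
        val u = math.max(rng.nextDouble(), Double.MinPositiveValue)
        var toSkip = (math.log(u) / lnQ).toLong
        while (toSkip > 0 && data.hasNext) { data.next(); toSkip -= 1 }
      }

      skipGap() // position the underlying iterator on the first accepted element

      def hasNext: Boolean = data.hasNext

      def next(): T = {
        val item = data.next()
        skipGap() // advance past the next gap before returning
        item
      }
    }
  }
}
```

Note that in this sketch `next()` advances the underlying iterator past the following gap before it returns the accepted element.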



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


