spark-issues mailing list archives

From "Dominic Ricard (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17495) Hive hash implementation
Date Tue, 04 Apr 2017 19:59:41 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955714#comment-15955714 ]

Dominic Ricard commented on SPARK-17495:
----------------------------------------

[~tejasp] We use murmur3 hash internally in some of our non-SQL data pipelines, and I would
like to know whether the goal of this task is to expose a new UDF (e.g. murmur3()), similar
to md5() and hash()?

I believe that would be the best approach to preserve compatibility with previously generated
data and queries using hash().



> Hive hash implementation
> ------------------------
>
>                 Key: SPARK-17495
>                 URL: https://issues.apache.org/jira/browse/SPARK-17495
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>             Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from the hash function
> used by Hive. For queries which use bucketing, this leads to different results if one tries the
> same query on both engines. We want users to have backward compatibility so that one
> can switch parts of applications across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


