hive-user mailing list archives

From Philip Lee <philjj...@gmail.com>
Subject Re: Hi, Hive People urgent question about [Distribute By] function
Date Sat, 24 Oct 2015 22:59:15 GMT
Hello, I have a follow-up to the same question about DISTRIBUTE BY in Hive.

According to you, DISTRIBUTE BY (DBY) does not use the hashCode of Java's
Object class.

I tried to understand how ObjectInspectorUtils works for distribution, but
it relies on a lot of Hive APIs and I could not follow much of it.
I want to override the partitionByHash function in Flink so that it behaves
the same way as DBY in Hive.
I am implementing a benchmark for these two systems, which could be a
contribution to Hive as well.

Could you tell me in detail how it works?
I am pretty sure that if you do not use the hashCode of Java's Object class,
you must have defined your own partition function for DBY.

Regards,
Philip Lee


On Thu, Oct 22, 2015 at 7:13 PM, Gopal Vijayaraghavan <gopalv@apache.org>
wrote:

>
> > So if we want the same result from Hive and Spark or another
> > framework, how could we achieve that?
>
> There's a special backwards compat slow codepath that gets triggered if
> you do
>
> set mapred.reduce.tasks=199; (or any number)
>
> This will produce the exact same hash-code as the java hashcode for
> Strings & Integers.
>
> The bucket-id is determined by
>
> (hashCode & Integer.MAX_VALUE) % numberOfBuckets
>
> but this also triggers a non-stable sort on an entirely empty key, which
> will shuffle the data so the output file's order bears no resemblance to
> the input file's order.
>
>
> Even with that setting, the only consistent layout produced by Hive is the
> CLUSTER BY, which will sort on the same key used for distribution & uses
> the java hashCode if the auto-parallelism is turned off by setting a fixed
> reducer count.
>
> Cheers,
> Gopal
>
>
>
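
To make the bucket-id rule quoted above concrete, here is a minimal Java sketch of (hashCode & Integer.MAX_VALUE) % numberOfBuckets. It is only an illustration, not Hive's actual code path: the class and method names are hypothetical, mapping a null key to hash 0 is an assumption, and it relies on the key's hash being the plain Java hashCode(), which according to the quoted reply holds for Strings and Integers only when a fixed reducer count is set.

```java
// Hypothetical illustration of the bucket-id rule from the quoted reply.
// None of these names are Hive or Flink APIs.
final class DistributeBySketch {

    // (hashCode & Integer.MAX_VALUE) % numberOfBuckets
    // The mask clears the sign bit, so the result is always in [0, numberOfBuckets).
    static int bucketForHash(int hashCode, int numberOfBuckets) {
        if (numberOfBuckets <= 0) {
            throw new IllegalArgumentException("numberOfBuckets must be positive");
        }
        return (hashCode & Integer.MAX_VALUE) % numberOfBuckets;
    }

    // Assumption: the key's hash is its Java hashCode(); a null key is treated as 0.
    static int bucketForKey(Object key, int numberOfBuckets) {
        int hash = (key == null) ? 0 : key.hashCode();
        return bucketForHash(hash, numberOfBuckets);
    }

    public static void main(String[] args) {
        int reducers = 199; // same count as "set mapred.reduce.tasks=199;"
        Object[] keys = {"hive", 42, -7};
        for (Object key : keys) {
            System.out.println(key + " -> bucket " + bucketForKey(key, reducers));
        }
    }
}
```

A function like bucketForKey could presumably be called from a custom partitioner on the Flink side, with numberOfBuckets set to the same value as the fixed reducer count in Hive. That wiring is not shown here, and matching bucket ids would not by itself reproduce the row order within each output file.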


-- 

==========================================================

*Hae Joon Lee*


Now, in Germany,

M.S. Candidate, Interested in Distributed System, Iterative Processing

Dept. of Computer Science, Informatik in German, TUB

Technical University of Berlin


In Korea,

M.S. Candidate, Computer Architecture Laboratory

Dept. of Computer Science, KAIST


Rm# 4414 CS Dept. KAIST

373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)


Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea

==========================================================
