hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gopal Vijayaraghavan <>
Subject Re: Hi, Hive People urgent question about [Distribute By] function
Date Tue, 27 Oct 2015 17:49:30 GMT

> I want to override partitionByHash function on Flink like the same way
>of DBY on Hive.
> I am working on implementing some benchmark system for these two system,
>which could be contritbutino to Hive as well.

I would be very disappointed if Flink fails to outperform Hive with a
Distribute BY, because the hive version is about 5-10x slower than it can
be with Tez.

Mapreduce forces a full sort of the output data, so the Hive version will
be potentially O(N*LOG(N)) by default while Flink should be able to do

Assuming you don't turn on any of the compatibility modes, the hashCode
generated would be a murmur hash after encoding data into a byte[] using
BinarySortableSerDe & the data is then sorted using
key=(murmur_hash(byte[]) % n-reducers).

The reducers then pull the data, merge-sort using the disk which is
entirely wasted CPU.

If you or anyone's interested in fixing this for Tez, I have a JIRA open
to the fix the hash-only shuffle -


View raw message