hive-user mailing list archives

From Ryan Harris
Subject RE: DISTRIBUTE BY question
Date Mon, 13 Jul 2015 20:55:16 GMT
this should get you on the right path:

From: Connell Donaghy
Sent: Monday, July 13, 2015 2:50 PM
Subject: DISTRIBUTE BY question

Hey! I'm trying to write a tool that uses a storage handler to store HFiles with a specific
partition function. To do this, I have been using DISTRIBUTE BY with a UDF that takes the
key column and the number of reducers (which becomes the number of partitions, since each
reducer creates its own HFile). However, I have noticed that sometimes two UDF values (say
0 and 11) both go to reducer 0, while reducer 11 gets no input. Could you point me to the
place in the source code where the partitioning for the map/reduce job and DISTRIBUTE BY is
implemented, so that I can reverse-engineer it and ensure the keys go to the right partition?
If my question doesn't make sense, just pointing me to where DISTRIBUTE BY is implemented
would be very helpful. Thank you so much for your time!
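
For context on the behaviour described above: a value handed to DISTRIBUTE BY is not used directly as a reducer index; it is hashed first and the hash is then reduced modulo the reducer count. The sketch below is only an illustration of that idea, using the formula from Hadoop's stock HashPartitioner and plain Java hashCode(). The class name, the method name, and the reducer count of 12 are hypothetical, and Hive's own hashing of the DISTRIBUTE BY expression may differ from Java's, so treat this as a way to reason about the problem rather than a copy of Hive's code.

```java
// Hypothetical sketch: not Hive source code.
class PartitionSketch {
    // Same formula as Hadoop's HashPartitioner: clear the sign bit of the
    // hash, then take the remainder by the number of reducers.
    public static int partitionFor(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 12; // assumed reducer count, for illustration only
        // The same logical value hashes differently depending on its type.
        Object[] keys = {0, 11, "0", "11", 11L, 11.0};
        for (Object k : keys) {
            System.out.println(k.getClass().getSimpleName() + " " + k
                    + " -> reducer " + partitionFor(k, numReducers));
        }
    }
}
```

Running this shows that the reducer chosen depends on the type of the value as well as the value itself, so a UDF returning a string or a double will generally not land on the reducer whose index equals the number it encodes.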
