hadoop-mapreduce-user mailing list archives

From Pedro Magalhaes <pedror...@gmail.com>
Subject Map-Side Join Conditions
Date Wed, 30 Jul 2014 23:32:31 GMT
I'm trying to do a map-side join using CompositeInputFormat. I read in the
book "Hadoop: The Definitive Guide" that I must satisfy certain conditions:
"Each input dataset must be divided into the same number of partitions, and
it must be sorted by the same key (the join key) in each source. All the
records for a particular key must reside in the same partition. This may
sound like a strict requirement (and it is), but it actually fits the
description of the output of a MapReduce job."
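
To make the quoted conditions concrete, here is a minimal sketch in plain
Python (not Hadoop code) of what a map-side join relies on: partition n of
one source is merged only against partition n of the other, so a key that
sits in a different partition on each side is silently missed. The names
merge_join and join_partitions and the sample data are hypothetical, chosen
only for illustration.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs, each sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


def join_partitions(left_parts, right_parts):
    """Join partition n of one source with partition n of the other."""
    if len(left_parts) != len(right_parts):
        raise ValueError("both sources need the same number of partitions")
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        joined.extend(merge_join(left_part, right_part))
    return joined


# Hypothetical data: two partitions per source, split by key parity.
users = [[(2, "bob"), (4, "dan")], [(1, "ann"), (3, "cat")]]
orders = [[(2, "pen"), (2, "ink")], [(1, "cup"), (5, "hat")]]
result = join_partitions(users, orders)

# Same orders, but key 2 has ended up in the "wrong" partition.
misplaced = [[], [(1, "cup"), (2, "pen")]]
lossy_result = join_partitions(users, misplaced)
```

With the co-partitioned inputs, key 2 is joined in partition 0 and key 1 in
partition 1. With the misplaced input, the record for key 2 is never compared
against its match, which is why the partitioning condition matters.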

Do I really need to have all records for a particular key within the same
partition? Will Hadoop assign one map task to each partition file?

I tried to meet these conditions using ORDER BY in Pig Latin, but it does
not put all records with the same key within the same partition:
http://stackoverflow.com/questions/21668974/apache-pig-does-order-by-with-parallel-ensure-consistent-hashing-distribution

How can I meet this condition? Do I need to create an identity
mapper/reducer job just to do this?
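
The identity-job idea can be pictured with a small plain-Python simulation
(again, not Hadoop code). It assumes both datasets go through the same
number of reducers and the same hash-style partition function; identity_job
and partition_for are hypothetical names, and the modulo rule only imitates
a hash partitioner for small non-negative integer keys.

```python
def partition_for(key, num_reducers):
    """Pick a partition from the key alone (assumed hash-style rule)."""
    return (hash(key) & 0x7FFFFFFF) % num_reducers


def identity_job(records, num_reducers):
    """Simulate an identity map/reduce pass over (key, value) records.

    Records are routed by key to a partition, then each partition is
    sorted by key, mimicking one sorted output file per reducer.
    """
    partitions = [[] for _ in range(num_reducers)]
    for key, value in records:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return [sorted(part, key=lambda kv: kv[0]) for part in partitions]


# Hypothetical unsorted inputs with integer join keys.
users = [(3, "cat"), (1, "ann"), (4, "dan"), (2, "bob")]
orders = [(2, "pen"), (5, "hat"), (1, "cup"), (2, "ink")]

user_parts = identity_job(users, 2)
order_parts = identity_job(orders, 2)
```

Because the partition depends only on the key and the reducer count, every
record for a given key lands at the same partition index in both outputs,
and each output is sorted by key. Whether this carries over unchanged to a
real cluster depends on using the same partitioner, reducer count, and key
ordering for both jobs, which is an assumption here.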

Thanks!!!
