Dear Niels,
thanks for the quick response. So in your opinion there is nothing like a "hadoop embedded"
tool to do this. This is what I suspected indeed.
Since the <key, value> pairs in the initial dataset are randomly partitioned in the
input files, I suppose that I can avoid the initial statistic step and do statistics "on the
fly".
For example, if I have 3 keys A,B,C with probability 0.7, 0.2, and 0.1, every mapper will
receive, on average, a input composed by 70% of A keys, 20% of Bs and 10% of Cs and so it
will know that the <key,value> to be chunked are de As. This can introduce some errors
but it should produce a output which is quite uniformly distributed.
Thanks again!
Luca
On Mar 10, 2011, at 12:23 PM, Niels Basjes wrote:
> If I understand your problem correctly you actually need some way of
> knowing if you need to "chop" a large set with a specific key in to
> subsets.
> In mapreduce the map only has information about a single key at a
> time. So you need something extra.
>
> One way of handling this is to start by doing a statistical run first
> and use the output to determine which keys can be chopped.
>
> So first do a MR to determine per key how many records and/or bytes
> need to be handled.
> In the reducer you have a "lower limit" that ensures the reducer only
> outputs the keys that need to be chopped.
> In the second pass you do your actual algorithm with one twist: In the
> mapper you use the output of the first run to determine if the key
> needs to be rewritten and in how many variants. You then randomly (!)
> choose a variant.
>
> Example of what i mean:
> Assume you have 100 records with key A and 10000 records with key B.
> You determine that you want key groups of an approximate maximum of
> 1000 records.
> The first pass outputs that key B must be split into 10 variants (=10000/1000).
> Then the second MAP will transform a key B into one of these randomly:
> B1, B2, B3 ...
> A record with key A will remain unchanged.
>
> Disadvantage:
> Each run will show different results.
> Only works if the set of keys that needs to be chopped is small
> enough so you can have it in memory in the call to the second map.
>
> HTH
>
> Niels Basjes
>
>
> 2011/3/10 Luca Aiello <alucca@yahooinc.com>:
>> Dear users,
>>
>> hope this is the right list to submit this one, otherwise I apologize.
>>
>> I'd like to have your opinion about a problem that I'm facing on MapReduce framework.
I am writing my code in Java and running on a grid.
>>
>> I have a textual input structured in <key, value> pairs. My task is to make
the cartesian product of all the values that have the same key.
>> I can do so it simply using <key> as my map key, so that every value with the
same key is put in the same reducer, where I can easily process them and obtain the cartesian.
>>
>> However, my keys are not uniformly distributed, the distribution is very broad. As
a result of this, the output of my reducers will be very unbalanced and I will have many small
files (some KB) and a bunch of huge files (tens of GB). A suboptimal yet acceptable approximation
of my task would be to make the cartesian product of smaller chunks of values for very frequent
keys, so that the load is distributed evenly among reducers. I am wondering how can I do this
in the most efficient/elegant way.
>>
>> It appears to me that using a customized Partitioner is not the right way to act,
since records with the same key have still to be mapped together (am I right?).
>> The only solution that comes into my mind is to split the key space artificially
insider the mapper (e.g., for a frequent key "ABC", map the values on the reducers using keys
like "ABC1", "ABC2", an so on). This would require an additional postprocessing cleanup phase
to retrieve the original keys.
>>
>> Do you know a better, possibly automatic way to perform this task?
>>
>> Thank you!
>> Best
>>
>> Luca
>>
>>
>
>
>
> 
> Met vriendelijke groeten,
>
> Niels Basjes
