hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Niels Basjes <Ni...@basjes.nl>
Subject Re: Efficiently partition broadly distributed keys
Date Thu, 10 Mar 2011 22:07:59 GMT
Hi Luca,

2011/3/10 Luca Aiello <alucca@yahoo-inc.com>:

> thanks for the quick response. So in your opinion there is nothing like a "hadoop embedded"
tool to do this. This is what I suspected indeed.

The mapreduce model simply uses the <key> as the pivot of the
processing. In your application specific situation your looking for a
way to "break" that model in a smart way. So no, Hadoop doesn't do

> Since the <key, value> pairs in the initial dataset are randomly partitioned in
the input files,
> I suppose that I can avoid the initial statistic step and do statistics "on the fly".
> For example, if I have 3 keys A,B,C with probability 0.7, 0.2, and 0.1, every mapper
will receive,
> on average, a input composed by 70% of A keys, 20% of Bs and 10% of Cs
> and so it will know that the <key,value> to be chunked are de As.

If you know this in advance you can surely put that into the the
mapper. But that simply means you've done the statistics in advance in
some way.
I fail to see how you could do this "on the fly" without prior knowledge.

As a slightly smarter way of doing something like this; Have a look at
the terasort example.
There a sample of the data is used to estimate the distribution.

> This can introduce some errors but it should produce a output which is quite uniformly
> Thanks again!

You're welcome.


> On Mar 10, 2011, at 12:23 PM, Niels Basjes wrote:
>> If I understand your problem correctly you actually need some way of
>> knowing if you need to "chop" a large set with a specific key in to
>> subsets.
>> In mapreduce the map only has information about a single key at a
>> time. So you need something extra.
>> One way of handling this is to start by doing a statistical run first
>> and use the output to determine which keys can be chopped.
>> So first do a MR to determine per key how many records and/or bytes
>> need to be handled.
>> In the reducer you have a "lower limit" that ensures the reducer only
>> outputs the keys that need to be chopped.
>> In the second pass you do your actual algorithm with one twist: In the
>> mapper you use the output of the first run to determine if the key
>> needs to be rewritten and in how many variants. You then randomly (!)
>> choose a variant.
>> Example of what i mean:
>> Assume you have 100 records with key A and 10000 records with key B.
>> You determine that you want key groups of an approximate maximum of
>> 1000 records.
>> The first pass outputs that key B must be split into 10 variants (=10000/1000).
>> Then the second MAP will transform a key B into one of these randomly:
>> B-1, B-2, B-3 ...
>> A record with key A will remain unchanged.
>> Disadvantage:
>>   Each run will show different results.
>>   Only works if the set of keys that needs to be chopped is small
>> enough so you can have it in memory in the call to the second map.
>> HTH
>> Niels Basjes
>> 2011/3/10 Luca Aiello <alucca@yahoo-inc.com>:
>>> Dear users,
>>> hope this is the right list to submit this one, otherwise I apologize.
>>> I'd like to have your opinion about a problem that I'm facing on MapReduce framework.
I am writing my code in Java and running on a grid.
>>> I have a textual input structured in <key, value> pairs. My task is to
make the cartesian product of all the values that have the same key.
>>> I can do so it simply using <key> as my map key, so that every value with
the same key is put in the same reducer, where I can easily process them and obtain the cartesian.
>>> However, my keys are not uniformly distributed, the distribution is very broad.
As a result of this, the output of my reducers will be very unbalanced and I will have many
small files (some KB) and a bunch of huge files (tens of GB). A sub-optimal yet acceptable
approximation of my task would be to make the cartesian product of smaller chunks of values
for very frequent keys, so that the load is distributed evenly among reducers. I am wondering
how can I do this in the most efficient/elegant way.
>>> It appears to me that using a customized Partitioner is not the right way to
act, since records with the same key have still to be mapped together (am I right?).
>>> The only solution that comes into my mind is to split the key space artificially
insider the mapper (e.g., for a frequent key "ABC", map the values on the reducers using keys
like "ABC1", "ABC2", an so on). This would require an additional post-processing cleanup phase
to retrieve the original keys.
>>> Do you know a better, possibly automatic way to perform this task?
>>> Thank you!
>>> Best
>>> Luca
>> --
>> Met vriendelijke groeten,
>> Niels Basjes

Met vriendelijke groeten,

Niels Basjes

View raw message