hadoop-common-user mailing list archives

From Owen O'Malley <omal...@apache.org>
Subject Re: Suggestions of proper usage of "key" parameter ?
Date Mon, 15 Dec 2008 17:45:49 GMT

On Dec 15, 2008, at 9:06 AM, Ricky Ho wrote:

> Choice 1:  Emit one entry in the reduce(), using doc_name as key
> ==================================================================
> In this case, the output will still be a single entry per invocation  
> of reduce() ...
>
> {
>  "key":"file1",
>  "value":[["apple", 25],
>           ["orange", 16]]
> }

In general, you don't want to gather the inputs in memory, because
that limits the scalability of your application to documents whose
word counts fit in memory. For word counts it isn't a problem, but
I've seen applications that take this approach end up with single
records as large as 500 MB.
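
Concretely, here is a rough old-API sketch of what the choice-1
reduce ends up doing (the class name, and the assumption that the
maps emit (doc_name, word) pairs, are mine, just for illustration):

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Iterator;
  import java.util.Map;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class PerDocReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text docName, Iterator<Text> words,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
      // Every distinct word of the document has to sit in this map
      // before anything can be emitted, so memory use grows with
      // document size.
      Map<String, Integer> counts = new HashMap<String, Integer>();
      while (words.hasNext()) {
        String word = words.next().toString();
        Integer count = counts.get(word);
        counts.put(word, count == null ? 1 : count + 1);
      }
      output.collect(docName, new Text(counts.toString()));
    }
  }

The choice-2 layout below avoids that map entirely: each
(doc_name, word) group arrives at the reduce already grouped, so you
emit one sum and move on.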

> Choice 2:  Emit multiple entries in the reduce(), using [doc_name, word] as key
> ================================================================================
> In this case, the output will have multiple entries per invocation  
> of reduce() ...
>
> {
>  "key":["file1", "apple"],
>  "value":25
> }
>
> {
>  "key":["file1", "orange"],
>  "value":16
> }

This looks pretty reasonable to me, especially if the partitioning is
on both the filename and the word, so that the load across the
reduces is relatively even.
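
Something like this old-API partitioner would do it (the class name
and the "file1\tapple" layout of the composite Text key are
assumptions, just for illustration):

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class DocWordPartitioner
      implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) {
    }

    public int getPartition(Text key, IntWritable value,
                            int numPartitions) {
      // The key carries both parts, e.g. "file1\tapple", so hashing
      // the whole thing spreads one big document's words over all of
      // the reduces. Partitioning on the filename alone would send
      // every word of a document to the same reduce and skew the load.
      return (key.toString().hashCode() & Integer.MAX_VALUE)
          % numPartitions;
    }
  }

(The default HashPartitioner hashes the whole key the same way; a
custom one only matters if you would otherwise partition a composite
key on just its filename half, say for a secondary sort, which is
exactly what would skew the load.)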

> Choice 3:  Emit one entry in the reduce(), using null key
> ============================================================
> In this case, the output will be a single entry per invocation of  
> reduce() ...
>
> {
>  "key":null,
>  "value":[["file1", "apple", 25],
>           ["file1", "orange", 16]]
> }

I don't see any advantage to this one. It has all of the
memory-limiting problems of option 1, and it will of course do bad
things if the downstream consumer isn't expecting null keys.

-- Owen
