hadoop-mapreduce-user mailing list archives

From Rick Ross <r...@semanticresearch.com>
Subject Re: I keep getting multiple values for unique reduce keys
Date Mon, 05 Sep 2011 05:14:52 GMT
Thanks, but unless I misread you, that didn't do it. The object that I am
creating just has a couple of ArrayLists to gather up Name and Type objects.

I suspect I need to extend ArrayWritable instead. I'll try that next.

Cheers.

R

On Sep 4, 2011, at 9:37 PM, Sudharsan Sampath wrote:

> Hi,
> 
> I suspect it's something to do with your custom Writable. Do you have a
> clear method on your container? If so, it should be called every time
> before the object is populated, to avoid retaining previous values when
> the object is reused during the ser-de process.
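
A minimal sketch of the clear-before-populate idea, assuming the container keeps its rows in ArrayLists as described in this thread. The class `RowContainer` and its helper `replay` are hypothetical names invented for illustration, and the sketch uses only `java.io` so it runs without Hadoop on the classpath; the `write` and `readFields` methods have the same signatures as the ones declared by `org.apache.hadoop.io.Writable`. The thread does not confirm that this resolved the original poster's problem.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical container; in a real job it would declare
// "implements org.apache.hadoop.io.Writable".
class RowContainer {
    final List<String> names = new ArrayList<>();
    final List<String> types = new ArrayList<>();

    void add(String name, String type) {
        names.add(name);
        types.add(type);
    }

    // Same signature as Writable.write(DataOutput).
    public void write(DataOutput out) throws IOException {
        out.writeInt(names.size());
        for (int i = 0; i < names.size(); i++) {
            out.writeUTF(names.get(i));
            out.writeUTF(types.get(i));
        }
    }

    // Same signature as Writable.readFields(DataInput).
    public void readFields(DataInput in) throws IOException {
        // Reset state first: this instance may still hold the previous record.
        names.clear();
        types.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            names.add(in.readUTF());
            types.add(in.readUTF());
        }
    }

    // Serializes each {name, type} row on its own, then deserializes all of
    // them, in order, into ONE shared instance to imitate object reuse.
    static RowContainer replay(String[][] rows) {
        RowContainer reused = new RowContainer();
        try {
            for (String[] row : rows) {
                RowContainer single = new RowContainer();
                single.add(row[0], row[1]);
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                single.write(new DataOutputStream(bytes));
                reused.readFields(new DataInputStream(
                        new ByteArrayInputStream(bytes.toByteArray())));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reused;
    }

    public static void main(String[] args) {
        RowContainer last = replay(new String[][] {
            {"Cream", "Group"}, {"Led Zeppelin", "Group"}});
        System.out.println(last.names); // only the most recent row remains
    }
}
```

Clearing inside `readFields` itself, rather than relying on callers to invoke a separate clear method, means the reset cannot be skipped by whatever code performs the deserialization.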
> 
> Thanks
> Sudhan S
> 
> 
> 
> On Mon, Sep 5, 2011 at 6:11 AM, Rick Ross <rick@semanticresearch.com> wrote:
> Hi all,
> 
> I have ensured that my mapper produces a unique key for every value it
> writes and, furthermore, that each map() call only writes one value. I
> note here that the value is a custom class for which I handle the
> Writable interface methods.
> 
> I realize that it isn't very real world to have (well, want) no combining
> done prior to reducing, but I'm still getting my feet wet.
> 
> When the reducer runs, I expected to see one reduce() call for every
> map() call, and I do. However, the value I get is the composite of the
> values from all the reduce() calls that came before it.
> 
> So, for example, the mapper gets data like this :
> 
>   ID,     Name,          Type,          Other stuff...
>   A000,   Cream,         Group,         ...
>   B231,   Led Zeppelin,  Group,         ...
>   A044,   Liberace,      Individual,    ...
> 
> 
> ID is the external key from the source data and is guaranteed to be unique.
> 
> When I map it, I create a container for the row data and output that
> container with all the data from that row only, using the ID field as
> the key.
> 
> Since the key is always unique, I expected the sort/shuffle step to never
> coalesce any two values. So I expected my reduce() method to be called
> once per mapped input row, and it is.
> 
> The problem is, as each row is processed, the reducer sees a set of
> cumulative value data instead of a container with a single row of data in
> it. So the 'value' parameter to reduce() always carries the information
> from previous reduce() calls.
> 
> For example, given the data above :
> 
> 1st Reducer Call :
>   Key = A000
>   Value =
>       Container :
>          (object 1) : Name = Cream, Type = Group, MBID = A000, ...
> 
> 2nd Reducer Call :
>   Key = B231
>   Value =
>       Container :
>          (object 1) : Name = Led Zeppelin, Type = Group, MBID = B231, ...
>          (object 2) : Name = Cream, Type = Group, MBID = A000, ...
> 
> So the second reduce call has data in it from the first reduce call. Very
> strange! At a guess I would say the reducer is re-using the object when
> it reads the objects back from the mapping step. I dunno..
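
The guess above can be reproduced outside Hadoop. The sketch below is an illustration under stated assumptions, not the poster's actual code: `LeakyContainer` and `sizesSeenByReducer` are hypothetical names, the container is assumed to append to an ArrayList in `readFields` without resetting it, and a single instance is deserialized into repeatedly to imitate a framework that reuses value objects between calls.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical container that shows the symptom; names only, for brevity.
class LeakyContainer {
    final List<String> names = new ArrayList<>();

    public void write(DataOutput out) throws IOException {
        out.writeInt(names.size());
        for (String name : names) {
            out.writeUTF(name);
        }
    }

    // Deliberate bug: appends to the list without clearing it first.
    public void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            names.add(in.readUTF());
        }
    }

    // Writes each name as its own one-row container, reads every one back
    // into the SAME instance, and records how many rows that instance
    // holds after each read.
    static List<Integer> sizesSeenByReducer(String... rowNames) {
        LeakyContainer reused = new LeakyContainer();
        List<Integer> sizes = new ArrayList<>();
        try {
            for (String rowName : rowNames) {
                LeakyContainer single = new LeakyContainer();
                single.names.add(rowName);
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                single.write(new DataOutputStream(bytes));
                reused.readFields(new DataInputStream(
                        new ByteArrayInputStream(bytes.toByteArray())));
                sizes.add(reused.names.size());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Each serialized record holds exactly one row, yet the reused
        // instance grows with every read.
        System.out.println(sizesSeenByReducer("Cream", "Led Zeppelin", "Liberace"));
        // prints [1, 2, 3]
    }
}
```

Every serialized record here contains exactly one row, so the growth comes entirely from state left over in the reused instance, which matches the pattern of accumulation reported above.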
> 
> If anyone has any ideas, I'm open to suggestions. I'm running
> 0.20.2-cdh3u0.
> 
> Thanks!
> 
> R
> 
> 
> 
> 

