hadoop-common-user mailing list archives

From Moritz Krog <moritzk...@googlemail.com>
Subject Hadoop Streaming
Date Wed, 14 Jul 2010 07:35:25 GMT
Hi everyone,

I'm pretty new to Hadoop and generally avoiding Java everywhere I can, so
I'm getting started with Hadoop streaming and python mapper and reducer.
From what I read in the mapreduce tutorial, the mapper and reducer can be plugged
into Hadoop via the "-mapper" and "-reducer" options on job start. I was
wondering what the input for the reducer would look like, so I ran a Hadoop
job using my own mapper but /bin/cat as reducer. As you can see, the output
of the job is ordered, but the keys haven't been combined:

{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type': 'person'}   107488
{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type': 'person'}   95560
{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type': 'person'}   95562

I would have expected something like:

{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type': 'person'}   95560, 95562, 107488

My understanding from the tutorial was that this grouping is part of the
shuffle and sort phase. Or do I need to use a combiner to get that done?
Does Hadoop streaming even do this, or do I need to use a native Java class?
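In case it matters, the kind of grouping I was hoping to avoid writing
myself would look roughly like this (just a sketch; parse and reduce_lines
are my own names, and I'm assuming streaming hands the reducer one
"key<TAB>value" line per record, already sorted by key):

```python
#!/usr/bin/env python
# sketch of a streaming reducer that groups consecutive identical keys
import sys
from itertools import groupby

def parse(line):
    # split key from value on the last tab of each "key\tvalue" line
    key, _, value = line.rstrip("\n").rpartition("\t")
    return key, value

def reduce_lines(lines):
    # input is assumed sorted by key, so consecutive grouping is enough
    for key, group in groupby(map(parse, lines), key=lambda kv: kv[0]):
        yield "%s\t%s" % (key, ", ".join(v for _, v in group))

if __name__ == "__main__":
    for out in reduce_lines(sys.stdin):
        print(out)
```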

