hadoop-common-user mailing list archives

From jason hadoop <jason.had...@gmail.com>
Subject Re: Restrict output of mappers to reducers running on same node?
Date Thu, 18 Jun 2009 02:57:16 GMT
You can open your SequenceFile in the mapper's configure method, write to it
in your map method, and close it in the mapper's close method. That way you end
up with one SequenceFile per map task. I am assuming that each key/value pair
passed to your map somehow represents a single XML file/item.
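
Roughly, the pattern is something like this (a sketch against the old
org.apache.hadoop.mapred API; the output directory property, the file naming,
and the Text key/value types are placeholders you would adapt to your job):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Writes one SequenceFile per map task: opened in configure(), appended to in
// map(), closed in close(). Run with zero reduces so no data crosses the network.
public class XmlToSequenceFileMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private SequenceFile.Writer writer;

  @Override
  public void configure(JobConf conf) {
    try {
      FileSystem fs = FileSystem.get(conf);
      // The task attempt id makes the file name unique per map task.
      Path out = new Path(conf.get("mapred.output.dir"),
          "seq-" + conf.get("mapred.task.id"));
      writer = SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    } catch (IOException e) {
      throw new RuntimeException("Could not open SequenceFile writer", e);
    }
  }

  public void map(Text key, Text value, OutputCollector<Text, Text> output,
      Reporter reporter) throws IOException {
    // Assumes each key/value pair represents a single XML file/item; the
    // OutputCollector is never used, everything goes straight to the writer.
    writer.append(key, value);
  }

  @Override
  public void close() throws IOException {
    if (writer != null) {
      writer.close();
    }
  }
}

Set the number of reduce tasks to zero; combined with an input format that feeds
many XML files to each map task, you end up with one larger SequenceFile per map
instead of one tiny output file per XML input.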

On Wed, Jun 17, 2009 at 7:29 PM, Jothi Padmanabhan <jothipn@yahoo-inc.com> wrote:

> You could look at CombineFileInputFormat to generate a single split out of
> several files.
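
To illustrate that suggestion, here is a rough sketch against the old
org.apache.hadoop.mapred API; the CombineFileRecordReader wrapper conventions
have shifted between Hadoop versions, so treat the class and constructor details
as a starting point rather than the exact API:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Packs many small input files into each split, so one map task handles many XML files.
public class CombinedXmlInputFormat extends CombineFileInputFormat<LongWritable, Text> {

  public CombinedXmlInputFormat() {
    setMaxSplitSize(128 * 1024 * 1024); // cap each combined split at roughly one block
  }

  @Override
  @SuppressWarnings({"unchecked", "rawtypes"})
  public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf,
      Reporter reporter) throws IOException {
    return new CombineFileRecordReader<LongWritable, Text>(conf, (CombineFileSplit) split,
        reporter, (Class) SingleFileLineReader.class);
  }

  // Reads one file of the combined split; CombineFileRecordReader instantiates one of
  // these per file, using exactly this constructor signature.
  public static class SingleFileLineReader extends LineRecordReader {
    public SingleFileLineReader(CombineFileSplit split, Configuration conf,
        Reporter reporter, Integer index) throws IOException {
      super(conf, new FileSplit(split.getPath(index), split.getOffset(index),
          split.getLength(index), split.getLocations()));
    }
  }
}

You would then plug it in with conf.setInputFormat(CombinedXmlInputFormat.class)
so each map task receives many small files instead of one.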
>
> Your partitioner would be able to assign keys to specific reducers, but you
> would not have control over which node a given reduce task runs on.
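
That distinction is worth spelling out: a partitioner only maps each key to a
partition number, and the scheduler still decides where the reduce task for that
partition runs. A minimal sketch against the old org.apache.hadoop.mapred API
(Text keys and values assumed; XmlKeyPartitioner is just an illustrative name):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Chooses a reduce partition number for each key; it has no say in which node
// the JobTracker schedules that reduce task on.
public class XmlKeyPartitioner implements Partitioner<Text, Text> {

  public void configure(JobConf conf) {
    // Nothing to configure in this sketch.
  }

  public int getPartition(Text key, Text value, int numPartitions) {
    // Same scheme as the default HashPartitioner: key hash, modulo number of reducers.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

You would register it with conf.setPartitionerClass(XmlKeyPartitioner.class), but
reduce task placement stays with the scheduler either way.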
>
> Jothi
>
>
> On 6/18/09 5:10 AM, "Tarandeep Singh" <tarandeep@gmail.com> wrote:
>
> > Hi,
> >
> > Can I restrict the output of mappers running on a node to go to reducer(s)
> > running on the same node?
> >
> > Let me explain why I want to do this-
> >
> > I am converting a huge number of XML files into SequenceFiles. So
> > theoretically I don't even need reducers; the mappers would read the XML
> > files and output SequenceFiles. But the problem with this approach is that
> > I will end up with a huge number of small output files.
> >
> > To avoid generating a large number of small files, I can run Identity
> > reducers. But by running reducers, I am unnecessarily transferring data over
> > the network. I ran a test using a small subset of my data (~90 GB). With a
> > map-only job, my cluster finished the conversion in only 6 minutes, but with
> > maps plus Identity reducers, it takes around 38 minutes.
> >
> > I have to process close to a terabyte of data, so I was thinking of faster
> > alternatives:
> >
> > * Writing a custom OutputFormat
> > * Somehow restricting the output of mappers running on a node to reducers
> > running on the same node. Maybe I can write my own partitioner (simple), but
> > I am not sure how Hadoop's framework assigns partitions to reduce tasks.
> >
> > Any pointers?
> >
> > Or is this not possible at all?
> >
> > Thanks,
> > Tarandeep
>
>


-- 
Pro Hadoop, a book to guide you from beginner to Hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
