hadoop-common-user mailing list archives

From Tarandeep Singh <tarand...@gmail.com>
Subject Re: Restrict output of mappers to reducers running on same node?
Date Thu, 18 Jun 2009 16:43:27 GMT
Jason, correct me if I am wrong-

Opening a sequence file in the configure method (or the setup method in 0.20)
and writing to it is the same as doing output.collect(), unless you mean I
should make the sequence file writer a static variable and set the JVM reuse
flag to -1. In that case the subsequent mappers might run in the same JVM, use
the same writer, and hence produce one file. But then I need to add a hook to
close the writer, maybe a shutdown hook.
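
The static-writer idea above can be sketched in plain Java. This is only a minimal illustration under stated assumptions, not the framework's API: the class name SharedWriterSketch and its methods are hypothetical, and a java.io.StringWriter stands in for the SequenceFile.Writer that a real mapper would open. It also assumes JVM reuse is enabled (the job property mapred.job.reuse.jvm.num.tasks set to -1), so that several map tasks actually share one JVM.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical sketch: one writer per JVM, shared by every task run in that JVM.
class SharedWriterSketch {
    private static Writer shared;                  // lives as long as the JVM does
    private static boolean hookRegistered = false;

    // Each task would call this from configure()/setup(); only the first call opens the writer.
    static synchronized Writer getWriter() {
        if (shared == null) {
            shared = new StringWriter();           // assumption: stand-in for the real output file
            if (!hookRegistered) {
                // Close the writer when the JVM exits, since no single task "owns" it.
                Runtime.getRuntime().addShutdownHook(new Thread(SharedWriterSketch::closeQuietly));
                hookRegistered = true;
            }
        }
        return shared;
    }

    // Each task would call this from map() instead of output.collect().
    static synchronized void write(String key, String value) {
        try {
            getWriter().write(key + "\t" + value + "\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Run by the shutdown hook; errors are ignored because the JVM is already exiting.
    static synchronized void closeQuietly() {
        if (shared != null) {
            try {
                shared.close();
            } catch (IOException e) {
                // nothing useful can be done at this point
            }
        }
    }
}
```

One caveat with this design: a shutdown hook does not run if the JVM is killed forcibly, so records still buffered in the writer could be lost in that case.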

Jothi, the combine input format idea is good. But I guess I have to write
something of my own to make it work in my case.

Thanks, guys, for the suggestions... but I feel we should have some support
from the framework to merge the output of a mapper-only job so that we don't
get a large number of small files. Sometimes you just don't want to run
reducers and unnecessarily transfer a whole lot of data across the network.

Thanks,
Tarandeep

On Wed, Jun 17, 2009 at 7:57 PM, jason hadoop <jason.hadoop@gmail.com> wrote:

> You can open your sequence file in the mapper configure method, write to it
> in your map, and close it in the mapper close method.
> Then you end up with 1 sequence file per map. I am making an assumption
> that each key/value pair passed to your map somehow represents a single XML
> file/item.
>
> On Wed, Jun 17, 2009 at 7:29 PM, Jothi Padmanabhan <jothipn@yahoo-inc.com> wrote:
>
> > You could look at CombineFileInputFormat to generate a single split out of
> > several files.
> >
> > Your partitioner would be able to assign keys to specific reducers, but you
> > would not have control over which node a given reduce task will run on.
> >
> > Jothi
> >
> >
> > On 6/18/09 5:10 AM, "Tarandeep Singh" <tarandeep@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Can I restrict the output of mappers running on a node to go to reducer(s)
> > > running on the same node?
> > >
> > > Let me explain why I want to do this-
> > >
> > > I am converting a huge number of XML files into SequenceFiles. So
> > > theoretically I don't even need reducers; mappers would read the XML files
> > > and output SequenceFiles. But the problem with this approach is that I will
> > > end up with a huge number of small output files.
> > >
> > > To avoid generating a large number of small files, I can use Identity
> > > reducers. But by running reducers, I am unnecessarily transferring data over
> > > the network. I ran a test using a small subset of my data (~90GB). With a
> > > map-only job, my cluster finished the conversion in only 6 minutes. But with
> > > a map and Identity reducer job, it takes around 38 minutes.
> > >
> > > I have to process close to a terabyte of data, so I was thinking of faster
> > > alternatives-
> > >
> > > * Writing a custom OutputFormat
> > > * Somehow restricting the output of mappers running on a node to go to
> > > reducers running on the same node. Maybe I can write my own partitioner
> > > (simple), but I am not sure how Hadoop's framework assigns partitions to
> > > reduce tasks.
> > >
> > > Any pointers?
> > >
> > > Or is this not possible at all?
> > >
> > > Thanks,
> > > Tarandeep
> >
> >
>
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.amazon.com/dp/1430219424?tag=jewlerymall
> www.prohadoopbook.com a community for Hadoop Professionals
>
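
Jothi's point in the quoted thread, that a partitioner chooses a reducer but not a node, can be illustrated with a small sketch. It uses the same arithmetic as Hadoop's default HashPartitioner, written here as plain Java so it runs without Hadoop on the classpath; the class and method names are hypothetical. The return value is only a partition index between 0 and numReduceTasks - 1. Nothing in the inputs or the output identifies a host, which is why a custom partitioner alone cannot keep map output on the local node.

```java
// Hypothetical stand-alone sketch of hash partitioning.
class PartitionSketch {
    // Maps a key to a partition index in [0, numReduceTasks).
    // Masking with Integer.MAX_VALUE clears the sign bit so that a negative
    // hashCode() cannot produce a negative index.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Which machine ends up running the reduce task for a given index is decided by the scheduler at run time, after the partitioner has already done its work.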
