hadoop-mapreduce-dev mailing list archives

From Alex Halter <akh...@nyu.edu>
Subject Re: map-reduce-related school project help
Date Mon, 26 Nov 2012 04:13:14 GMT
Hi, I am working with Randy on this. I know one reason for this project is
that actual usage data from big companies that use MapReduce suggests that
a high percentage of their jobs are small.


On Sun, Nov 25, 2012 at 9:55 PM, rshepherd <rjs471@nyu.edu> wrote:

> Hi Jiang, thanks for your response.
>
> I think the idea would be to make the map-reduce programming paradigm
> usable on small, local jobs. In other words, provide a way to take
> existing jobs that currently run in a distributed fashion and run them
> against a machine-local version. Part of the purpose is educational:
> to illustrate the way map-reduce is implemented and the trade-offs
> involved. I hope this clarifies things.
>
> On 11/25/12 9:54 PM, sampanriver@gmail.com wrote:
> > Hi Randy,
> > The intermediate key-value pairs are not written to HDFS; they are
> > written to the local file system. Besides, if the job is "small", why
> > use MapReduce at all? You can just do it on a local machine.
> >
> > Jiang Shan
> >
> >
> >
> >
> >
> > From: rshepherd
> > Date: 2012-11-26 09:38
> > To: mapreduce-dev
> > Subject: map-reduce-related school project help
> > Hi everybody,
> >
> > I am a student at NYU and am evaluating an idea for a final project
> > for a distributed systems class. The idea is roughly as follows: the
> > overhead of running map-reduce on a 'small' job is high. (A small job
> > would be defined as something fitting in memory on a single machine.)
> > Can Hadoop's map-reduce be modified to be efficient for jobs such as
> > this?
> >
> > It seems that one way to begin to achieve this goal would be to modify
> > the way the intermediate key-value pairs are handled, the "handoff"
> > from the map to the reduce. Rather than writing them to HDFS, either
> > pass them directly to a reducer or keep them in memory in a data
> > structure. Using a single, shared hashmap would alleviate the need to
> > sort the mapper output. Instead, perhaps distribute its slots to a
> > reducer or reducers on multiple threads. My hope is that, as this is a
> > simplification of distributed map-reduce, it will be relatively
> > straightforward to alter the code to an in-memory approach for smaller
> > jobs, one that would perform very well in this special case.
> >
> > I was hoping that someone on the list could help me with the following
> > questions:
> >
> > 1) Does this sound like a good idea that might be achievable in a few
> > weeks?
> > 2) Does my intuition about how to achieve the goal seem reasonable?
> > 3) If so, any advice on how to navigate the code base? (Any pointers to
> > packages/classes of interest would be highly appreciated.)
> > 4) Any other feedback?
> >
> > Thanks in advance to anyone willing and able to help!
> > Randy
>
>
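One baseline that might be worth comparing against: Hadoop can already run an
unmodified job entirely in a single JVM via the LocalJobRunner, which sounds
close to the "machine-local version" Randy mentions. A minimal word-count
driver forced into local mode might look roughly like the sketch below. This
is just an illustration, not code from the project, and the exact
configuration keys depend on the Hadoop version (mapred.job.tracker is the
1.x-era key, mapreduce.framework.name the newer equivalent).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalWordCount {

  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(line.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Run the whole job in this JVM via the LocalJobRunner, not a cluster.
    conf.set("mapred.job.tracker", "local");        // 1.x-style key
    conf.set("mapreduce.framework.name", "local");  // 2.x-style equivalent
    // Keep all input/output on the local file system rather than HDFS.
    conf.set("fs.default.name", "file:///");

    Job job = new Job(conf, "local word count");
    job.setJarByClass(LocalWordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

As far as I understand, even in local mode the map output is still sorted and
spilled to local disk between the map and reduce phases, which is exactly the
overhead the project would try to eliminate for small jobs.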

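And for the in-memory idea itself, here is a rough, Hadoop-free sketch of the
shape Randy describes: mappers emit into a single shared hash map, so no sort
is needed, and the grouped keys are then handed to reducer tasks on a thread
pool. The class and method names below are made up for illustration; a real
prototype inside Hadoop would have to plug into the existing Mapper/Reducer
and job-submission APIs instead.

import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Toy single-machine word count in "map-reduce shape": map tasks emit
 * (word, 1) into a shared ConcurrentHashMap instead of sorting and spilling
 * to disk; reduce tasks then sum the values queued under each key.
 */
public class InMemoryWordCount {

  public static Map<String, Integer> run(List<String> lines, int threads)
      throws InterruptedException {
    // Shared intermediate structure: word -> queue of emitted values.
    final ConcurrentHashMap<String, Queue<Integer>> intermediate =
        new ConcurrentHashMap<String, Queue<Integer>>();

    // Map phase: one task per input line; all threads share 'intermediate'.
    ExecutorService mapPool = Executors.newFixedThreadPool(threads);
    for (final String line : lines) {
      mapPool.execute(new Runnable() {
        public void run() {
          for (String word : line.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            Queue<Integer> q = intermediate.get(word);
            if (q == null) {
              Queue<Integer> fresh = new ConcurrentLinkedQueue<Integer>();
              q = intermediate.putIfAbsent(word, fresh);
              if (q == null) q = fresh;    // this thread won the race
            }
            q.add(1);                      // emit (word, 1)
          }
        }
      });
    }
    mapPool.shutdown();
    mapPool.awaitTermination(1, TimeUnit.HOURS);

    // Reduce phase: each key becomes one reduce task; the pool spreads the
    // tasks across reducer threads. No sort is needed because grouping
    // already happened in the shared map.
    final Map<String, Integer> result = new ConcurrentHashMap<String, Integer>();
    ExecutorService reducePool = Executors.newFixedThreadPool(threads);
    for (final Map.Entry<String, Queue<Integer>> e : intermediate.entrySet()) {
      reducePool.execute(new Runnable() {
        public void run() {
          int sum = 0;
          for (int v : e.getValue()) sum += v;
          result.put(e.getKey(), sum);
        }
      });
    }
    reducePool.shutdown();
    reducePool.awaitTermination(1, TimeUnit.HOURS);
    return result;
  }
}

Calling InMemoryWordCount.run(lines, 4) returns the per-word totals. The
interesting thing to measure would be contention on the shared map versus the
cost of the normal sort/spill/merge path that Hadoop uses between the phases.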