hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bryan A. P. Pendleton" ...@geekdom.net>
Subject Re: What do people use Hadoop for?
Date Thu, 25 Jan 2007 22:39:20 GMT
Seems like it wouldn't be more expensive than a few calls to the appropriate
Comparator to figure this out - the OutputCollector merely compares each
output key to the previously output key. If order is preserved, output this
extra truth when the "spill" to disk happens as a header field. If not, you
can stop calling the comparator as soon as output fails to be ordered a
single time. In any case, this means that sorts can be skipped on any output
sequences that are already sorted, and only applied to output sequences that
aren't.

On 1/25/07, Doug Judd <doug@zvents.com> wrote:
>
> Thinking about this a little more, the RDBMS join example is not such a
> good
> one, since you have to sort by foreign key anyway, and you can do this
> sort
> and merge as a single normal Map-reduce job.  However, there are cases
> where
> you know that the output of the map() phase is already "sorted".  You
> should
> be able to set a flag telling the mapred framework that this is the case
> so
> that in the Reduce phase, the files are simply merged together in one pass
> instead of having to do two passes, first verifying that the input is
> sorted
> and a second pass to do the merge.
>
> - Doug
>
> On 1/25/07, Doug Judd <doug@zvents.com> wrote:
> >
> > Comments inline ...
> >
> > On 1/25/07, Bryan A. P. Pendleton <bp@geekdom.net> wrote:
> > >
> > > If the output is already sorted, the sort pass *should* be able to run
> > > in
> > > linear time - perhaps not worth optimizing it out for cases of sorted
> > > output.
> >
> >
> > Agreed.  There's no reason why the framework can't detect this and
> > automatically skip the n log(n) sort.
> >
> > Given the scatter/reassemble nature of the default map/reduce
> > > (scatter by Partition, by default by the Hash), inputs that are sorted
> > > may
> > > not be written as such to output..... so, if you're counting on sorted
> > > data,
> > > maybe it's best to leave the sort in (and verify that the current
> > > infrastructure will perform well given sorted input). Otherwise, if
> > > there is
> > > no implication/need of sorted output, then sort can be totally
> disabled.
> >
> >
> > I guess when I say "sorted" I really mean aggregated.  If you know the
> > input is "aggregated" (e.g. it's the output of a previous Map/Reduce
> job)
> > and the map() function preserves this aggregation, then the step of
> building
> > the intermediate results from the map output should be done in linear
> time.
> >
> > I do feel that this optimization is important.  As Adrezej points out,
> > there are quite a few applications that could benefit from this.  In
> > particular, the join example.  Without it, you can't really justify
> > replacing your large RDBMS system with a Map-Reduce (Bigtable)
> > implementation.
> >
> >  - Doug
> >
> >
> >
> >
>
>


-- 
Bryan A. P. Pendleton
Ph: (877) geek-1-bp

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message