hadoop-common-user mailing list archives

From elton sky <eltonsky9...@gmail.com>
Subject Re: Applications creates bigger output than input?
Date Sat, 30 Apr 2011 04:31:21 GMT
Thank you for the suggestions:

Weblog analysis, market basket analysis and generating search index.

I guess these applications need more reduces than maps to handle the
large intermediate output, don't they? Besides, the input split for each map
should be smaller than usual, because the spill and merge workload on the
map's local disk is heavy.
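
To make the index case concrete, here is a rough sketch in plain Python (not Hadoop code) of how an inverted index job can blow up in size. The document ids, the toy corpus and the tab-separated size estimate are all made up for illustration; a real job would use its own record format and a combiner could shrink the shuffle.

```python
from collections import defaultdict


def map_phase(docs):
    """Emit one (word, doc_id) pair per word occurrence, as a mapper might."""
    for doc_id, text in docs.items():
        for word in text.split():
            yield word, doc_id


def reduce_phase(pairs):
    """Group document ids by word to build the posting lists."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def shuffle_bytes(pairs):
    """Rough size: each pair written as one 'word<TAB>doc_id' text line."""
    return sum(len(word) + 1 + len(doc_id) + 1 for word, doc_id in pairs)


def index_bytes(index):
    """Rough size: one 'word<TAB>id,id,...' text line per word."""
    return sum(len(word) + 1 + len(",".join(ids)) + 1
               for word, ids in index.items())


# Hypothetical toy corpus: short words, comparatively long document ids.
docs = {
    "doc-0001": "a b a c",
    "doc-0002": "b c d",
    "doc-0003": "a d d e",
}

input_size = sum(len(text) + 1 for text in docs.values())
shuffle = list(map_phase(docs))
index = reduce_phase(shuffle)
shuffle_size = shuffle_bytes(shuffle)
output_size = index_bytes(index)

print("input bytes:  ", input_size)
print("shuffle bytes:", shuffle_size)
print("output bytes: ", output_size)
```

In this toy run the shuffle data is the largest of the three, then the final index, then the input, because every word occurrence carries a full document id through the shuffle.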


On Sat, Apr 30, 2011 at 11:22 AM, Owen O'Malley <omalley@apache.org> wrote:

> On Fri, Apr 29, 2011 at 5:02 AM, elton sky <eltonsky9404@gmail.com> wrote:
> > For my benchmark purpose, I am looking for some non-trivial, real life
> > applications which creates *bigger* output than its input. Trivial example I
> > can think about is cross join...
> >
> As you say, almost all cross join jobs have that property. The other case
> that almost always fits into that category is generating an index. For
> example, if your input is a corpus of documents and you want to generate the
> list of documents that contain each word, the output (and especially the
> shuffle data) is much larger than the input.
> -- Owen
