hadoop-common-user mailing list archives

From "Owen O'Malley" <omal...@apache.org>
Subject Re: Applications creates bigger output than input?
Date Sat, 30 Apr 2011 01:22:24 GMT
On Fri, Apr 29, 2011 at 5:02 AM, elton sky <eltonsky9404@gmail.com> wrote:

> For my benchmark purposes, I am looking for some non-trivial, real-life
> applications that create *bigger* output than their input. A trivial example
> I can think of is a cross join...

As you say, almost all cross join jobs have that property. The other case
that almost always fits into that category is generating an index. For
example, if your input is a corpus of documents and you want to generate the
list of documents that contain each word, the output (and especially the
shuffle data) is much larger than the input.
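
To make the index case concrete, here is a rough sketch in plain Python that
mimics the map, shuffle, and reduce steps in memory. It is not Hadoop code; the
function names (map_phase, reduce_phase), the document ids, and the byte
counting are all assumptions made up for illustration, and the toy corpus is
far too small to say anything about real ratios.

```python
from collections import defaultdict

def map_phase(docs):
    """Emit one (word, doc_id) pair per distinct word in each document."""
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            yield word, doc_id

def reduce_phase(pairs):
    """Group the shuffled pairs by word into sorted lists of document ids."""
    index = defaultdict(list)
    for word, doc_id in pairs:
        index[word].append(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical two-document corpus, keyed by a made-up document id.
docs = {
    "doc1": "to be or not to be",
    "doc2": "to do is to be",
}

shuffle = list(map_phase(docs))   # what would cross the network
index = reduce_phase(shuffle)     # word -> documents containing it

# Crude size estimate: characters of text in, characters of key + value out.
input_bytes = sum(len(text) for text in docs.values())
shuffle_bytes = sum(len(word) + len(doc_id) for word, doc_id in shuffle)

print("input bytes:  ", input_bytes)
print("shuffle bytes:", shuffle_bytes)
print("index:", index)
```

Every word in the input is re-emitted together with a document id, which is
why the shuffle volume outgrows the input. Emitting one pair per occurrence
(or adding positions) instead of one per distinct word would inflate it
further.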

-- Owen
