hadoop-common-user mailing list archives

From "Owen O'Malley" <omal...@apache.org>
Subject Re: Applications creates bigger output than input?
Date Sat, 30 Apr 2011 01:22:24 GMT
On Fri, Apr 29, 2011 at 5:02 AM, elton sky <eltonsky9404@gmail.com> wrote:

> For my benchmark purposes, I am looking for some non-trivial, real-life
> applications which create *bigger* output than their input. A trivial
> example I can think of is a cross join...
>

As you say, almost all cross join jobs have that property. The other case
that almost always fits into that category is generating an index. For
example, if your input is a corpus of documents and you want to generate the
list of documents that contain each word, every word in a document produces
its own (word, document) record, so the output (and especially the shuffle
data) is much larger than the input.
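
To make the index case concrete, here is a rough in-memory sketch in plain
Python, not Hadoop code. The function names, the two-document corpus, and the
byte counting are all made up for illustration, and the document ids are
deliberately longer than the words, which is what drives the growth here.

```python
from collections import defaultdict

def map_doc(doc_id, text):
    """Map step: emit one (word, doc_id) pair per distinct word in a document."""
    for word in sorted(set(text.lower().split())):
        yield word, doc_id

def build_index(corpus):
    """Simulate shuffle and reduce: group doc ids by word into posting lists."""
    shuffled = []
    for doc_id, text in corpus.items():
        shuffled.extend(map_doc(doc_id, text))
    index = defaultdict(list)
    for word, doc_id in sorted(shuffled):  # sorting stands in for the shuffle
        index[word].append(doc_id)
    return shuffled, dict(index)

# Hypothetical toy corpus; real inputs would be files in HDFS.
corpus = {
    "doc-0001": "hadoop runs map reduce jobs",
    "doc-0002": "map reduce jobs shuffle data",
}

shuffled, index = build_index(corpus)

# Crude size estimates: character counts only, ignoring any serialization overhead.
input_bytes = sum(len(d) + len(t) for d, t in corpus.items())
shuffle_bytes = sum(len(w) + len(d) for w, d in shuffled)
output_bytes = sum(len(w) + sum(len(d) for d in ids) for w, ids in index.items())

print("input:", input_bytes, "shuffle:", shuffle_bytes, "output:", output_bytes)
```

The shuffle carries one record per (word, document) pair, while the final
output stores each word once followed by its posting list, which is why the
shuffle comes out larger than the output in this sketch. How big the gap is
in a real job depends on the key and value encoding and on whether a combiner
or compression is used.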

-- Owen
