hadoop-general mailing list archives

From John Meagher <john.meag...@gmail.com>
Subject Re: Applications creates bigger output than input?
Date Fri, 29 Apr 2011 13:12:33 GMT
Another case is augmenting data.  This is sometimes done outside of MR
in an ETL flow, but it can also be done as an MR job.  Doing it this
way uses Hadoop to handle the scaling issues, though it isn't really
what MR is intended for.

A real example of this is:

* Input: standard apache weblog
* Data added...
  - Geolocation of IP
  - Decoding URL
  - Adding information based on visited URL / Ref URL ...
  - Adding information based on the user
* Output: complex binary object written to a sequence file
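
To make the size growth concrete, here is a rough sketch of the per-record
augmentation step in plain Python, outside Hadoop. It is not the actual job:
the GEO_BY_IP and USER_INFO tables, the augment() function and the sample
log line are all made up for illustration, and the record is serialized as
JSON rather than as a binary object in a sequence file.

```python
import json
import re
from urllib.parse import unquote, urlsplit

# Apache combined log format:
# host ident user [time] "request" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Hypothetical lookup tables standing in for real geolocation and user data.
GEO_BY_IP = {"203.0.113.7": {"country": "AU", "city": "Sydney"}}
USER_INFO = {"alice": {"segment": "returning", "signup_year": 2009}}


def augment(line):
    """Parse one log line and attach the extra fields; None if unparseable."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    parts = urlsplit(record["url"])
    record["path"] = unquote(parts.path)    # decoded URL path
    record["query"] = unquote(parts.query)  # decoded query string
    record["geo"] = GEO_BY_IP.get(record["ip"], {})
    record["user_info"] = USER_INFO.get(record["user"], {})
    return record


# Made-up sample line for illustration only.
SAMPLE = (
    '203.0.113.7 - alice [29/Apr/2011:13:12:33 +0000] '
    '"GET /search%20results?q=map%20reduce HTTP/1.1" 200 5120 '
    '"http://example.com/start" "Mozilla/5.0"'
)

if __name__ == "__main__":
    enriched = json.dumps(augment(SAMPLE))
    # Compare input and output sizes for this single record.
    print(len(SAMPLE), len(enriched))
```

In an MR job this would be the body of a map-only task: each input line
yields exactly one output record, and every output record carries the
original fields plus the added ones, so the output is larger than the input.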

On Fri, Apr 29, 2011 at 08:02, elton sky <eltonsky9404@gmail.com> wrote:
> One of the assumptions MapReduce makes, I think, is that the size of a map's
> output is smaller than its input, although many applications produce output
> the same size as their input, e.g. sort, merge, etc.
> For benchmarking purposes, I am looking for some non-trivial, real-life
> applications that create *bigger* output than their input. A trivial example I
> can think of is a cross join...
> I would really appreciate it if you shared your knowledge with me.
