hadoop-common-user mailing list archives

From Niels Basjes <Ni...@basjes.nl>
Subject Re: Applications creates bigger output than input?
Date Thu, 19 May 2011 12:57:52 GMT
Something I've seen in the past is code that has the input
and outputs

So the number of output records is the same as the length of the input text.
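
The concrete input and output from this message did not survive in the archive, so what follows is only a hypothetical illustration, not the original code. One way to get a record count equal to the input length without much CPU work is a map function that emits every suffix of its input text. The name `suffix_mapper` and the sample string are assumptions made up for this sketch, and it is plain Python rather than the Hadoop API.

```python
def suffix_mapper(key, text):
    """Hypothetical map function: emit one (suffix, offset) record
    per character of the input text."""
    for offset in range(len(text)):
        # Slicing is cheap, so the map side does very little CPU work;
        # the cost shows up in the volume of data that is shuffled.
        yield text[offset:], offset


sample = "hadoop"  # made-up input record
records = list(suffix_mapper(0, sample))

input_chars = len(sample)
output_chars = sum(len(suffix) for suffix, _ in records)

print(len(records), "records from", input_chars, "input characters")
print(output_chars, "output characters in total")
```

For an input of n characters this emits n records holding n(n+1)/2 characters in total, so the intermediate data grows much faster than the input.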


2011/5/19 elton sky <eltonsky9404@gmail.com>:
> Hello,
> I am picking up this topic again because what I am looking for is something
> that is not CPU bound. Augmenting data for ETL and generating an index are
> good examples. Neither of them requires much CPU time on the map side. The
> main bottleneck for them is the shuffle and merge.
> Market basket analysis is CPU intensive in the map phase, because it samples
> all possible combinations of items.
> I am still looking for more applications that create bigger output and
> are not CPU bound.
> Any further ideas? I would appreciate them.
> On Tue, May 3, 2011 at 3:10 AM, Steve Loughran <stevel@apache.org> wrote:
>> On 30/04/2011 05:31, elton sky wrote:
>>> Thank you for the suggestions:
>>> Weblog analysis, market basket analysis and generating a search index.
>>> I guess these applications need more reduces than maps to handle the
>>> large intermediate output, don't they? Besides, the input split for each
>>> map should be smaller than usual, because the spill and merge workload on
>>> the map's local disk is heavy.
>> any form of rendering can generate very large images
>> see: http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf

Kind regards,

Niels Basjes
