hadoop-common-dev mailing list archives

From "Owen O'Malley" <omal...@apache.org>
Subject Re: Why mergeParts() is not parallel with collect() on map?
Date Tue, 03 May 2011 15:43:38 GMT
On Tue, May 3, 2011 at 1:48 AM, elton sky <eltonsky9404@gmail.com> wrote:

> Pls correct me if I am wrong. One of the important assumptions of hadoop
> map reduce is: map's output should be smaller than input.


No, that isn't a valid assumption. MapReduce workloads can roughly be
divided into three categories:
1. scans (map input > shuffle data)
2. sorts (map input = shuffle data = output data)
3. index builds (map input < shuffle data)

Scans are the most common, but far from the only case.

-- Owen
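
The three categories in the message above can be sketched as a comparison between map input size and shuffle size. This is only an illustration, not anything from Hadoop itself: the function name `classify_workload`, its byte-count parameters, and the `tolerance` band used to treat "roughly equal" sizes as a sort are all assumptions made for the example.

```python
def classify_workload(map_input_bytes, shuffle_bytes, tolerance=0.1):
    """Label a job as a scan, sort, or index build.

    Hypothetical helper: compares shuffle size to map input size.
    `tolerance` is an assumed band around a ratio of 1.0 within which
    the two sizes are treated as equal (the "sort" case).
    """
    if map_input_bytes <= 0:
        raise ValueError("map_input_bytes must be positive")

    ratio = shuffle_bytes / map_input_bytes
    if ratio < 1 - tolerance:
        return "scan"          # map input > shuffle data
    if ratio > 1 + tolerance:
        return "index build"   # map input < shuffle data
    return "sort"              # map input = shuffle data (approximately)


if __name__ == "__main__":
    # Made-up sizes, purely to exercise each branch.
    for name, inp, shuf in [("filter", 1000, 50),
                            ("total sort", 1000, 1000),
                            ("inverted index", 1000, 4000)]:
        print(name, "->", classify_workload(inp, shuf))
```

The point of the sketch is that a map's output can legitimately be smaller than, equal to, or larger than its input, so a design should not depend on the first case alone.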
