Mailing-List: contact common-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-dev@hadoop.apache.org
MIME-Version: 1.0
In-Reply-To: <BANLkTikAmvGvi=A+9TjdfBG4KyEEG9=nTw@mail.gmail.com>
References: <BANLkTimqGv0g+D10+sKS05+yBEP2epfT-g@mail.gmail.com>
	<48C82C46-B359-4E70-A107-D1E4521A6905@yahoo-inc.com>
	<BANLkTikAmvGvi=A+9TjdfBG4KyEEG9=nTw@mail.gmail.com>
Date: Tue, 3 May 2011 08:43:38 -0700
Message-ID: <BANLkTi=2hj4JSC_S-NLMxjZFLG9frRAVpg@mail.gmail.com>
Subject: Re: Why mergeParts() is not parallel with collect() on map?
From: "Owen O'Malley" <omalley@apache.org>
To: common-dev@hadoop.apache.org
Content-Type: multipart/alternative; boundary=0016364184fff9d88c04a2610054

--0016364184fff9d88c04a2610054
Content-Type: text/plain; charset=UTF-8

On Tue, May 3, 2011 at 1:48 AM, elton sky <eltonsky9404@gmail.com> wrote:

> Please correct me if I am wrong. One of the important assumptions of Hadoop
> MapReduce is that a map's output should be smaller than its input.


No, that isn't a valid assumption. MapReduce workloads can roughly be
divided into three categories:
1. scans (map input > shuffle data)
2. sorts (map input = shuffle data = output data)
3. index builds (map input < shuffle data)

Scans are the most common, but far from the only case.
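
As a rough illustration of the three categories above, here is a minimal
sketch that labels a job by comparing its map input size to its shuffle
size. Nothing here is a Hadoop API: the function name, the byte-count
arguments, and the `tolerance` band used to decide when two sizes count as
"equal" are all assumptions made for the example. It also ignores the
output-size part of the sort case.

```python
def classify_workload(map_input_bytes, shuffle_bytes, tolerance=0.1):
    """Label a job as a scan, sort, or index build.

    Hypothetical helper, not part of Hadoop. `tolerance` is an assumed
    relative band within which the shuffle size is treated as equal to
    the map input size.
    """
    if map_input_bytes <= 0:
        raise ValueError("map_input_bytes must be positive")

    # Ratio of data sent through the shuffle to data read by the maps.
    ratio = shuffle_bytes / map_input_bytes

    if ratio < 1 - tolerance:
        return "scan"         # map input > shuffle data
    if ratio > 1 + tolerance:
        return "index build"  # map input < shuffle data
    return "sort"             # map input roughly equals shuffle data


if __name__ == "__main__":
    # Illustrative sizes only, not measurements from a real cluster.
    for sizes in [(100, 5), (100, 100), (100, 300)]:
        print(sizes, classify_workload(*sizes))
```

In practice the two sizes would come from a job's counters; which counters
to read depends on the Hadoop version, so that part is left out.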

-- Owen

--0016364184fff9d88c04a2610054--