From: "Owen O'Malley"
To: common-dev@hadoop.apache.org
Date: Tue, 3 May 2011 08:43:38 -0700
Subject: Re: Why mergeParts() is not parallel with collect() on map?
References: <48C82C46-B359-4E70-A107-D1E4521A6905@yahoo-inc.com>

On Tue, May 3, 2011 at 1:48 AM, elton sky wrote:

> Pls correct me if I am wrong.
> One of the important assumptions of hadoop map reduce is: map's output
> should be smaller than input.

No, that isn't a valid assumption. MapReduce workloads can roughly be
divided into three categories:

1. scans (map input > shuffle data)
2. sorts (map input = shuffle data = output data)
3. index builds (map input < shuffle data)

Scans are the most common, but far from the only case.

-- Owen
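
The three categories above can be sketched as a small classifier over a job's byte counts. This is only an illustration under stated assumptions, not Hadoop code: the function name classify_workload, the tolerance parameter, and the sample byte counts are all hypothetical, and the sketch compares only map input against shuffle data (it ignores the output-data equality mentioned for sorts).

```python
def classify_workload(map_input_bytes, shuffle_bytes, tolerance=0.05):
    """Label a job as a scan, sort, or index build.

    Hypothetical helper: compares shuffle volume to map input volume.
    'tolerance' is an assumed slack so that roughly equal sizes count
    as a sort; the original message simply says "=".
    """
    if map_input_bytes <= 0:
        raise ValueError("map_input_bytes must be positive")

    ratio = shuffle_bytes / map_input_bytes
    if ratio < 1 - tolerance:
        return "scan"         # map input > shuffle data
    if ratio > 1 + tolerance:
        return "index build"  # map input < shuffle data
    return "sort"             # map input = shuffle data (approximately)


if __name__ == "__main__":
    # Made-up byte counts, purely for illustration.
    samples = [(1000, 10), (1000, 1000), (1000, 5000)]
    for map_in, shuffle in samples:
        print(map_in, shuffle, classify_workload(map_in, shuffle))
```

In practice the two byte counts would have to come from a job's own statistics; where exactly to read them is left open here.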