From: Sebastian Schelter <ssc@apache.org>
To: user@mahout.apache.org
Date: Wed, 26 Sep 2012 15:47:14 +0200
Subject: Re: Combiner applied on multiple map task outputs (like in Mahout SVD)

If I understand the discussion correctly, there is some confusion here. A
map task is not the same as a single invocation of the map function. A map
task consumes an input split and invokes the map function once for each
key-value pair contained in that split. The combine function is then applied
(usually several times, in some implementation-specific way) to the pooled
output of all the map invocations of that map task.

--sebastian

On 26.09.2012 15:40, Sigurd Spieckermann wrote:
> Well, my word selection wasn't great when I said "one map task produces
> only a single result". What I meant was that one map task produces only a
> single outer product (which consists of multiple column vectors, hence
> multiple mapper emits), but those are not the ones to be combined in this
> case, right?
>
> 2012/9/26 Sigurd Spieckermann
>
>> Yes, but one int/vector pair corresponds to the respective column of A
>> multiplied by an element of the respective row of B, correct? So the
>> concatenation of the resulting columns would be the outer product of the
>> column of A and the row of B. None of these vectors are summed up;
>> rather, the outer products of multiple map tasks are summed up. So what
>> is the job of the combiner here? It would be nice if the combiner could
>> sum up all outer products computed on that datanode, but this is the
>> part I can't see happening in Hadoop.
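To make Sebastian's distinction concrete, here is a minimal plain-Python sketch (not Hadoop code; `map_fn`, `run_map_task`, and the word-count example are illustrative names, not from the thread): one map *task* invokes the map function once per record in its split, and the combiner is applied to the pooled output of all those invocations.

```python
from collections import defaultdict

def map_fn(key, value):
    # one map invocation: may emit several (k, v) pairs
    for token in value.split():
        yield token, 1

def run_map_task(split, combine_fn):
    buffered = defaultdict(list)
    for key, value in split:          # one map task = many map invocations
        for k, v in map_fn(key, value):
            buffered[k].append(v)
    # the combiner sees the output of *all* invocations of this task
    return {k: combine_fn(vs) for k, vs in buffered.items()}

split = [(0, "a b a"), (1, "b c")]
print(run_map_task(split, sum))   # {'a': 2, 'b': 2, 'c': 1}
```

Even though each `map_fn` call emits only a few pairs, the combiner still has plenty to do because it operates over the task's whole buffered output, not over a single invocation.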
>> Is the general statement correct that a combiner is only applied to all
>> outputs of a *map task*, and that a map task processes all key-value
>> pairs of a split? In this case, there is only one key-value pair per
>> split, right? The int/vector pair being the index and a column/row of
>> the matrix.
>>
>>
>> 2012/9/26 Jake Mannix
>>
>>> On Wed, Sep 26, 2012 at 4:49 AM, Sigurd Spieckermann <
>>> sigurd.spieckermann@gmail.com> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I'm trying to understand how the combiner in Mahout SVD works. (
>>>> https://cwiki.apache.org/MAHOUT/dimensional-reduction.html) As far as
>>>> I know from the Mahout math matrix-multiplication implementation,
>>>> matrix A is represented by column vectors, matrix B is represented by
>>>> row vectors, and an inner join executes an outer product of the
>>>> columns of A with the rows of B. All outer products are summed by the
>>>> combiners and reducers. What I am wondering about is how a combiner
>>>> can actually combine multiple outer products on the same datanode,
>>>> because the join package requires the data to be partitioned into
>>>> unsplittable files. In this case, I understand that one file contains
>>>> one column/row of its corresponding matrix. Hence, each map task
>>>> receives a column-row tuple, computes the outer product and emits the
>>>> result.
>>>
>>>
>>> This all sounds right, but not the following:
>>>
>>>
>>>> My understanding of Hadoop is that the combiner follows a map task
>>>> immediately, but one map task produces only a single result, so there
>>>> is nothing to combine.
>>>
>>>
>>> That part is not true -- a mapper may emit more than one key-value
>>> pair (and for matrix multiplication this is true *a fortiori*: there
>>> is one int/vector pair emitted per nonzero element of the row being
>>> mapped over).
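Jake's point can be sketched in plain Python (an illustrative sketch under my own naming, not the Mahout/Hadoop implementation): pairing column i of A with row i of B, the mapper emits one (row-index, scaled-vector) pair per nonzero element of the column, and summing the emitted vectors per key yields the rows of A*B.

```python
from collections import defaultdict

def map_outer_product(a_col, b_row):
    # one emit per nonzero element of the column being mapped over
    for j, a_j in enumerate(a_col):
        if a_j != 0:
            yield j, [a_j * b_k for b_k in b_row]

def combine(emits):
    # sum the emitted vectors per row index (what a combiner/reducer does)
    acc = {}
    for j, vec in emits:
        acc[j] = vec if j not in acc else [x + y for x, y in zip(acc[j], vec)]
    return acc

# A = [[1, 2], [0, 3]] stored as columns, B = [[4, 5], [6, 7]] stored as rows
A_cols = [[1, 0], [2, 3]]
B_rows = [[4, 5], [6, 7]]
emits = [kv for a, b in zip(A_cols, B_rows) for kv in map_outer_product(a, b)]
print(combine(emits))   # rows of A*B: {0: [16, 19], 1: [18, 21]}
```

Each column-row pair contributes several int/vector emits, so even a single map task gives the combiner multiple values per key to pre-aggregate before the shuffle.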
>>>
>>>
>>>> If the combiner could accumulate the results of multiple map tasks,
>>>> I would understand the idea, but from my understanding and tests, it
>>>> does not.
>>>>
>>>> Could anyone clarify the process please?
>>>>
>>>> Thanks a lot!
>>>> Sigurd
>>>>
>>>
>>>
>>> --
>>>
>>> -jake