From: Sebastian Schelter <ssc@apache.org>
To: user@mahout.apache.org
Date: Wed, 26 Sep 2012 15:47:14 +0200
Subject: Re: Combiner applied on multiple map task outputs (like in Mahout SVD)

If I understand the discussion correctly, there is some confusion here. A
map task is not the same as a single invocation of the map function. A map
task consumes an input split and invokes the map function once for each
key-value pair contained in that split. The combine function is then applied
(usually several times, in some implementation-specific way) to the pooled
output of all the map invocations of that map task.

--sebastian

On 26.09.2012 15:40, Sigurd Spieckermann wrote:
> Well, my word selection wasn't great when I said "one map task produces
> only a single result". What I meant was that one map task produces only a
> single outer product (which consists of multiple column vectors, hence
> multiple mapper emits), but those are not the ones to be combined in this
> case, right?
>
> 2012/9/26 Sigurd Spieckermann
>
>> Yes, but one int/vector pair corresponds to the respective column of A
>> multiplied by an element of the respective row of B, correct? So the
>> concatenation of the resulting columns would be the outer product of the
>> column of A and the row of B. None of these vectors are summed up;
>> rather, the outer products of multiple map tasks are summed up. So what
>> is the job of the combiner here? It would be nice if the combiner could
>> sum up all outer products computed on that datanode, but this is the
>> part I can't see happening in Hadoop.
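To make Sebastian's distinction concrete, here is a minimal plain-Python sketch (not Hadoop code; `map_fn`, `run_map_task`, and the word-count example are illustrative names, not from the thread): one map *task* invokes the map function once per record in its split, and the combiner is applied to the pooled output of all those invocations.

```python
from collections import defaultdict

def map_fn(key, value):
    # one map invocation: may emit several (k, v) pairs
    for token in value.split():
        yield token, 1

def run_map_task(split, combine_fn):
    buffered = defaultdict(list)
    for key, value in split:          # one map task = many map invocations
        for k, v in map_fn(key, value):
            buffered[k].append(v)
    # the combiner sees the output of *all* invocations of this task
    return {k: combine_fn(vs) for k, vs in buffered.items()}

split = [(0, "a b a"), (1, "b c")]
print(run_map_task(split, sum))   # {'a': 2, 'b': 2, 'c': 1}
```

Even though each `map_fn` call emits only a few pairs, the combiner still has plenty to do because it operates over the task's whole buffered output, not over a single invocation.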
>> Is the general statement correct that a combiner is only applied to all
>> outputs of a *map task*, and that a map task processes all key-value
>> pairs of a split? In this case, there is only one key-value pair per
>> split, right? The int/vector pair being the index and a column/row of
>> the matrix.
>>
>>
>> 2012/9/26 Jake Mannix
>>
>>> On Wed, Sep 26, 2012 at 4:49 AM, Sigurd Spieckermann <
>>> sigurd.spieckermann@gmail.com> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I'm trying to understand how the combiner in Mahout SVD works. (
>>>> https://cwiki.apache.org/MAHOUT/dimensional-reduction.html) As far as
>>>> I know from the Mahout math matrix-multiplication implementation,
>>>> matrix A is represented by column vectors, matrix B is represented by
>>>> row vectors, and an inner join executes an outer product of the
>>>> columns of A with the rows of B. All outer products are summed by the
>>>> combiners and reducers. What I am wondering about is how a combiner
>>>> can actually combine multiple outer products on the same datanode,
>>>> because the join package requires the data to be partitioned into
>>>> unsplittable files. In this case, I understand that one file contains
>>>> one column/row of its corresponding matrix. Hence, each map task
>>>> receives a column-row tuple, computes the outer product and emits the
>>>> result.
>>>
>>>
>>> This all sounds right, but not the following:
>>>
>>>
>>>> My understanding of Hadoop is that the combiner follows a map task
>>>> immediately, but one map task produces only a single result, so there
>>>> is nothing to combine.
>>>
>>>
>>> That part is not true -- a mapper may emit more than one key-value
>>> pair (and for matrix multiplication this is true *a fortiori*: there
>>> is one int/vector pair emitted per nonzero element of the row being
>>> mapped over).
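Jake's point can be sketched in plain Python (an illustrative sketch under my own naming, not the Mahout/Hadoop implementation): pairing column i of A with row i of B, the mapper emits one (row-index, scaled-vector) pair per nonzero element of the column, and summing the emitted vectors per key yields the rows of A*B.

```python
from collections import defaultdict

def map_outer_product(a_col, b_row):
    # one emit per nonzero element of the column being mapped over
    for j, a_j in enumerate(a_col):
        if a_j != 0:
            yield j, [a_j * b_k for b_k in b_row]

def combine(emits):
    # sum the emitted vectors per row index (what a combiner/reducer does)
    acc = {}
    for j, vec in emits:
        acc[j] = vec if j not in acc else [x + y for x, y in zip(acc[j], vec)]
    return acc

# A = [[1, 2], [0, 3]] stored as columns, B = [[4, 5], [6, 7]] stored as rows
A_cols = [[1, 0], [2, 3]]
B_rows = [[4, 5], [6, 7]]
emits = [kv for a, b in zip(A_cols, B_rows) for kv in map_outer_product(a, b)]
print(combine(emits))   # rows of A*B: {0: [16, 19], 1: [18, 21]}
```

Each column-row pair contributes several int/vector emits, so even a single map task gives the combiner multiple values per key to pre-aggregate before the shuffle.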
>>>
>>>
>>>> If the combiner could accumulate the results of multiple map tasks,
>>>> I would understand the idea, but from my understanding and tests, it
>>>> does not.
>>>>
>>>> Could anyone clarify the process please?
>>>>
>>>> Thanks a lot!
>>>> Sigurd
>>>>
>>>
>>>
>>> --
>>>
>>> -jake