mahout-dev mailing list archives

From Raphael Cendrillon <>
Subject Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA
Date Sun, 18 Dec 2011 22:41:00 GMT
Sure. GitHub is actually much easier for me. Generating patches while working on multiple JIRAs
gets messy :)

On Dec 18, 2011, at 2:25 PM, Dmitriy Lyubimov <> wrote:

> PS if it is not terribly difficult, if you could post your patch on
> github, it would be awesome (with complete mahout history based on
> Then we can merge it more easily in case it gets out of sync with the
> trunk HEAD.
> Thank you for doing this.
> On Sun, Dec 18, 2011 at 2:24 PM, Dmitriy Lyubimov <> wrote:
>> If i had to guess, the mapper reported time should be under 1 minute
>> regardless of the input size on any __non-vm__ machine (unless it is
>> IBM XT :) even with -Xmx200m which is hadoop default.
>> The reducer depends on the input size, but unless you manage to
>> generate 1000 mappers, i don't think it will jump out of 1 min either.
>> Thanks.
>> -Dmitriy
>> On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon
>> <> wrote:
>>> Thanks Dmitriy. I tend to agree. Let's pull out the generic and just set it dense.
>>> Let me try out some larger data sets and see how it runs. Do you have any suggestions
>>> / expectations on performance that I should aim for? E.g., given x nodes and a y by y matrix,
>>> the job should take around z minutes?
>>> As a follow-up, would it be worth starting work on the 'brute force' job for
>>> subtracting the average from each of the rows?
>>> On Dec 18, 2011, at 1:56 PM, "Dmitriy Lyubimov (Commented) (JIRA)" <>
>>>> Dmitriy Lyubimov commented on MAHOUT-923:
>>>> -----------------------------------------
>>>> Raphael, thank you for seeing this through.
>>>> Q:
>>>> 1) -- Why do you need a vector class for the accumulator now? The mean is kind
>>>> of expected to be dense in the end, if not in the mappers then at least in the reducer for
>>>> sure. And secondly, if you want to do this, why wouldn't your API accept a class instance,
>>>> not a "short" name? That would be consistent with the Hadoop Job and file format APIs, which
>>>> kind of take classes, not strings.
>>>> 2) -- I know you have a unit test, but did you test it on a simulated input,
>>>> like say 2G big? If not, I will have to test it before you proceed.
>>>> As a next step, I guess I need to try it out to see if it works on various
>>>> kinds of inputs.
>>>>> Row mean job for PCA
>>>>> --------------------
>>>>>                Key: MAHOUT-923
>>>>>                URL:
>>>>>            Project: Mahout
>>>>>         Issue Type: Improvement
>>>>>         Components: Math
>>>>>   Affects Versions: 0.6
>>>>>           Reporter: Raphael Cendrillon
>>>>>           Assignee: Raphael Cendrillon
>>>>>            Fix For: Backlog
>>>>>        Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>>>>> Add map reduce job for calculating mean row (column-wise mean) of a Distributed
>>>>> Row Matrix for use in PCA.
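
For readers following the thread, the job described in MAHOUT-923 (a column-wise mean over a row-partitioned matrix, followed by the 'brute force' subtraction of that mean from each row) can be sketched roughly as follows. This is a minimal plain-Java simulation under stated assumptions, not the attached patch and not the Mahout or Hadoop API: the class RowMeanSketch, the Partial holder, and the methods mapSplit, reduce, and center are all hypothetical names made up for illustration. It also assumes, as suggested in the discussion, that the accumulator is simply dense.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from Mahout or Hadoop.
public class RowMeanSketch {

    /** Partial result from one simulated mapper: dense per-column sums plus a row count. */
    static final class Partial {
        final double[] sums;
        long rows;

        Partial(int cols) {
            this.sums = new double[cols];
        }
    }

    /** "Map" side: accumulate column sums over one split of rows. */
    static Partial mapSplit(List<double[]> rows, int cols) {
        Partial p = new Partial(cols);
        for (double[] row : rows) {
            for (int j = 0; j < cols; j++) {
                p.sums[j] += row[j];
            }
            p.rows++;
        }
        return p;
    }

    /** "Reduce" side: merge all partials, then divide by the total row count. */
    static double[] reduce(List<Partial> partials, int cols) {
        double[] mean = new double[cols];
        long total = 0;
        for (Partial p : partials) {
            for (int j = 0; j < cols; j++) {
                mean[j] += p.sums[j];
            }
            total += p.rows;
        }
        if (total == 0) {
            return mean; // no rows seen: return zeros rather than dividing by zero
        }
        for (int j = 0; j < cols; j++) {
            mean[j] /= total;
        }
        return mean;
    }

    /** Brute-force centering: subtract the mean row from every row. */
    static double[][] center(List<double[]> rows, double[] mean) {
        double[][] out = new double[rows.size()][];
        for (int i = 0; i < rows.size(); i++) {
            double[] row = rows.get(i);
            out[i] = new double[row.length];
            for (int j = 0; j < row.length; j++) {
                out[i][j] = row[j] - mean[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two splits of unequal size stand in for two mappers' inputs.
        List<double[]> split1 = Arrays.asList(new double[] {1, 2}, new double[] {3, 4});
        List<double[]> split2 = Arrays.asList(new double[] {5, 6});
        double[] mean = reduce(Arrays.asList(mapSplit(split1, 2), mapSplit(split2, 2)), 2);
        System.out.println(Arrays.toString(mean)); // prints [3.0, 4.0]
        System.out.println(Arrays.deepToString(center(split1, mean)));
    }
}
```

Because each partial carries its own row count, splits of unequal size combine correctly; averaging the per-split means directly would weight the smaller split too heavily.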
