mahout-dev mailing list archives

From "Bhaskar Devireddy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-1007) Performance improvement in recommenditembased by splitting long records
Date Mon, 07 May 2012 18:40:51 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269860#comment-13269860 ]

Bhaskar Devireddy commented on MAHOUT-1007:
-------------------------------------------

In the unsymmetrify job on the ASF Mail dataset, the first map task takes far longer than the
other map tasks.  It runs on a single core for a long time and does much more work than the
other tasks in the same job.  This patch addresses the issue by splitting the data evenly among
the map tasks so that they all finish in about the same amount of time.  Splitting adds some
overhead, but the map tasks that process the evenly split data can run in parallel on several
cores, which makes the job more scalable.  We measured the performance gain with the patch:
the unsymmetrify job speeds up by more than 6X on x86 architectures.  Our test cluster has
4 data nodes with 8 cores each (32 cores in total).
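
To illustrate the idea of splitting a long record, here is a minimal sketch.  It is not the
code in Patch_1007.patch: the class name RecordSplitter and the method split are hypothetical,
and a record is modelled as a plain list of elements.  The sketch assumes that a limit of 0 or
less means "do not split", in line with the default described for maxSimilarityReducerVectorSize
in the issue description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not taken from Patch_1007.patch.
final class RecordSplitter {

    private RecordSplitter() {}

    /**
     * Splits one long record into consecutive chunks of at most maxSize elements.
     * A maxSize of 0 or less leaves the record unsplit (assumed default behaviour).
     */
    static <T> List<List<T>> split(List<T> record, int maxSize) {
        List<List<T>> chunks = new ArrayList<>();
        if (maxSize <= 0 || record.size() <= maxSize) {
            // Nothing to split: return the record as a single chunk.
            chunks.add(new ArrayList<>(record));
            return chunks;
        }
        for (int start = 0; start < record.size(); start += maxSize) {
            int end = Math.min(start + maxSize, record.size());
            // Copy the sub-range so each chunk is independent of the original list.
            chunks.add(new ArrayList<>(record.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> record = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            record.add(i);
        }
        // Print the size and contents of each chunk for a limit of 5.
        for (List<Integer> chunk : split(record, 5)) {
            System.out.println(chunk.size() + " elements: " + chunk);
        }
    }
}
```

With 12 elements and a limit of 5, the record becomes three chunks of 5, 5 and 2 elements.
Each chunk can then be handed to a different map task, so no single task is left with one
very long record.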
                
> Performance improvement in recommenditembased by splitting long records
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-1007
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1007
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.6
>            Reporter: Bhaskar Devireddy
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.7
>
>         Attachments: Patch_1007.patch
>
>
> While running the recommendations on the ASFEMail dataset using the example script provided
> with Mahout, we noticed that one of the map tasks in the unsymmetrify mapper job has a much
> longer execution time than the others.  Profiling suggests the problem is the number of
> elements in each record.  The attached patch addresses this issue by splitting longer records
> into smaller ones, so that the data is distributed evenly among the unsymmetrify map tasks.
> The patch introduces a new command line option, maxSimilarityReducerVectorSize, for RecommenderJob.
> Tested with maxSimilarityReducerVectorSize=5000: with the same functionality, the unsymmetrify
> mapper job speeds up by several X on x86 architectures and CPU utilization increases.  By
> default, records are not split; setting maxSimilarityReducerVectorSize to a value greater
> than 0 enables splitting and improves performance.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
