mahout-dev mailing list archives

From "Ted Dunning (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-467) Change Iterable<Cooccurrence> in org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer to list or array to improve the performance
Date Thu, 12 Aug 2010 19:22:19 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897904#action_12897904 ]

Ted Dunning commented on MAHOUT-467:
------------------------------------

This does expose another problem, though, in that these counts should be combined in addition
to being reduced.  That means that the count++ should be something like count += value and
there should be an upstream combiner somewhere.

That will do much more good than using lists.
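
A minimal sketch of that idea follows, in plain Java with no Hadoop classes. The names PartialCountSketch and sumCounts are hypothetical and are not part of Mahout. It assumes each mapper emits a count of 1 per cooccurrence and that a combiner has already collapsed some of those into partial sums. Under that assumption, a reducer that increments once per value undercounts, while one that adds the value gets the right total.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not Mahout or Hadoop code.
final class PartialCountSketch {

    // Combiner-safe: adds the carried value, so it works on raw 1s and on partial sums.
    static long sumCounts(Iterator<Long> counts) {
        long total = 0;
        while (counts.hasNext()) {
            total += counts.next();   // count += value
        }
        return total;
    }

    // The count++ style from the issue: correct only if every value stands for exactly one element.
    static int countElements(Iterator<?> iterator) {
        int count = 0;
        while (iterator.hasNext()) {
            count++;
            iterator.next();
        }
        return count;
    }

    public static void main(String[] args) {
        // Two map tasks emit one "1" per cooccurrence (assumed input).
        List<Long> mapOutputA = Arrays.asList(1L, 1L, 1L);
        List<Long> mapOutputB = Arrays.asList(1L, 1L);

        // A combiner runs on each map task's output and emits a single partial sum.
        long partialA = sumCounts(mapOutputA.iterator());
        long partialB = sumCounts(mapOutputB.iterator());
        List<Long> reducerInput = Arrays.asList(partialA, partialB);

        // The reducer sees 2 values instead of 5.
        System.out.println("count += value: " + sumCounts(reducerInput.iterator()));
        System.out.println("count++:        " + countElements(reducerInput.iterator()));
    }
}
```

With a combiner in place, the reducer iterates over far fewer values, which is where the saving would come from; the same summing function can serve as both combiner and reducer because addition is associative and commutative.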

> Change Iterable<Cooccurrence> in org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer to list or array to improve the performance
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-467
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-467
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Hui Wen Han
>             Fix For: 0.4
>
>
> In class AbstractDistributedVectorSimilarity:
>   protected int countElements(Iterator<?> iterator) {
>     int count = 0;
>     while (iterator.hasNext()) {
>       count++;
>       iterator.next();
>     }
>     return count;
>   }
> The method countElements is called repeatedly, but it performs poorly.
> If the iterator has a million elements, we have to iterate a million times just to get the count.
> This method is used in many places:
> 1) DistributedCooccurrenceVectorSimilarity 
> public class DistributedCooccurrenceVectorSimilarity extends AbstractDistributedVectorSimilarity {
>   @Override
>   protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences,
>       double weightOfVectorA, double weightOfVectorB, int numberOfColumns) {
>     return countElements(cooccurrences);
>   }
> }
> One item may be liked by many people; in our system, a single item may be liked by a hundred thousand people.
> Here doComputeResult just returns the number of elements in cooccurrences, but it has to iterate a hundred thousand times to do so.
> If we used a List or array type, we could get the result in one call, because the size is already set when the List or array is constructed.
> 2)  DistributedLoglikelihoodVectorSimilarity
> 3)  DistributedTanimotoCoefficientVectorSimilarity
> I ran a test using DistributedCooccurrenceVectorSimilarity;
> it took 4.5 hours to run RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

