mahout-dev mailing list archives

From "Hui Wen Han (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAHOUT-467) Change Iterable<Cooccurrence> in org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer to list or array to improve the performance
Date Thu, 12 Aug 2010 13:18:17 GMT
Change Iterable<Cooccurrence> in  org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer
 to list or array to improve the performance
----------------------------------------------------------------------------------------------------------------------------------------------------------

                 Key: MAHOUT-467
                 URL: https://issues.apache.org/jira/browse/MAHOUT-467
             Project: Mahout
          Issue Type: Improvement
          Components: Collaborative Filtering
    Affects Versions: 0.4
            Reporter: Hui Wen Han
             Fix For: 0.4


In class AbstractDistributedVectorSimilarity:


  protected int countElements(Iterator<?> iterator) {
    int count = 0;
    while (iterator.hasNext()) {
      count++;
      iterator.next();
    }
    return count;
  }

The method countElements is called repeatedly, but it performs poorly.

If the iterator has a million elements, we have to iterate a million times just to get the
element count.


This method is used in many places:
1) DistributedCooccurrenceVectorSimilarity 

public class DistributedCooccurrenceVectorSimilarity extends AbstractDistributedVectorSimilarity
{

  @Override
  protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences,
      double weightOfVectorA, double weightOfVectorB, int numberOfColumns) {
    return countElements(cooccurrences);
  }

}

One item may be liked by many people. In our system, a single item may be liked by a hundred
thousand people.
Here doComputeResult just returns the number of elements in cooccurrences, but it has to
iterate a hundred thousand times to get it.

If we used a List or array type, we could get the result in one call, because the size is
already known once the List or array has been constructed.

2)  DistributedLoglikelihoodVectorSimilarity
3)  DistributedTanimotoCoefficientVectorSimilarity
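
A minimal sketch of the idea, not the actual Mahout code: the class name CountSketch is
hypothetical, and it assumes the cooccurrences have already been buffered into a
java.util.Collection before the count is requested. In that case size() answers in one call;
otherwise the sketch falls back to the same linear walk as countElements above.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper for illustration only; not part of Mahout.
final class CountSketch {

  private CountSketch() {}

  /** Returns size() directly when the source is a Collection, else walks it. */
  static int countElements(Iterable<?> iterable) {
    if (iterable instanceof Collection) {
      // A List (or any Collection) already knows its size: no iteration needed.
      return ((Collection<?>) iterable).size();
    }
    return countElements(iterable.iterator());
  }

  /** Linear fallback: one step per element, as in the method quoted in this issue. */
  static int countElements(Iterator<?> iterator) {
    int count = 0;
    while (iterator.hasNext()) {
      iterator.next();
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    List<Integer> buffered = new ArrayList<>();
    for (int i = 0; i < 100_000; i++) {
      buffered.add(i);
    }
    // Same count either way; only the first call avoids walking the elements.
    System.out.println(countElements(buffered));
    System.out.println(countElements(buffered.iterator()));
  }
}
```

Buffering the values into a list costs memory proportional to the number of cooccurrences,
so whether this is a net win depends on how large a single row's list can become.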


I ran a test using DistributedCooccurrenceVectorSimilarity:
the RowSimilarityJob-CooccurrencesMapper-SimilarityReducer step took 4.5 hours to run.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

