mahout-dev mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAHOUT-430) AbstractSimilarity improperly computes vector metrics
Date Mon, 28 Jun 2010 16:02:50 GMT

     [ https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-430:
-----------------------------

    Priority: Minor  (was: Major)
    Due Date: 30/Jun/10

Hmm, yeah, that doesn't look right in the case where you have the inferrer. Let me look at
it again tonight and either put in a fix if needed or remember why it's done that way.

> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
>                 Key: MAHOUT-430
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-430
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Emerson Murphy-Hill
>            Assignee: Sean Owen
>            Priority: Minor
>
> Looking at the userSimilarity and itemSimilarity methods in AbstractSimilarity, both
> compute metrics such as 'sumX' and 'sumY' over each User's/Tool's PreferenceArrays.
> The algorithms walk both PreferenceArrays in a single loop, comparing indexes to make
> sure we don't fall off the end. Eventually, we get to the end of one array, which is
> caught here:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     break;
>   }
> ...
> The problem is that the metrics may not be correct when the break occurs. Specifically, for
> the other array, the one that we *didn't* fall off the end of, the metrics don't reflect the
> preferences we have not yet visited. In the example above, if yPrefIndex < yLength when the
> break occurs, then sumY2 is too low. One fix is to do something like this:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     sumY2 += squareSumRest(yPrefs,yPrefIndex);
>     break;
>   }
> ...
> private double squareSumRest(Preference[] preferences, int startingFrom) {
>   double squareSum = 0;
>   for(int i = startingFrom; i < preferences.length; i++){
>     double val = preferences[i].getValue();
>     squareSum += val*val;
>   }
>   return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably sumXYdiff2),
> but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?
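
The early-break behaviour described in the report can be reproduced in isolation. The following is a minimal sketch, not the AbstractSimilarity code itself: it assumes a simplified merge-style loop in which every visited preference contributes to its own sum of squares, and it uses plain parallel arrays in place of PreferenceArray. The class name TailSumSketch and the addTail flag are hypothetical and exist only for illustration; squareSumRest follows the helper proposed above.

```java
// Hypothetical, self-contained illustration; not taken from Mahout.
class TailSumSketch {

    // Sum of squares of values[startingFrom..end), mirroring the helper proposed in the report.
    static double squareSumRest(double[] values, int startingFrom) {
        double squareSum = 0.0;
        for (int i = startingFrom; i < values.length; i++) {
            squareSum += values[i] * values[i];
        }
        return squareSum;
    }

    // Walks two ID-sorted preference lists in one loop and returns {sumX2, sumY2}.
    // The loop stops as soon as either list is exhausted. With addTail == false the
    // unvisited remainder of the longer list is ignored; with addTail == true it is added.
    static double[] squareSums(long[] xIds, double[] xVals,
                               long[] yIds, double[] yVals,
                               boolean addTail) {
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        int xPrefIndex = 0;
        int yPrefIndex = 0;
        while (xPrefIndex < xIds.length && yPrefIndex < yIds.length) {
            int compare = Long.compare(xIds[xPrefIndex], yIds[yPrefIndex]);
            if (compare <= 0) {
                sumX2 += xVals[xPrefIndex] * xVals[xPrefIndex];
            }
            if (compare >= 0) {
                sumY2 += yVals[yPrefIndex] * yVals[yPrefIndex];
            }
            // Advance whichever side(s) were just consumed.
            if (compare <= 0) {
                xPrefIndex++;
            }
            if (compare >= 0) {
                yPrefIndex++;
            }
        }
        if (addTail) {
            sumX2 += squareSumRest(xVals, xPrefIndex);
            sumY2 += squareSumRest(yVals, yPrefIndex);
        }
        return new double[] {sumX2, sumY2};
    }

    public static void main(String[] args) {
        long[] xIds = {1, 2};
        double[] xVals = {1.0, 2.0};
        long[] yIds = {1, 2, 3, 4};
        double[] yVals = {1.0, 2.0, 3.0, 4.0};
        double[] truncated = squareSums(xIds, xVals, yIds, yVals, false);
        double[] complete = squareSums(xIds, xVals, yIds, yVals, true);
        System.out.println("sumY2 without tail: " + truncated[1]); // 5.0
        System.out.println("sumY2 with tail:    " + complete[1]);  // 30.0
    }
}
```

In this example x runs out after two items, so without the tail step sumY2 stops at 1 + 4 = 5, while the full sum of squares for y is 1 + 4 + 9 + 16 = 30. Whether the tail should be counted at all depends on how the similarity is meant to treat items that only one side has rated, which is the question the comment above leaves open.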

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

