mahout-dev mailing list archives

From "lariven (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct
Date Sat, 13 Jun 2015 11:47:01 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584549#comment-14584549 ]

lariven edited comment on MAHOUT-1739 at 6/13/15 11:46 AM:
-----------------------------------------------------------

The project's existing unit test can be used to reproduce the bug:
mvn test -Dtest=org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJobTest

How to reproduce the bug:
 Step 1: at line 210, add two records to the test data:

    writeLines(inputFile,
        "1,1,1",
        "1,4,1", // added
        "2,4,1", // added
        "1,3,1",
        "2,2,1",
        "2,3,1",
        "3,1,1",
        "3,2,1",
        "4,1,1",
        "4,2,1",
        "4,3,1",
        "5,2,1",
        "6,1,1",
        "6,2,1");

 Step 2: at line 231, change maxSimilaritiesPerItem from 1 to 2:
        TanimotoCoefficientSimilarity.class.getName(), "--maxSimilaritiesPerItem", "2" });

Expected output:
1       2       0.5
1       3       0.4
2       1       0.5
2       3       0.3333333333333333
3       1       0.4
3       4       0.6666666666666666
4       1       0.2
4       3       0.6666666666666666
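The similarity values listed above can be checked by hand. The following is a minimal standalone sketch, not Mahout code: it assumes each test record is userID,itemID,preference and computes the Tanimoto coefficient as |A ∩ B| / |A ∪ B| over the sets of users who rated each item. The class name TanimotoSketch and its helper methods are made up for illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class TanimotoSketch {
    // {userID, itemID} pairs from the modified test data (all preference values are 1).
    static final int[][] PREFS = {
        {1, 1}, {1, 4}, {2, 4}, {1, 3}, {2, 2}, {2, 3}, {3, 1},
        {3, 2}, {4, 1}, {4, 2}, {4, 3}, {5, 2}, {6, 1}, {6, 2}
    };

    // Groups the users by the item they rated.
    static Map<Integer, Set<Integer>> usersByItem() {
        Map<Integer, Set<Integer>> users = new TreeMap<>();
        for (int[] p : PREFS) {
            users.computeIfAbsent(p[1], k -> new HashSet<>()).add(p[0]);
        }
        return users;
    }

    // Tanimoto coefficient: size of the intersection divided by size of the union.
    static double tanimoto(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return union == 0 ? 0.0 : (double) inter.size() / union;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> users = usersByItem();
        // Print every unordered item pair once, with its similarity.
        for (int i : users.keySet()) {
            for (int j : users.keySet()) {
                if (i < j) {
                    System.out.println(i + "\t" + j + "\t" + tanimoto(users.get(i), users.get(j)));
                }
            }
        }
    }
}
```

Under these assumptions, taking the two highest-scoring neighbours of each item gives the eight lines of the expected output.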


Actual output:
1       2       0.5
1       3       0.4
1       4       0.2
2       3       0.3333333333333333
3       4       0.6666666666666666


Why:

The code swaps itemID with otherItemID so that the smaller ID is always written first. Because of this, a target item may lose some of its similar items, and those similar items get appended to other target items instead.
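The effect of the swap can be illustrated outside Hadoop. The sketch below is not Mahout code: it starts from the top-2 list per target item (taken from the expected output above) and emits keys either with the smaller ID first or in their original direction. It assumes that identical keys collapse into a single output line, which is modelled here with a set. The class name SwapEffectSketch and its methods are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

public class SwapEffectSketch {
    // {targetItem, similarItem}: the top-2 neighbours of each item,
    // copied from the expected output.
    static final long[][] TOP_K = {
        {1, 2}, {1, 3}, {2, 1}, {2, 3}, {3, 1}, {3, 4}, {4, 1}, {4, 3}
    };

    // Mimics the current behaviour: the smaller ID is always written first.
    static Set<String> withSwap() {
        Set<String> out = new TreeSet<>();
        for (long[] p : TOP_K) {
            long first = Math.min(p[0], p[1]);
            long second = Math.max(p[0], p[1]);
            out.add(first + "\t" + second); // duplicates collapse (assumption)
        }
        return out;
    }

    // Mimics the proposed single-line write: keep target item first.
    static Set<String> withoutSwap() {
        Set<String> out = new TreeSet<>();
        for (long[] p : TOP_K) {
            out.add(p[0] + "\t" + p[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("with swap:");
        withSwap().forEach(System.out::println);
        System.out.println("without swap:");
        withoutSwap().forEach(System.out::println);
    }
}
```

With the swap, the eight directed pairs shrink to five keys, and item 1 ends up with three neighbours (2, 3 and 4) even though the limit is 2, while item 4 has none of its own.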


> maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-1739
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1739
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.10.0
>            Reporter: lariven
>              Labels: easyfix, patch
>         Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch
>
>
> The similar items that ItemSimilarityJob outputs for each target item may exceed the number set by the maxSimilarItemsPerItem parameter. The following code in ItemSimilarityJob.java, around line 200, may be the cause:
>         if (itemID < otherItemID) {
>           ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
>         } else {
>           ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
>         }
> I don't know why itemID needs to be swapped with otherItemID, but I think a single line is enough:
>           ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
