mahout-dev mailing list archives

From "Robin Anil (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-625) Some of generated patterns have support higher than in reality
Date Sun, 13 Mar 2011 18:18:59 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006239#comment-13006239 ]

Robin Anil commented on MAHOUT-625:
-----------------------------------

Right, I was testing Vipul's dataset (MAHOUT-617) and was seeing the same issue. Did the header
table still have the node even after alpha pruning?

bq. I also noticed that the FPGrowth implementation can be optimized by not calculating patterns
ending with a given attribute multiple times. Depending on how many features patterns are
generated for, the speedup can be huge: the more features included, the greater the speedup. For the
mentioned test data, if all features were selected (i.e. we want to generate patterns for all items
in the transactions), pattern generation time dropped from 1h 15min to 8sec

This might be useful for a single node. For PFPGrowth, this used to create issues with exact
pattern counts. There is a lot of code here (:thumbs up:) for me to verify. Some
issues:

1) The dataset needs to have a signed agreement before it can be included in the Mahout codebase (see
the website). Can you add another test to reproduce the test case? See MAHOUT-617.
2) Again, for the comparison code, use a different dataset.
3) Can you split the optimization out of this into another patch? I want to test more before
checking it in.
4) The bug fix of setting support = 0 may not save the extra memory such nodes take. It's good for
now, until a permanent solution is found.



> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
>                 Key: MAHOUT-625
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4
>            Reporter: Jaroslaw Odzga
>            Priority: Critical
>         Attachments: MAHOUT-625-patch.txt, mahout-test.zip
>
>
> It turns out that some of the generated patterns have incorrect support. The returned support
is slightly higher than the true one.
> I attached a test which shows that FPGrowth has a bug. The test uses data (retail)
found here: http://fimi.ua.ac.be/data/
> The pattern (36, 39, 41) occurs in the transactions 572 times (this is also calculated
in the test), but FPGrowth returns pattern (36, 39, 41) with support 573.
> Please note that the mentioned pattern is not the only one with incorrect support - the test
only points out one example to have something to focus on. There are plenty more patterns with
support higher than the real one. The biggest difference I noticed was a support 8 higher than
the real one for one of the patterns.
> Please find attached the failing unit test - it's actually a Maven project, which contains
test data and is ready to run.
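
For anyone who wants to cross-check a reported support independently of FPGrowth, a brute-force count over the transactions is enough. The following is only a minimal sketch and is not taken from the attached test or from Mahout: the class name SupportCounter, the method support, and the four sample transactions are all made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Mahout or of the attached patch.
public class SupportCounter {

    /** Counts the transactions that contain every item of the pattern. */
    static long support(List<int[]> transactions, int... pattern) {
        long count = 0;
        for (int[] transaction : transactions) {
            // Put the items of this transaction into a set for fast lookup.
            Set<Integer> items = new HashSet<>();
            for (int item : transaction) {
                items.add(item);
            }
            // The transaction supports the pattern only if no item is missing.
            boolean containsAll = true;
            for (int item : pattern) {
                if (!items.contains(item)) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample data; the real test reads the retail dataset instead.
        List<int[]> transactions = Arrays.asList(
            new int[] {36, 39, 41, 48},
            new int[] {36, 39},
            new int[] {39, 41, 36},
            new int[] {41, 48});
        System.out.println(support(transactions, 36, 39, 41));
    }
}
```

Run against the retail data, a count like this is the kind of reference value that the support returned by FPGrowth can be compared to, one pattern at a time.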

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
