spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "wangfei (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-4001) Add Apriori algorithm to Spark MLlib
Date Tue, 21 Oct 2014 09:38:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178183#comment-14178183
] 

wangfei edited comment on SPARK-4001 at 10/21/14 9:38 AM:
----------------------------------------------------------

.


was (Author: scwf):


Thanks Sean Owen for explaining! Frequent itemset algorithm works by scanning the input data
set, there is no probabilistic model in nature. 

To answer Xiangrui Meng’s earlier questions:
1.	These algorithm is used for finding major patterns / association rules in a data set. For
a real use case, some analytic applications in telecom domain use them to find subscriber
behavior from the data set combining service record, network traffic record, and demographic
data. Please refer to this Chinese article for example:  http://www.ss-lw.com/wxxw-361.html
And, sometimes we use frequent itemset algorithm for preparing features input to other algorithm
which selects feature and do other ML task like training a classifier, like this paper: http://dl.acm.org/citation.cfm?id=1401922,


2.	Since Apriori is a basic algorithm for frequent itemset mining, I am not aware of any parallel
implementation for it. But I think the algorithm fits Spark’s data parallel model since
it only need to scan the input data set. And for FP-Growth, I do know there is a Parallel
FP-Growth from Haoyuan Li: http://dl.acm.org/citation.cfm?id=1454027  . I think I probably
will refer  to this paper to implement FP-Growth in Spark 

3.	The Apriori computation complexity is about O(N*k) where N is the number of item in input
data and k is the depth of the frequent item tree to search. FP-Grwoth complexity is about
O(N),  it is more efficient comparing to Apriori. For space efficiency, FP-growth is also
more efficient than Apriori. But in case of smaller data and if frequent itemset is more,
Apriori is more efficient. This is because FP-Growth need to construct a FP Tree out of the
input data set, and it needs some time. And another advantage of Apriori is that it can output
association rules while FP-Growth can not.

Although these two algorithms are basic algo (FP-Growth is more complex), I think it will
be handy if mllib can include them since there is no frequent itemset mining algo in Spark
yet, and especially in distributed environment. 
Please suggest how to handle this issue. Thanks a lot. 


> Add Apriori algorithm to Spark MLlib
> ------------------------------------
>
>                 Key: SPARK-4001
>                 URL: https://issues.apache.org/jira/browse/SPARK-4001
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Jacky Li
>            Assignee: Jacky Li
>
> Apriori is the classic algorithm for frequent item set mining in a transactional data
set.  It will be useful if Apriori algorithm is added to MLLib in Spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message