spark-issues mailing list archives

From "Raj Tiwari (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-12781) MLlib FPGrowth does not scale to large numbers of frequent items
Date Tue, 12 Jan 2016 21:45:39 GMT

    [ https://issues.apache.org/jira/browse/SPARK-12781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095007#comment-15095007 ]

Raj Tiwari edited comment on SPARK-12781 at 1/12/16 9:45 PM:
-------------------------------------------------------------

Hi Sean. Thanks for the comment. I have opened this as an improvement, not a problem. I am
wondering whether there is a way to improve the FPGrowth implementation so that it does not
have to store frequent items in memory. Is that sort of discussion best had on the mailing
list, or linked to a JIRA issue such as this one?


was (Author: rituraj_tiwari):
Hi Sean. Thanks for the comment. I have opened this as an improvement, not a problem. Wondering
if there is a way to improve FPGrowth implementation so it does not have to store frequent
items in memory.

> MLlib FPGrowth does not scale to large numbers of frequent items
> ----------------------------------------------------------------
>
>                 Key: SPARK-12781
>                 URL: https://issues.apache.org/jira/browse/SPARK-12781
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Raj Tiwari
>
> See some background discussion here: [http://stackoverflow.com/questions/34690682/spark-mlib-fpgrowth-job-fails-with-memory-error/]
> The FPGrowth model's {{run()}} method seems to do the following:
> # Count items
> # Generate frequent items
> # Generate frequent item sets
> The model is trained based on the outcome of the above. When generating frequent items,
> the code does the following:
> data.flatMap { t =>
>       val uniq = t.toSet
>       if (t.size != uniq.size) {
>         throw new SparkException(s"Items in a transaction must be unique but got ${t.toSeq}.")
>       }
>       t
>     }.map(v => (v, 1L))
>       .reduceByKey(partitioner, _ + _)
>       .filter(_._2 >= minCount)
>       .collect()
>       .sortBy(-_._2)
>       .map(_._1)
> The {{collect()}} call in the snippet above is causing my executors to blow past any
> amount of memory I can give them. Is there a way to write {{genFreqItems()}} and
> {{genFreqItemsets()}} so they won't try to collect all frequent items in memory?
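
For illustration, the counting-and-filtering pipeline quoted above can be sketched with
plain Scala collections in place of RDDs. This is a local, hypothetical rewrite (the names
{{FreqItemsSketch}} and {{genFreqItems}} here are illustrative, not Spark's actual internals),
and it still materializes all frequent items in one place, which is exactly the behavior the
issue says does not scale:

```scala
// Local sketch of the frequent-item counting that FPGrowth's run() performs.
// groupBy + map stand in for the distributed reduceByKey; everything ends up
// in driver memory here, mirroring the effect of the collect() in question.
object FreqItemsSketch {
  def genFreqItems[T](transactions: Seq[Seq[T]], minCount: Long): Seq[T] = {
    transactions.flatMap { t =>
      val uniq = t.toSet
      // Same uniqueness check as the quoted snippet
      require(t.size == uniq.size, s"Items in a transaction must be unique but got $t.")
      t
    }.groupBy(identity)                                      // item -> occurrences
      .map { case (item, occ) => (item, occ.size.toLong) }   // item -> count
      .filter { case (_, count) => count >= minCount }       // keep frequent items
      .toSeq
      .sortBy { case (_, count) => -count }                  // most frequent first
      .map { case (item, _) => item }
  }

  def main(args: Array[String]): Unit = {
    val txns = Seq(Seq("a", "b", "c"), Seq("a", "b"), Seq("a", "d"))
    println(genFreqItems(txns, minCount = 2L)) // prints List(a, b)
  }
}
```

Note that the memory cost is proportional to the number of distinct frequent items, not the
number of transactions, which is why a low {{minCount}} over high-cardinality data blows up.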



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

