spark-issues mailing list archives

From "Littlestar (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-6240) Spark MLlib fpm#FPGrowth genFreqItems use Array[Item] may outOfMemory for Large Sets
Date Tue, 10 Mar 2015 07:01:40 GMT
Littlestar created SPARK-6240:
---------------------------------

             Summary: Spark MLlib fpm#FPGrowth genFreqItems use Array[Item] may outOfMemory for Large Sets
                 Key: SPARK-6240
                 URL: https://issues.apache.org/jira/browse/SPARK-6240
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.3.0
            Reporter: Littlestar
            Priority: Minor


In Spark MLlib fpm#FPGrowth, genFreqItems collects all frequent items into a driver-side Array[Item], which may cause OutOfMemoryError for large item sets.

{noformat}
  private def genFreqItems[Item: ClassTag](
      data: RDD[Array[Item]],
      minCount: Long,
      partitioner: Partitioner): Array[Item] = {
    data.flatMap { t =>
      val uniq = t.toSet
      if (t.size != uniq.size) {
        throw new SparkException(s"Items in a transaction must be unique but got ${t.toSeq}.")
      }
      t
    }.map(v => (v, 1L))
      .reduceByKey(partitioner, _ + _)
      .filter(_._2 >= minCount)
      .collect()
      .sortBy(-_._2)
      .map(_._1)
  }
{noformat}
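
The risk is the collect() at the end: every item that passes the minCount filter is pulled back into a single driver-side Array[Item]. As a sketch of one possible direction only (not a tested patch; genFreqItemsDistributed is an illustrative name, not existing Spark code), the counts could stay in an RDD and be materialized by the caller only when, and as far as, needed:

{noformat}
import scala.reflect.ClassTag

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

object FreqItemsSketch {
  // Sketch: keep frequent-item counts distributed instead of collecting an
  // Array[Item] on the driver; the caller decides when (or whether) to collect.
  def genFreqItemsDistributed[Item: ClassTag](
      data: RDD[Array[Item]],
      minCount: Long,
      partitioner: Partitioner): RDD[(Item, Long)] = {
    data.flatMap(t => t)                  // items within a transaction are assumed unique
      .map(item => (item, 1L))            // one count per occurrence
      .reduceByKey(partitioner, _ + _)    // aggregate counts per item
      .filter(_._2 >= minCount)           // keep only frequent items
      .sortBy(_._2, ascending = false)    // distributed sort by descending count
  }
}
{noformat}

The later FP-Growth steps still derive an item ranking from the collected array, so this is not a drop-in replacement; it only sketches where the driver-side memory pressure could be moved.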

I tested with 10*10000*10000 records to output all co-occurring item pairs.
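
For reference, the scenario can be reproduced with synthetic data along these lines (the generator and all parameters here are illustrative, not my actual test harness):

{noformat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth

object FPGrowthScaleRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FPGrowthScaleRepro"))

    val numTransactions = 10000 * 10000   // scale up/down as needed
    val itemsPerTransaction = 10

    // Each transaction holds distinct items so genFreqItems does not reject it.
    val transactions = sc.parallelize(0 until numTransactions, 1000).map { i =>
      Array.tabulate(itemsPerTransaction)(j => s"item_${(i + j) % 10000000}")
    }

    // Very low support: roughly all ~10M distinct items pass the threshold,
    // so genFreqItems collects a very large Array[Item] on the driver.
    val model = new FPGrowth()
      .setMinSupport(1e-6)
      .setNumPartitions(100)
      .run(transactions)

    println(s"frequent itemsets: ${model.freqItemsets.count()}")
    sc.stop()
  }
}
{noformat}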



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
