mahout-dev mailing list archives

From "Hudson (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-1883) Create a type of IndexedDataset that filters unneeded data for CCO
Date Tue, 11 Oct 2016 16:08:20 GMT


Hudson commented on MAHOUT-1883:

SUCCESS: Integrated in Jenkins build Mahout-Quality #3398
MAHOUT-1883 closes no PR, adds dataset filtering for minimal needed to (pat: rev 1f5e36f249aabc68495ec15f64f5ed6754d9f1e2)
* (edit) mr/pom.xml
* (edit) distribution/pom.xml
* (edit) spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
* (edit) hdfs/pom.xml
* (edit) flink/pom.xml
* (edit) math/pom.xml
* (edit) examples/pom.xml
* (edit) h2o/pom.xml
* (edit) spark/pom.xml
* (edit) pom.xml
* (edit) spark-shell/pom.xml
* (edit) spark/src/main/scala/org/apache/mahout/sparkbindings/indexeddataset/IndexedDatasetSpark.scala
* (edit) buildtools/pom.xml
* (edit) math-scala/pom.xml
* (edit) integration/pom.xml

> Create a type of IndexedDataset that filters unneeded data for CCO
> ------------------------------------------------------------------
>                 Key: MAHOUT-1883
>                 URL:
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.13.0
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>             Fix For: 0.13.0
> The collaborative filtering CCO algorithm uses DRMs for each "indicator" type. The inputs must
all have the same set of user-ids, so the row rank of all input matrices must be the same.
> In the past we have padded the row-id dictionary to include new rows that appear only in secondary
matrices. This can lead to very large amounts of data being processed in the CCO pipeline that do
not affect the results. Put another way, if a row doesn't exist in the primary matrix, there
will be no cross-occurrence for it in the other calculated cooccurrence matrix.
> If we are calculating P'P and P'S, S does not need rows that don't exist in P, so this
Jira is to create an IndexedDataset companion object that takes an RDD[(String, String)] of
interactions but uses the dictionary from P for row-ids and filters out all data that
doesn't correspond to P. The companion object will create the row-id dictionary if it is
not passed in, and use it to filter if it is passed in.
> We have seen data that can be reduced by many orders of magnitude using this technique.
This could be handled outside of Mahout, but it always produces better performance, so this
version of data prep seems worth including.
> It does not affect the CLI version yet but could be included there in a future Jira.
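
The create-or-filter behavior described above can be sketched roughly as follows. This is a minimal illustration only, not the Mahout implementation: plain Python lists of (user-id, item-id) pairs stand in for RDD[(String, String)], and the function names (build_row_dictionary, indexed_dataset, cross_occurrence) and the sample data are all hypothetical.

```python
# Hypothetical sketch: plain Python lists stand in for RDD[(String, String)].
# None of these names come from Mahout.

def build_row_dictionary(interactions):
    """Assign a dense row index to each distinct user-id, in first-seen order."""
    row_ids = {}
    for user_id, _item_id in interactions:
        if user_id not in row_ids:
            row_ids[user_id] = len(row_ids)
    return row_ids


def indexed_dataset(interactions, existing_row_ids=None):
    """Return (rows, row_ids, col_ids) for a list of (user-id, item-id) pairs.

    If existing_row_ids is None, a new row dictionary is created.
    If it is passed in, it is reused and any interaction whose user-id
    is not in it is dropped.
    """
    if existing_row_ids is None:
        row_ids = build_row_dictionary(interactions)
        kept = list(interactions)
    else:
        row_ids = existing_row_ids
        kept = [(u, i) for (u, i) in interactions if u in row_ids]

    # Column dictionary is built only from the interactions that survive.
    col_ids = {}
    for _u, item_id in kept:
        col_ids.setdefault(item_id, len(col_ids))

    # Sparse rows: row index -> set of column indices.
    rows = {}
    for user_id, item_id in kept:
        rows.setdefault(row_ids[user_id], set()).add(col_ids[item_id])
    return rows, row_ids, col_ids


def cross_occurrence(p_rows, s_rows):
    """Raw counts for P'S: for each (P column, S column) pair, the number
    of rows in which both columns are present."""
    counts = {}
    for r, p_cols in p_rows.items():
        for pc in p_cols:
            for sc in s_rows.get(r, ()):
                counts[(pc, sc)] = counts.get((pc, sc), 0) + 1
    return counts


# Made-up sample data: u3 and u4 appear only in the secondary interactions.
primary = [("u1", "iphone"), ("u2", "ipad"), ("u1", "ipad")]
secondary = [("u1", "view-a"), ("u3", "view-b"), ("u2", "view-a"), ("u4", "view-c")]

p_rows, row_ids, p_cols = indexed_dataset(primary)
s_rows, _, s_cols = indexed_dataset(secondary, existing_row_ids=row_ids)
ps_counts = cross_occurrence(p_rows, s_rows)
```

In this sketch the rows for u3 and u4 never enter S, so both matrices share the row dictionary built from P and nothing has to be padded. Only rows present in P are visited when counting P'S, which is why dropping the other rows from S leaves the counts unchanged.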

This message was sent by Atlassian JIRA
