mahout-dev mailing list archives

From "Pat Ferrel (JIRA)" <>
Subject [jira] [Created] (MAHOUT-1883) Create a type of IndexedDataset that filters unneeded data for CCO
Date Sat, 01 Oct 2016 21:23:20 GMT
Pat Ferrel created MAHOUT-1883:

             Summary: Create a type of IndexedDataset that filters unneeded data for CCO
                 Key: MAHOUT-1883
             Project: Mahout
          Issue Type: Bug
          Components: Collaborative Filtering
    Affects Versions: 0.13.0
            Reporter: Pat Ferrel
            Assignee: Pat Ferrel
             Fix For: 0.13.0

The collaborative filtering CCO algorithm uses a DRM for each "indicator" type. All inputs must
share the same set of user-ids, so the row rank (number of rows) of all input matrices must be the same.

In the past we have padded the row-id dictionary to include new rows that appear only in secondary matrices.
This can lead to very large amounts of data being processed in the CCO pipeline that do not affect
the results. Put another way, if a row doesn't exist in the primary matrix, there will be
no cross-occurrence for it in the other calculated cooccurrence matrices.
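
To spell out why such rows cannot matter, here is a short derivation. It assumes P denotes the primary
interaction matrix and S a secondary one, both with one row per user u, and that a user absent from the
primary data is represented by an all-zero row in P. The index names i, j and the set U_P are introduced
here only for illustration.

```latex
(P^{\top} S)_{ij} \;=\; \sum_{u} P_{ui}\, S_{uj}
\;=\; \sum_{u \in U_P} P_{ui}\, S_{uj}
\;+\; \sum_{u \notin U_P} \underbrace{P_{ui}}_{=\,0}\, S_{uj}
\;=\; \sum_{u \in U_P} P_{ui}\, S_{uj}
```

Here U_P is the set of users with at least one interaction in P. Rows of S for users outside U_P are
multiplied by zero, so dropping them leaves the cross-occurrence counts unchanged.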

If we are calculating P'P and P'S, S does not need rows that don't exist in P. This Jira
is therefore to create an IndexedDataset companion object that takes an RDD[(String, String)] of interactions,
uses the dictionary from P for row-ids, and filters out all data that doesn't correspond
to P. The companion object will create the row-id dictionary if one is not passed in, and
use it to filter if one is passed in.
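
The create-or-filter behaviour described above can be sketched roughly as follows. This is only an
illustration and not the Mahout API: plain Scala collections stand in for RDD[(String, String)] and for
the row-id dictionary, and the names FilterByPrimaryDictionary, buildRowDictionary and prepare are
hypothetical.

```scala
// Minimal sketch, assuming plain Scala collections in place of Spark RDDs
// and Mahout's dictionary type. All names here are hypothetical.
object FilterByPrimaryDictionary {

  // One interaction: (user-id, item-id)
  type Interaction = (String, String)

  /** Build a row-id dictionary (user-id -> row index) from the interactions. */
  def buildRowDictionary(interactions: Seq[Interaction]): Map[String, Int] =
    interactions.map(_._1).distinct.zipWithIndex.toMap

  /**
   * If no dictionary is passed in, create one from the data and keep everything.
   * If a dictionary is passed in, reuse it and drop every interaction whose
   * user-id is not in it.
   */
  def prepare(
      interactions: Seq[Interaction],
      existingRowIds: Option[Map[String, Int]] = None
  ): (Seq[Interaction], Map[String, Int]) =
    existingRowIds match {
      case Some(rowIds) =>
        (interactions.filter { case (user, _) => rowIds.contains(user) }, rowIds)
      case None =>
        (interactions, buildRowDictionary(interactions))
    }

  def main(args: Array[String]): Unit = {
    // Primary interactions (P): only u1 and u2 appear.
    val primary = Seq(("u1", "iphone"), ("u2", "ipad"), ("u1", "ipad"))
    // Secondary interactions (S): u3 and u4 never appear in P.
    val secondary = Seq(("u1", "case"), ("u3", "cable"), ("u2", "case"), ("u4", "dock"))

    val (p, rowIds) = prepare(primary)                // creates the dictionary
    val (s, _)      = prepare(secondary, Some(rowIds)) // filters with P's dictionary

    println(s"row-ids: $rowIds")
    println(s"P kept ${p.size} interactions, S kept ${s.size} of ${secondary.size}")
  }
}
```

Because S reuses the dictionary built from P, both datasets end up with the same row-ids, and the
interactions of users who never appear in P are dropped before any matrix is built.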

We have seen data that can be reduced by many orders of magnitude using this technique. This
could be handled outside of Mahout, but it always produces better performance, so this version
of data-prep seems worth including.

It does not affect the CLI version yet but could be included there in a future Jira.

This message was sent by Atlassian JIRA
