Date: Sat, 1 Oct 2016 21:24:20 +0000 (UTC)
From: "Pat Ferrel (JIRA)"
To: dev@mahout.apache.org
Reply-To: dev@mahout.apache.org
Subject: [jira] [Updated] (MAHOUT-1883) Create a type if IndexedDataset that filters unneeded data for CCO

     [ https://issues.apache.org/jira/browse/MAHOUT-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pat Ferrel updated MAHOUT-1883:
-------------------------------
    Description:
The collaborative filtering CCO algo uses drms for each "indicator" type.
The input must have the same set of user-ids, so the row rank for all input matrices must be the same.

In the past we have padded the row-id dictionary to include new rows that appear only in secondary matrices. This can lead to very large amounts of data being processed in the CCO pipeline that do not affect the results. Put another way, if a row doesn't exist in the primary matrix, there will be no cross-occurrence in the other calculated cooccurrence matrix.

If we are calculating P'P and P'S, S will not need rows that don't exist in P, so this Jira is to create an IndexedDataset companion object that takes an RDD[(String, String)] of interactions but uses the dictionary from P for row-ids and filters out all data that doesn't correspond to P. The companion object will create the row-ids dictionary if it is not passed in, and use it to filter if it is passed in.

We have seen data that can be reduced by many orders of magnitude using this technique. This could be handled outside of Mahout, but it always produces better performance, so this version of data-prep seems worth including.

It does not affect the CLI version yet but could be included there in a future Jira.

  was:
The collaborative filtering CCO algo uses drms for each "indicator" type. The input must have the same set of user-id and so the row rank for all input matrices must be the same.

In the past we have padded the row-id dictionary to include new rows only in secondary matrices. This can lead to very large amounts of data processed in the CCO pipeline that does not affect the results.
Put another way if the row doesn't exist in the primary matrix, there will be no cross-occurrence in the other calculated cooccurrences matrix

if we are calculating P'P and P'S, S will not need rows that don't exist in P so this Jira is to create an IndexedDataset companion object that takes an RDD[(String, String)] of interactions but that uses the dictionary from P for row-ids and filters out all data that doesn't correspond to P. The companion object will create the row-ids dictionary if it is not passed in, and use it to filter if it is passed in.

We have seen data that can be reduced by many orders of magnitude using this technique. This could be handled outside of Mahout but always produces better performance and so this version of data-prep seems worth including.

It does not effect the CLI version yet but could be included there in a future Jira.


> Create a type if IndexedDataset that filters unneeded data for CCO
> ------------------------------------------------------------------
>
>                 Key: MAHOUT-1883
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1883
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.13.0
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>             Fix For: 0.13.0
>
>
> The collaborative filtering CCO algo uses drms for each "indicator" type. The input must have the same set of user-ids, so the row rank for all input matrices must be the same.
> In the past we have padded the row-id dictionary to include new rows that appear only in secondary matrices. This can lead to very large amounts of data being processed in the CCO pipeline that do not affect the results. Put another way, if a row doesn't exist in the primary matrix, there will be no cross-occurrence in the other calculated cooccurrence matrix.
> If we are calculating P'P and P'S, S will not need rows that don't exist in P, so this Jira is to create an IndexedDataset companion object that takes an RDD[(String, String)] of interactions but uses the dictionary from P for row-ids and filters out all data that doesn't correspond to P. The companion object will create the row-ids dictionary if it is not passed in, and use it to filter if it is passed in.
> We have seen data that can be reduced by many orders of magnitude using this technique. This could be handled outside of Mahout, but it always produces better performance, so this version of data-prep seems worth including.
> It does not affect the CLI version yet but could be included there in a future Jira.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
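
For illustration only, the filtering described in the issue could be sketched roughly as below. This is not the Mahout implementation or its API: the names `FilteredInteractions`, `buildRowDictionary`, and `prepare` are hypothetical, a plain `Seq[(String, String)]` stands in for the `RDD[(String, String)]` so the sketch runs without Spark, and a `Map[String, Int]` stands in for the row-id dictionary.

```scala
// Hypothetical sketch of the proposed behavior; NOT Mahout's actual API.
// Assumption: plain Scala collections replace RDD[(String, String)] so this runs without Spark.
object FilteredInteractions {
  // (row-id, column-id), for example (user-id, item-id)
  type Interaction = (String, String)

  /** Assigns each distinct row-id an index in order of first appearance. */
  def buildRowDictionary(interactions: Seq[Interaction]): Map[String, Int] =
    interactions.map(_._1).distinct.zipWithIndex.toMap

  /**
   * Returns the interactions to keep together with the row dictionary to use.
   * If no dictionary is passed in, one is created and nothing is dropped.
   * If a dictionary is passed in, interactions whose row-id is missing from it are filtered out.
   */
  def prepare(
      interactions: Seq[Interaction],
      existingRowIds: Option[Map[String, Int]] = None
  ): (Seq[Interaction], Map[String, Int]) =
    existingRowIds match {
      case Some(dict) =>
        (interactions.filter { case (rowId, _) => dict.contains(rowId) }, dict)
      case None =>
        (interactions, buildRowDictionary(interactions))
    }
}
```

Under these assumptions, the primary interactions P would be passed to `prepare` with no dictionary, and the dictionary it returns would then be passed in with the secondary interactions S, so that rows of S absent from P are dropped before P'S is calculated.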