Date: Fri, 2 Dec 2011 12:47:40 +0000 (UTC)
From: "Manuel Blechschmidt (Commented) (JIRA)"
To: dev@mahout.apache.org
Message-ID: <412302201.34697.1322830060022.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1470127633.34505.1322825260521.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAHOUT-906) Allow collaborative filtering evaluators to use custom logic in splitting data set

[
https://issues.apache.org/jira/browse/MAHOUT-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161594#comment-13161594 ]

Manuel Blechschmidt commented on MAHOUT-906:
--------------------------------------------

Actually, it would be a good idea to implement time-based splitting. Normally we want a recommender to predict ratings for items that we are going to like in the future, and this should be the basis for evaluating the recommendations. In an e-commerce scenario you want the recommender to predict the item that you are going to buy next. Therefore you have to hide the newest items.

The University of Hildesheim (Steffen Rendle, Christoph Freudenthaler, Lars Schmidt-Thieme) wrote a paper in 2010 in which they combine matrix factorization with Markov chains and are able to outperform a standard recommender:
http://www.ismll.uni-hildesheim.de/pub/pdfs/RendleFreudenthaler2010-FPMC.pdf

> Allow collaborative filtering evaluators to use custom logic in splitting data set
> ----------------------------------------------------------------------------------
>
>                 Key: MAHOUT-906
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-906
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Anatoliy Kats
>            Priority: Minor
>              Labels: features
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> I want to start a discussion about factoring out the logic used in splitting the data set into training and testing. Here is how things stand: there are two independent evaluator classes. AbstractDifferenceRecommenderEvaluator splits all the preferences randomly into a training and a testing set. GenericRecommenderIRStatsEvaluator takes one user at a time, removes their top AT preferences, and counts how many of them the system recommends back.
> I have two use cases that both deal with temporal dynamics.
> In one case, there may be expired items that can be used for building a training model, but not a test model. In the other, I may want to simulate the behavior of a real system by building a preference matrix on days 1 to k and testing on the ratings the user generated on day k+1. In this case, it is not items but preferences (user, item, rating triplets) that may belong only to the training set. Before we discuss an appropriate design, are there any other use cases we need to keep in mind?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
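
For illustration only, the splitting strategies discussed in this thread could be sketched roughly as follows. This is not Mahout code and does not use any Mahout API: the Preference record, its day field, and the functions split_by_day and hold_out_newest are hypothetical names invented for this sketch, and it assumes every preference carries a timestamp expressed in whole days.

```python
from collections import defaultdict
from typing import NamedTuple


class Preference(NamedTuple):
    """Hypothetical (user, item, rating) triplet plus an assumed day stamp."""
    user: str
    item: str
    rating: float
    day: int


def split_by_day(prefs, k, expired_items=frozenset()):
    """Train on days 1..k and test on day k+1.

    Preferences for expired items may appear in the training set
    but are never placed in the test set.
    """
    train = [p for p in prefs if p.day <= k]
    test = [p for p in prefs
            if p.day == k + 1 and p.item not in expired_items]
    return train, test


def hold_out_newest(prefs, n=1):
    """For each user, hide the n most recent preferences as test data."""
    by_user = defaultdict(list)
    for p in prefs:
        by_user[p.user].append(p)

    train, test = [], []
    for user_prefs in by_user.values():
        # Stable sort: preferences from the same day keep their input order.
        user_prefs.sort(key=lambda p: p.day)
        cut = max(len(user_prefs) - n, 0)
        train.extend(user_prefs[:cut])
        test.extend(user_prefs[cut:])
    return train, test
```

Both functions return a (train, test) pair of lists, so an evaluator could in principle accept either one as a pluggable splitting strategy. How that would map onto the existing evaluator classes is exactly the design question the issue raises and is left open here.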