Date: Tue, 31 May 2011 22:08:04 -0700
Subject: Re: Why do userid & itemid have to be long?
From: Chris Schilling
To: user@mahout.apache.org

I have a test set of 6M preferences (500k users, 500k items). I recently
switched my infrastructure to use sequential Long ids for users and items.
Before this we were using Strings. I was able to read a map file of userIds
and itemIds into a Java HashMap, and the conversions took a negligible amount
of time. This sounds insane for only 5M prefs.

On Tue, May 31, 2011 at 9:51 PM, Mike Khristo wrote:

> Using the 0.6 snapshot + patch 705 (mongodatamodel) from jira
> (https://issues.apache.org/jira/browse/MAHOUT-705), and a test data set
> with ~300k rows like:
>
> "4cec0a2934ac9fbd2b040000","4d065d5434ac9f5227a12f00",118
>
> It's slowly doing the translations:
> INFO: [+++][MONGO-MAP] Adding Translation Item ID: 4d57d54434ac9fd3570005a2 long_value: 145367
>
> It's doing about 30,000 per hour (and getting slower). That's 8.3/sec.
> 8G ram, 4 virtual cores
>
> With a test data set of 3M preferences, that would take >5 days, just for
> the translation.
>
> Open to ideas/suggestions/"a-ha"-moments. Thanks!
>
> On Tue, May 31, 2011 at 9:15 PM, Ted Dunning wrote:
>
> > It makes the internals much cleaner to not repeat this conversion.
> >
> > But how is it that this is taking a long time? String -> lookup should
> > not be much longer than an array access, especially if you use the
> > Mahout collections or one of the dictionary types.
> >
> > On Tue, May 31, 2011 at 7:50 PM, Mike Khristo wrote:
> >
> > > Rather, how can I use string-based userid/itemid's without having to
> > > deal with the slowness associated with mapping them to a long?
> > >
> > > In the MongoDataModel, for example, significant time/overhead goes
> > > into converting the unique id's to long... I'm still getting my head
> > > wrapped around mahout, but this seems like a significant limitation.
> > > I have to assume there's some logic behind the decision to restrict
> > > them to long, but I didn't find anything about it in Mahout in Action
> > > or the list.
> > >
> > > Thanks.
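
For reference, here is a minimal sketch of the kind of in-memory mapping discussed in this thread: string ids are assigned sequential long ids through a plain Java HashMap, with a list kept for the reverse lookup. The class name StringIdDictionary and its methods are hypothetical, made up for illustration; this is not Mahout's API and not the code from the MAHOUT-705 patch. It assumes all ids fit in memory and that access is single-threaded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Mahout): maps string ids to sequential longs.
final class StringIdDictionary {
    // Forward lookup: string id -> assigned long id.
    private final Map<String, Long> ids = new HashMap<>();
    // Reverse lookup: position in the list is the long id.
    private final List<String> keys = new ArrayList<>();

    // Returns the existing long id for key, or assigns the next sequential one.
    long idFor(String key) {
        Long existing = ids.get(key);
        if (existing != null) {
            return existing;
        }
        long id = keys.size();
        ids.put(key, id);
        keys.add(key);
        return id;
    }

    // Returns the original string id for a previously assigned long id.
    String keyFor(long id) {
        return keys.get(Math.toIntExact(id));
    }

    int size() {
        return keys.size();
    }

    public static void main(String[] args) {
        // Separate dictionaries, so user and item ids are numbered independently.
        StringIdDictionary users = new StringIdDictionary();
        StringIdDictionary items = new StringIdDictionary();

        // Sample row taken from the quoted message above.
        long user = users.idFor("4cec0a2934ac9fbd2b040000");
        long item = items.idFor("4d065d5434ac9f5227a12f00");
        System.out.println(user + "," + item + ",118");
    }
}
```

Each lookup or insert is a single hash operation, so translating a few hundred thousand ids this way is in-memory work only. The translated rows can then be written out with long ids before they are loaded into a data model.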