Subject: Re: Why do userid & itemid have to be long?
From: Lance Norskog
To: user@mahout.apache.org
Date: Tue, 31 May 2011 22:07:11 -0700

UserID and ItemID are usually domain-level keys, not generated by the DB.
With some of the movie databases, you get tables of
"user/item/pref/time", "item/moviename/genre", and maybe
"user/geocode".

Lance

On Tue, May 31, 2011 at 9:51 PM, Mike Khristo wrote:
> Using the 0.6 snapshot + patch 705 (mongodatamodel) from jira (
> https://issues.apache.org/jira/browse/MAHOUT-705), and a test data set with
> ~300k rows like:
>
> "4cec0a2934ac9fbd2b040000","4d065d5434ac9f5227a12f00",118
>
> It's slowly doing the translations:
> INFO: [+++][MONGO-MAP] Adding Translation    Item ID:
> 4d57d54434ac9fd3570005a2 long_value: 145367
>
> It's doing about 30,000 per hour (and getting slower). That's 8.3/sec.
> 8G ram, 4 virtual cores
>
> With a test data set of 3M preferences, that would take >5 days, just for
> the translation.
>
> Open to ideas/suggestions/"a-ha"-moments. Thanks!
>
> On Tue, May 31, 2011 at 9:15 PM, Ted Dunning wrote:
>
>> It makes the internals much cleaner to not repeat this conversion.
>>
>> But how is it that this is taking a long time?  String -> lookup should not
>> be much longer than an array access, especially if you use the Mahout
>> collections or one of the dictionary types.
>>
>> On Tue, May 31, 2011 at 7:50 PM, Mike Khristo wrote:
>>
>> > Rather, how can I use string-based userid/itemid's without having to
>> > deal with the slowness associated with mapping them to a long?
>> >
>> > In the MongoDataModel, for example, significant time/overhead goes into
>> > converting the unique id's to long...  I'm still getting my head wrapped
>> > around Mahout, but this seems like a significant limitation. I have to
>> > assume there's some logic behind the decision to restrict them to long,
>> > but I didn't find anything about it in Mahout in Action or the list.
>> >
>> > Thanks.
>> >

--
Lance Norskog
goksron@gmail.com
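
For illustration, here is a minimal sketch of the kind of in-memory String -> long dictionary Ted describes. It is not the MongoDataModel code from MAHOUT-705 and not a Mahout API. The class name StringIdDictionary and its methods are hypothetical, and it assumes that all distinct keys fit in memory and that there are fewer than about two billion of them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Mahout): assigns dense long IDs to string keys. */
final class StringIdDictionary {
    // Forward map: string key -> assigned long ID.
    private final Map<String, Long> ids = new HashMap<>();
    // Reverse map: position in the list is the assigned ID.
    private final List<String> keys = new ArrayList<>();

    /** Returns the ID already assigned to key, or assigns the next free one. */
    long idFor(String key) {
        Long existing = ids.get(key);
        if (existing != null) {
            return existing;
        }
        long id = keys.size();   // IDs are handed out as 0, 1, 2, ...
        ids.put(key, id);
        keys.add(key);
        return id;
    }

    /** Reverse lookup; id must be a value previously returned by idFor. */
    String keyFor(long id) {
        return keys.get((int) id);
    }

    /** Number of distinct keys seen so far. */
    int size() {
        return keys.size();
    }

    public static void main(String[] args) {
        StringIdDictionary items = new StringIdDictionary();
        // Sample keys taken from the data row quoted in the thread.
        long first = items.idFor("4cec0a2934ac9fbd2b040000");
        long second = items.idFor("4d065d5434ac9f5227a12f00");
        System.out.println(first + " " + second + " " + items.keyFor(second));
    }
}
```

Each translation here is one hash lookup plus, for a new key, one list append, with no database round trip. The trade-off is that the mapping lives only in memory; persisting it would be a separate step that this sketch does not cover.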