Return-Path: X-Original-To: apmail-mahout-dev-archive@www.apache.org Delivered-To: apmail-mahout-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id BB04B10A91 for ; Thu, 6 Jun 2013 01:14:33 +0000 (UTC) Received: (qmail 85868 invoked by uid 500); 6 Jun 2013 01:14:33 -0000 Delivered-To: apmail-mahout-dev-archive@mahout.apache.org Received: (qmail 85779 invoked by uid 500); 6 Jun 2013 01:14:33 -0000 Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@mahout.apache.org Delivered-To: mailing list dev@mahout.apache.org Received: (qmail 85771 invoked by uid 99); 6 Jun 2013 01:14:32 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 06 Jun 2013 01:14:32 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of robin.anil@gmail.com designates 209.85.128.174 as permitted sender) Received: from [209.85.128.174] (HELO mail-ve0-f174.google.com) (209.85.128.174) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 06 Jun 2013 01:14:28 +0000 Received: by mail-ve0-f174.google.com with SMTP id oz10so1761925veb.5 for ; Wed, 05 Jun 2013 18:14:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; bh=vB3c6f92W8AI7CLJHO/9nwylFhvlpneFziAuStgF8Lc=; b=F2p7T28C85HPWXm2Tw5LCTwmKUuBmTOBuQm/8WOlpHYo2qr+jXBstxJp4nWyzMYsqR 7Lv3bVQvPLdlbzOEHTzC84l9A1Hv25hlHkHcXaIkilJItlk6KOXJBs8v+WU1SZXdHbCo wJPd79fFISLsF6fyBo44XJ5T6o03CAmsBnLRht0tz66uLsKdGWXeIv7Po19S4NrVHML1 raOw6RVzRU4veQiky7bVhalPjPnIykys5yttN5DRH1I7igAuQ3Fixsmn8/mF0yBeh+0v i256SsZHBdhNV7yCwXEvOXLSjGAz/wxknf/n4re1O2+Tmg6sU7hJZHl0qVnMf4/wayHa f3OA== X-Received: by 10.59.2.199 with SMTP id bq7mr21117237ved.51.1370481247759; Wed, 05 Jun 2013 18:14:07 -0700 (PDT) MIME-Version: 1.0 Received: by 10.220.170.195 with HTTP; Wed, 5 Jun 2013 18:13:47 -0700 (PDT) In-Reply-To: References: <11CFCC11-4D0F-48B3-A29B-1BD7049F11EA@di.unimi.it> From: Robin Anil Date: Thu, 6 Jun 2013 03:13:47 +0200 Message-ID: Subject: Re: Performance of primitive collections To: Mahout-Dev Content-Type: multipart/alternative; boundary=047d7bb04ce6f1ec2504de720757 X-Virus-Checked: Checked by ClamAV on apache.org --047d7bb04ce6f1ec2504de720757 Content-Type: text/plain; charset=UTF-8 Dawid, could you explain how this is accessing the mahout trunk code? Is that part of the package step? Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Wed, Jun 5, 2013 at 8:20 PM, Sean Owen wrote: > On Wed, Jun 5, 2013 at 7:00 PM, Dawid Weiss > wrote: > > 1) if you know your keys are 0...N-1 (dense) then there is typically no > > need to use a map; you'll do just as well with a regular lookup array :) > > Yeah of course... here I think there is no guarantee of this although > in many practical use cases the keys are drawn from a small range. For > example 1000 user IDs are often around the range 1 - 1000 or > something. > > > > Having written the above it made me wonder why Mahout's performance does > > not drop significantly on those malicious upper-bit changing benchmarks, > so > > I checked the code. The thing is Mahout (and I think Colt, originally) > uses > > double hashing with prime-sized tables (not 2^n-1 masking) -- this is > nice > > and indeed avoids the problems of clustering mentioned in pt. 2 above. > > Yes I thought this was why you could largely get away with it. If the > table is big enough, collisions and chains should not be a big deal to > begin with. > > Since you have this benchmark readily available -- what happens if you > un-comment one or both of the lines of code that mixes the hash? That > would really answer it. > > i'm curious as I have certainly made this assumption independently in > the custom hashtable implementations I made long ago (which also use > double hashing) and still use in the project. If it's not optimal > that's important info. > --047d7bb04ce6f1ec2504de720757--