From: Dawid Weiss
Date: Fri, 2 Apr 2010 16:06:55 +0200
Subject: Re: [collections] and what about 'identity'?
To: mahout-dev@lucene.apache.org

> What's the use case for needing to vary the hash function? It's one of
> those things where I assume there are incorrect ways to do it, and
> correct ways, and among the correct ways fairly clear arguments about
> which function will be better -- i.e. the object should provide the
> best function.

Unfortunately this is not true -- just recently I've hit a use case
where the keys stored were Long values and their distribution had a
very low variance in the lower bits. HPPC implemented open hashing
using arrays of size 2^n, and the slot index was the hash masked with
a bitmask (i.e. only its lowest n bits were used)... this caused
really, really long conflict chains for values that were actually very
different.

I looked at how JDK's HashMap solves this problem -- they do a simple
rehashing scheme internally (so it's the object's hash and then a
remixing hash in a cascade). I've finally decided to allow external
hash functions AND changed the _default_ hash function used for
"remixing" to be murmur hash. Performance benchmarks show this yields
virtually no degradation in execution time (the CPUs seem to spend
most of their time waiting on cache misses anyway, so internal
rehashing is not an issue).

I must also apologize for a bit of inactivity with HPPC...
Like I said, we have released it internally on our "labs" Web site
here:

http://labs.carrotsearch.com/hppc.html

That doesn't mean we are turning our backs on contributing HPPC to
Mahout -- on the contrary, we would love to do it. But contrary to what
I originally thought (to push HPPC to Mahout as soon as possible) I
kind of grew reluctant because so many things are missing
(equals/hashCode, Java Collections adapters) or can be improved
(documentation, faster iterators).

So... I'm still going to experiment with HPPC in our labs, especially
API-wise, release one or two versions in between and then kindly ask
you to peek at the final (?) result and consider moving the code under
the Mahout umbrella. Sounds good?

Dawid
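
To make the low-bits problem above concrete, here is a minimal sketch (not HPPC's actual code). It assumes a table of size 2^n indexed by `hash & mask`, and uses the MurmurHash3 64-bit finalizer as a stand-in for the "remixing" step; the class name, method names and sample keys are all hypothetical, and the murmur variant HPPC really uses may differ.

```java
import java.util.HashSet;
import java.util.Set;

class RemixDemo {
    /** MurmurHash3 64-bit finalizer (fmix64); an assumed stand-in for the remixing step. */
    public static long mix(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }

    /** Counts how many distinct slots the keys occupy in a table of power-of-two size. */
    public static int distinctSlots(long[] keys, int tableSize, boolean remix) {
        int mask = tableSize - 1; // valid only because tableSize is a power of two
        Set<Integer> slots = new HashSet<>();
        for (long key : keys) {
            long h = remix ? mix(key) : key;
            slots.add((int) (h & mask)); // keep only the lowest n bits
        }
        return slots.size();
    }

    /** Hypothetical keys that are all different but share identical low 16 bits. */
    public static long[] sampleKeys(int n) {
        long[] keys = new long[n];
        for (int i = 0; i < n; i++) {
            keys[i] = ((long) i) << 16;
        }
        return keys;
    }

    public static void main(String[] args) {
        long[] keys = sampleKeys(1000);
        // Without remixing, every key lands in slot 0, so this prints 1.
        System.out.println("plain mask: " + distinctSlots(keys, 1024, false));
        // With remixing, the keys spread over many slots.
        System.out.println("remixed:    " + distinctSlots(keys, 1024, true));
    }
}
```

The point of remixing inside the container is that the caller's hash function no longer has to be well distributed in its lowest bits; allowing an external hash function on top of that covers cases where the default mix is still a poor fit.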