From: Grant Ingersoll
To: mahout-dev@lucene.apache.org
Subject: Re: issues found
Date: Mon, 31 Aug 2009 06:20:18 -0400
Message-Id: <1C84B383-445D-4509-A593-F9B5A3D16B55@apache.org>

On Aug 31, 2009, at 2:55 AM, Ted Dunning wrote:

> I just did an exercise of implementing a faster sparse vector.
> In the process, I uncovered a bunch of warts. I will be filing some Jiras
> and patches as soon as I can get to them, but here is a preview:
>
> a) most serialization code for vectors would write vectors with null names
> out as if they had names of "". This causes grief in tests and seems wrong.
>
> b) the Vector/SparseVector hierarchy was oddly split out. I added a
> HashVector and moved the current SparseVector into IntDoubleMappingVector
> with SparseVector remaining as an abstract class. This unfortunately caused
> lots of upheaval even up into the Vector class. I have yet to sort this out
> cleanly.
>
> c) the squared distance functions were defined in multiple places. I
> centralized these into SquaredEuclideanDistance as a static function. These
> definitions also were wrong and would ignore any components that had
> opposite sign and equal magnitude. In fixing this, I went ahead and wrote
> an implementation that makes use of any sparsity present. To do that, I
> added a method to Vector so that I could tell if a vector is sparse. This
> subsumes all of the distance optimizations that we have discussed. It also
> makes it very easy for new code to use these optimizations without knowing
> about them.
>
> d) the nonZeroIterator had to be substantially refactored to work in an
> abstract class without knowing too much about the internals of everything.
>
> e) in order to make sure that all metrics worked reasonably with all types,
> I have heavily refactored the testing structure for metrics. I will be
> doing more of this as well. The goal is to test all vector types against
> all distance metrics to make sure that they give the same results as
> DenseVector. Then, I will build tests for DenseVector to verify that it
> produces the correct result. This is important because so many vectors have
> special code to help metric computation.
>
> f) I have uncovered some strangeness in the ARFF I/O code that I introduced
> with the SparseVector abstraction. The old code will work as it did, but it
> won't understand the new HashVector.
>
>
> Soo....
>
> I will be trying to break this down into as small pieces as I can, but the
> total will be a bunch of interdependent patches. If anybody can help me
> apply these as quickly as possible, we should have minimal problems with it
> all. If it drags out, it will get hard to keep rebasing the patch sequence
> all the time.
>

These all sound reasonable. If you're happy w/ the changes, just put up a
big patch on an issue, give it a few days to percolate and then commit.

>
>
> --
> Ted Dunning, CTO
> DeepDyve

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search
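
Item (c) in the quoted message describes a squared Euclidean distance that uses sparsity and still counts components of opposite sign and equal magnitude. Below is a minimal sketch of that idea, assuming sparse vectors are modeled as plain index-to-value maps. The class and method names are hypothetical; this is neither Mahout's Vector API nor the patch discussed in the thread.

```java
import java.util.Map;

/** Illustrative only: sparse vectors as index -> value maps (an assumption, not Mahout's Vector). */
final class SparseSquaredDistance {

    private SparseSquaredDistance() {}

    /** Sums (a_i - b_i)^2 over every index that has an entry in either vector. */
    static double distanceSquared(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        // Indices stored in a: b contributes 0 wherever it has no entry.
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            double diff = e.getValue() - b.getOrDefault(e.getKey(), 0.0);
            sum += diff * diff;
        }
        // Indices stored only in b: a contributes 0 there.
        for (Map.Entry<Integer, Double> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                double v = e.getValue();
                sum += v * v;
            }
        }
        return sum;
    }
}
```

The work is proportional to the number of stored entries rather than the vector's nominal dimension. A component pair such as 1.0 and -1.0 contributes (1.0 - (-1.0))^2 = 4.0 to the sum, which is the case the message says the old definitions ignored.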