Date: Mon, 15 Mar 2010 22:17:36 -0700
To: lucy-dev@lucene.apache.org
Subject: MoreLikeThisQuery
Message-ID: <20100316051735.GB27885@rectangular.com>
From: Marvin Humphrey

Greets,

Lucene has a MoreLikeThisQuery in contrib:

    http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/similar/MoreLikeThis.html

It functions by selecting a handful of high-value (i.e. rare) terms out of a
document and building up a composite ORQuery based on those.
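
To make the term-selection step concrete, here is a minimal sketch in Python. It is not Lucene's implementation: the names `select_terms` and `build_or_query`, the TF-IDF weighting, and the default limits are assumptions chosen for illustration. The token-length cutoff mirrors the "more than 5 characters" heuristic.

```python
import math
from collections import Counter


def select_terms(doc_tokens, doc_freq, num_docs, max_terms=5, min_len=6):
    """Pick the highest-scoring terms of one document (hypothetical helper).

    doc_tokens: list of tokens from the source document
    doc_freq:   dict mapping term -> number of documents containing it
    num_docs:   total number of documents in the index
    """
    # Drop short tokens, then count occurrences within the document.
    tf = Counter(t for t in doc_tokens if len(t) >= min_len)
    scored = []
    for term, count in tf.items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term is unknown to the index, so it cannot match
        # Rare terms (low doc_freq) receive a high inverse document frequency.
        idf = math.log(num_docs / df) + 1.0
        scored.append((count * idf, term))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_terms]]


def build_or_query(terms):
    """Join the selected terms into a simple OR query string (assumed syntax)."""
    return " OR ".join(terms)
```

Note that a rare surname scores highly under this kind of weighting purely because of its low document frequency, regardless of what the document is about.
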
The thing that's always bothered me about its results is that it gets thrown
off by things like proper names. Proper names are often very rare, and thus
highly discriminatory terms. They often pass all the heuristics that
MoreLikeThisQuery uses: low doc_freq() (meaning the term occurs in few
documents), long token length (more than 5 characters), etc.

The problem is that if you have e.g. two authors with the same (uncommon) last
name, but these authors write on entirely different subjects,
MoreLikeThisQuery will often conflate them.

However, there is a potential remedy available if we use clustering. Say that
the heuristics yield this collection of terms:

    economics
    capital
    interest
    investment
    addison

One of these things is not like the others. :) Meaning, if you look at all
those terms in a vector space, most of them will be clustered together, but
one will be way far away.

What I'd like to do is identify the cluster that best represents the document,
and exclude any terms outside of that cluster when building the
MoreLikeThisQuery.

What kind of a data structure would we need to achieve that?

Marvin Humphrey
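
As a rough illustration of the filtering idea above, here is a minimal Python sketch. It is a stand-in rather than a real clustering algorithm: each term is represented by a hypothetical term-by-document occurrence vector, and a term is dropped when its mean cosine similarity to the other candidates falls below a threshold. The function names, the 0.5 threshold, and the example vectors are all assumptions made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_outliers(term_vectors, min_avg_sim=0.5):
    """Keep terms that sit close to the other candidate terms.

    term_vectors: dict mapping term -> vector (here, one slot per document,
                  1 if the term occurs in that document, else 0).
    """
    terms = list(term_vectors)
    kept = []
    for term in terms:
        others = [t for t in terms if t != term]
        if not others:
            kept.append(term)  # a lone term has nothing to be compared with
            continue
        avg = sum(cosine(term_vectors[term], term_vectors[o])
                  for o in others) / len(others)
        if avg >= min_avg_sim:
            kept.append(term)
    return kept


# Invented occurrence vectors over six documents; document 0 is the source
# document, so every candidate term occurs there.
example_vectors = {
    "economics":  [1, 1, 1, 0, 0, 0],
    "capital":    [1, 1, 0, 1, 0, 0],
    "interest":   [1, 0, 1, 1, 0, 0],
    "investment": [1, 1, 1, 1, 0, 0],
    "addison":    [1, 0, 0, 0, 1, 1],
}

print(filter_outliers(example_vectors))
```

In this toy data the four economics terms co-occur in the same documents, while "addison" mostly appears elsewhere, so it is the one excluded. A full solution would still need an index structure that can produce such per-term vectors cheaply, which is the open question.
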