Date: Mon, 15 Mar 2010 22:17:36 -0700
To: lucy-dev@lucene.apache.org
Subject: MoreLikeThisQuery
Message-ID: <20100316051735.GB27885@rectangular.com>
From: Marvin Humphrey <marvin@rectangular.com>

Greets,

Lucene has a MoreLikeThisQuery in contrib:

  http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/similar/MoreLikeThis.html

It functions by selecting a handful of high-value (i.e. rare) terms out of a
document and building up a composite ORQuery based on those. 

The thing that's always bothered me about its results is that it gets thrown
off by things like proper names.  

Proper names are often very rare, and thus highly discriminating terms.  They
often pass all the heuristics that MoreLikeThisQuery uses: low doc_freq()
(meaning the term occurs in few documents), long token length (more than 5
characters), etc.
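
To make the failure mode concrete, here is a minimal sketch of that style of
term selection.  It is not Lucene's actual implementation: the doc_freq table
is a hypothetical in-memory dict standing in for an index lookup, and
select_terms(), build_or_query(), the thresholds, and the tf * idf score are
all assumptions made for illustration.

```python
import math
from collections import Counter

def select_terms(doc_tokens, doc_freq, num_docs, max_terms=5,
                 min_len=6, max_df_ratio=0.1):
    """Pick up to max_terms rare, long terms from one document.

    doc_freq maps term -> number of documents containing it.  It is a
    hypothetical stand-in for an index's doc_freq() lookup.
    """
    candidates = []
    for term, count in Counter(doc_tokens).items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term unknown to the (hypothetical) index
        if len(term) < min_len:
            continue  # keep only tokens longer than 5 characters
        if df / num_docs > max_df_ratio:
            continue  # too common to discriminate between documents
        # Assumed scoring: term frequency times inverse document frequency.
        candidates.append((count * math.log(num_docs / df), term))
    # Highest score first; break ties alphabetically for determinism.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in candidates[:max_terms]]

def build_or_query(terms):
    """Join the selected terms into a flat OR query string."""
    return " OR ".join(terms)
```

With a doc_freq table in which "addison" appears in only a couple of
documents, the name outranks every topical term, which is exactly the
behavior described above.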

The problem is that if you have e.g. two authors with the same (uncommon) last
name, but these authors write on entirely different subjects,
MoreLikeThisQuery will often conflate them.

However, there is a potential remedy available if we use clustering.  Say that
the heuristics yield this collection of terms:

    economics capital interest investment addison 
  
One of these things is not like the others.  :)  Meaning, if you look at all
those terms in a vector space, most of them will be clustered together, but
one will be way far away.

What I'd like to do is identify the cluster that best represents the document,
and exclude any terms outside of that cluster when building the
MoreLikeThisQuery.   
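
As a rough sketch of the idea, assuming each term can be represented by the
set of documents it occurs in: compare terms by the cosine similarity of
those sets, link terms whose similarity clears a threshold, and keep the
largest linked group.  The postings dict, the 0.5 threshold, single-link
grouping, and treating "largest cluster" as "best represents the document"
are all assumptions here, and this says nothing yet about how to do it
efficiently against a real index.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two terms given as sets of doc ids."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def largest_cluster(postings, threshold=0.5):
    """Single-link cluster the terms and return the largest group.

    postings maps term -> set of ids of documents containing the term
    (a hypothetical in-memory stand-in for real posting lists).
    """
    terms = sorted(postings)
    parent = {t: t for t in terms}

    def find(t):
        # Union-find root lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    # Link every pair of terms that co-occur often enough.
    for i, s in enumerate(terms):
        for t in terms[i + 1:]:
            if cosine(postings[s], postings[t]) >= threshold:
                parent[find(s)] = find(t)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), []).append(t)
    return max(clusters.values(), key=len)
```

Fed postings in which the four economics terms share most of their documents
and "addison" shares almost none, this keeps the four topical terms and
drops the name.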

What kind of a data structure would we need to achieve that?

Marvin Humphrey