incubator-lucy-dev mailing list archives

From Marvin Humphrey
Subject MoreLikeThisQuery
Date Tue, 16 Mar 2010 05:17:36 GMT

Lucene has a MoreLikeThisQuery in contrib:

It functions by selecting a handful of high-value (i.e. rare) terms out of a
document and building up a composite ORQuery based on those. 

The thing that's always bothered me about its results is that it gets thrown
off by things like proper names.  

Proper names are often very rare, and thus highly discriminatory terms.  They
often pass all the heuristics that MoreLikeThisQuery uses: low doc_freq()
(meaning occurs in few documents), long token length (more than 5 characters),

The problem is that if you have e.g. two authors with the same (uncommon) last
name, but these authors write on entirely different subjects,
MoreLikeThisQuery will often conflate them.
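To make the failure mode concrete, here is a minimal sketch of that kind of term selection. It is not Lucene's actual implementation: the function names, the thresholds, the tf * idf ranking, and the toy corpus are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def select_terms(doc_tokens, corpus, max_terms=5, min_len=6, max_df_ratio=0.5):
    """Pick "high-value" terms from one document.

    Hypothetical heuristics: keep tokens longer than 5 characters that occur
    in at most max_df_ratio of all documents, then rank by tf * idf.
    """
    num_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each document once per term

    scored = []
    for term, tf in Counter(doc_tokens).items():
        df = doc_freq[term]
        if len(term) < min_len or df == 0 or df / num_docs > max_df_ratio:
            continue
        scored.append((tf * math.log(num_docs / df), term))

    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_terms]]


def or_query(terms, corpus):
    """Return indexes of documents containing at least one of the terms."""
    wanted = set(terms)
    return [i for i, tokens in enumerate(corpus) if wanted & set(tokens)]


# Toy corpus (made up): two authors named "addison", unrelated subjects.
corpus = [
    "addison economics capital interest investment capital".split(),
    "addison gardening perennials compost".split(),
    "economics investment markets".split(),
    "cooking recipes seasoning".split(),
]
terms = select_terms(corpus[0], corpus)
matches = or_query(terms, corpus)
```

In this toy setup the surname passes both filters, so the gardening document (index 1) matches the query for the economics document purely because of the shared name.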

However, there is a potential remedy available if we use clustering.  Say that
the heuristics yield this collection of terms:

    economics capital interest investment addison

One of these things is not like the others.  :)  Meaning, if you look at all
those terms in a vector space, most of them will be clustered together, but
one will be way far away.
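A rough sketch of what "way far away" could mean in practice, assuming each term is represented by a vector of the documents it occurs in and compared by cosine similarity. This is a simple mean-similarity outlier filter standing in for real clustering; the function names, the 0.5 threshold, and the toy corpus are assumptions, not part of any existing API.

```python
import math


def term_vectors(terms, corpus):
    """Map each term to a vector with one slot per document (1.0 if present)."""
    doc_sets = [set(doc) for doc in corpus]
    return {t: [1.0 if t in ds else 0.0 for ds in doc_sets] for t in terms}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_outliers(terms, corpus, min_mean_sim=0.5):
    """Keep terms whose mean similarity to the other terms is high enough."""
    if len(terms) < 2:
        return list(terms)
    vecs = term_vectors(terms, corpus)
    kept = []
    for t in terms:
        others = [u for u in terms if u != t]
        mean_sim = sum(cosine(vecs[t], vecs[u]) for u in others) / len(others)
        if mean_sim >= min_mean_sim:
            kept.append(t)
    return kept


# Toy corpus (made up): "addison" mostly occurs in unrelated documents.
corpus = [
    ["economics", "capital", "interest", "investment", "addison"],
    ["economics", "capital", "investment"],
    ["capital", "interest", "investment"],
    ["economics", "interest", "capital"],
    ["addison", "gardening", "compost"],
    ["addison", "gardening", "perennials"],
    ["addison", "roses"],
]
terms = ["economics", "capital", "interest", "investment", "addison"]
kept = drop_outliers(terms, corpus)
```

With this data the four economics terms co-occur with each other often enough to stay above the threshold, while the surname shares only one document with them and is dropped.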

What I'd like to do is identify the cluster that best represents the document,
and exclude any terms outside of that cluster when building the query.

What kind of a data structure would we need to achieve that?

Marvin Humphrey
