lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <>
Subject Re: More like this returning similarities that are too generic
Date Tue, 08 Aug 2006 07:45:58 GMT

I've never used MoreLikeThis myself, but based on how i know it works,
your problem probably has more to do with the size of your test corpus and
th frequency of the words in your docs then by the size of the docs

: There's still the issue of the queries from MoreLikeThis not
: returning results for terms I had expected ("bikes").

A quick glance at the source for MoreLikeThis turns up these lines...

     * Ignore terms with less than this frequency in the source doc.
	 * @see #getMinTermFreq
	 * @see #setMinTermFreq
    public static final int DEFAULT_MIN_TERM_FREQ = 2;

     * Ignore words which do not occur in at least this many docs.
	 * @see #getMinDocFreq
	 * @see #setMinDocFreq
    public static final int DEFALT_MIN_DOC_FREQ = 5;

...which i'm guessing mean that unless a word appears in a doc at least
twice, it's ignored for that doc, and unless a word appears in at least 5
docs, it's ignored completely.  that could easily explain your bike

: I then loaded some large (5K+) documents and I noticed that
: MoreLikeThis's query started to return similar documents, but explain
: () said they were similar because of words like "from" and "can" rather
: than the text I expected to be used for similarity in the documents.

Other then a stop words list, one other thing you might consider is
the notion of a "maxDocFreq" option you could set to ignore words that
appear in lots of documents -- or a maxDocFreqRatio that would take a
percentage of the total number of docs ... it should be fairly
straightforward to add.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message