lucene-java-user mailing list archives

From Chad Hardin <char...@topiatechnology.com>
Subject Re: More like this returning similarities that are too generic
Date Tue, 08 Aug 2006 16:52:39 GMT
You're so right!  I'm totally new to Lucene (and text analysis,
searching, etc.), but now that you've shown me, I "get it".  Thank you so
much for your reply.


Chad


On Aug 8, 2006, at 12:45 AM, Chris Hostetter wrote:

>
> I've never used MoreLikeThis myself, but based on how I know it works,
> your problem probably has more to do with the size of your test corpus
> and the frequency of the words in your docs than with the size of the
> docs themselves.
>
> : There's still the issue of the queries from MoreLikeThis not
> : returning results for terms I had expected ("bikes").
>
> A quick glance at the source for MoreLikeThis turns up these lines...
>
>     /**
>      * Ignore terms with less than this frequency in the source doc.
>      * @see #getMinTermFreq
>      * @see #setMinTermFreq
>      */
>     public static final int DEFAULT_MIN_TERM_FREQ = 2;
>
>     /**
>      * Ignore words which do not occur in at least this many docs.
>      * @see #getMinDocFreq
>      * @see #setMinDocFreq
>      */
>     public static final int DEFALT_MIN_DOC_FREQ = 5;
>
> ...which I'm guessing means that unless a word appears in a doc at least
> twice, it's ignored for that doc, and unless a word appears in at least
> 5 docs, it's ignored completely.  That could easily explain your bike
> examples.
>
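To make the effect of those two defaults concrete, here is a minimal, self-contained sketch of the filtering as Hoss describes it. It is not the MoreLikeThis source: the class name MltThresholdSketch, its methods, and the sample counts are all hypothetical, and only the two threshold values (2 and 5) come from the constants quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the two thresholds; not the MoreLikeThis source. */
class MltThresholdSketch {
    static final int MIN_TERM_FREQ = 2; // same value as DEFAULT_MIN_TERM_FREQ above
    static final int MIN_DOC_FREQ = 5;  // same value as DEFALT_MIN_DOC_FREQ above

    /** Counts how often each token occurs in a single document. */
    static Map<String, Integer> termFreqs(List<String> tokens) {
        Map<String, Integer> tf = new TreeMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Returns the terms that pass both thresholds, in alphabetical order. */
    static List<String> selectTerms(Map<String, Integer> termFreqInDoc,
                                    Map<String, Integer> docFreq,
                                    int minTermFreq, int minDocFreq) {
        List<String> kept = new ArrayList<>();
        Map<String, Integer> sorted = new TreeMap<>(termFreqInDoc);
        for (Map.Entry<String, Integer> e : sorted.entrySet()) {
            if (e.getValue() < minTermFreq) {
                continue; // too rare inside the source doc
            }
            if (docFreq.getOrDefault(e.getKey(), 0) < minDocFreq) {
                continue; // appears in too few docs of the corpus
            }
            kept.add(e.getKey());
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up source doc: "bikes" occurs once, the other words twice.
        Map<String, Integer> tf = termFreqs(
            List.of("bikes", "from", "from", "can", "can", "trail", "trail"));
        // Made-up corpus-wide document frequencies.
        Map<String, Integer> df =
            Map.of("bikes", 3, "from", 40, "can", 35, "trail", 2);
        System.out.println(selectTerms(tf, df, MIN_TERM_FREQ, MIN_DOC_FREQ));
        // prints [can, from]
    }
}
```

With these invented counts, "bikes" is dropped because it occurs only once in the source doc and "trail" because it appears in fewer than 5 docs, so only the common words "can" and "from" survive. On a small test corpus that is roughly the behavior described in this thread.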
> : I then loaded some large (5K+) documents and I noticed that
> : MoreLikeThis's query started to return similar documents, but
> : explain() said they were similar because of words like "from" and
> : "can" rather than the text I expected to be used for similarity in
> : the documents.
>
> Other than a stop words list, one other thing you might consider is
> the notion of a "maxDocFreq" option you could set to ignore words that
> appear in lots of documents -- or a maxDocFreqRatio that would take a
> percentage of the total number of docs ... it should be fairly
> straightforward to add.
>
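The "maxDocFreqRatio" idea could be sketched roughly as below. This is only an illustration of the suggestion, not code from MoreLikeThis as quoted in this thread: the class MaxDocFreqSketch, the method dropCommonTerms, the 0.5 cutoff, and the sample counts are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a maxDocFreqRatio filter. */
class MaxDocFreqSketch {

    /**
     * Keeps only terms that appear in at most maxDocFreqRatio of all docs.
     * Terms are returned in alphabetical order.
     */
    static List<String> dropCommonTerms(Map<String, Integer> docFreq,
                                        int numDocs,
                                        double maxDocFreqRatio) {
        if (numDocs <= 0) {
            throw new IllegalArgumentException("numDocs must be positive");
        }
        List<String> kept = new ArrayList<>();
        Map<String, Integer> sorted = new TreeMap<>(docFreq);
        for (Map.Entry<String, Integer> e : sorted.entrySet()) {
            double ratio = (double) e.getValue() / numDocs;
            if (ratio <= maxDocFreqRatio) {
                kept.add(e.getKey()); // rare enough to be a useful signal
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up document frequencies for a 50-doc corpus.
        Map<String, Integer> df = Map.of("bikes", 6, "from", 48, "can", 45);
        System.out.println(dropCommonTerms(df, 50, 0.5));
        // prints [bikes]
    }
}
```

A ratio rather than an absolute count keeps the cutoff meaningful as the index grows, which is presumably why the percentage variant is suggested alongside a plain maxDocFreq.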
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



