lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chad Hardin <char...@topiatechnology.com>
Subject Re: More like this returning similarities that are too generic
Date Tue, 08 Aug 2006 01:37:38 GMT
Thank you Erick, that was what I anticipated would be necessary.

There's still the issue of the queries from MoreLikeThis not  
returning results for terms I had expected ("bikes").

For example, I have these four very short documents:

"bikes are a handy tool for getting from diffrent locations on the  
earth, Ireally like bikes"
"Bikes can be really fast"
"bikes can be dangerous"
"I like bikes because they are fun"


I expected the query from MoreLikeThis to say that these short  
documents are similar because they all have "Bikes", but it doesn't.

Is there some reason for this to be so?  Perhaps because the  
documents are so short?

I then loaded some large (5K+) documents and I noticed that  
MoreLikeThis's query started to return similar documents, but explain 
() said they were similar because of words like "from" and "can" rather
than the text I expected to be used for similarity in the documents.




Chad



On Aug 7, 2006, at 1:29 PM, Erick Erickson wrote:

> Well, I expect that defining "less common" is tricky and doesn't  
> lend itself
> to a canned answer <G>. Would it work to create your own list of  
> stop words
> (possibly very large) to use for indexing and/or searching? This would
> simply exclude the "less common" words (as you define them).
> StandardAnalyzer, for instance, can take a File of stop words in  
> one of its
> constructors.......
>
> Erick
>
> On 8/7/06, Chad Hardin <chardin@topiatechnology.com> wrote:
>>
>> hi all,
>>
>> I'm new to lucene but I'm loving it!  I'm writing a prototype that
>> links documents together based upon similarities.  Obviously the
>> first thing I did was use MoreLikeThis.  However, it seems to be
>> finding matches based upon words that are too common, in this case
>> the words "from" and "can" and seems to be missing matches using the
>> terms I would expect (in this case documents about "bikes").
>>
>> I seems I need a more custom tailored Filter that only passes through
>> more less-common words.  Does something like this already exist?
>>
>> Thanks,
>>
>> Chad
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message