lucene-java-user mailing list archives

From Karl Wettin <karl.wet...@gmail.com>
Subject Re: Any way to ignore repeated terms in TF calculation?
Date Sat, 27 Dec 2008 00:38:03 GMT
Hi Israel,

You can solve your problem at search time by setting a custom
Similarity on your searcher, something like this:

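>   // Flatten term frequency: every matching document gets tf = 1,
>   // no matter how often the term occurs in it.
>   private Similarity similarity = new DefaultSimilarity() {
>     // float variant, used for phrase and span matches
>     public float tf(float freq) {
>       return 1f;
>     }
>     // int variant, used for plain term matches
>     public float tf(int freq) {
>       return 1f;
>     }
>   };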


See the Similarity javadocs for details.
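
To make the effect concrete, here is a minimal sketch in plain Java, with no Lucene dependency. It assumes the stock behaviour of DefaultSimilarity, where tf is the square root of the term frequency, and compares it with the flat override. The class and method names (FlatTfSketch, defaultTf, flatTf) are made up for illustration and are not part of Lucene.

```java
// Illustrative only: mimics the tf factor, not the full Lucene scoring formula.
class FlatTfSketch {

    // Assumed default: DefaultSimilarity computes tf as sqrt(freq),
    // so repeated terms still push a document up the ranking.
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // The override from this message: every matching document gets tf = 1.
    static float flatTf(float freq) {
        return 1f;
    }

    public static void main(String[] args) {
        // One "badger" versus eight "badger"s in a document.
        float[] freqs = {1f, 8f};
        for (float freq : freqs) {
            System.out.println("freq=" + freq
                    + " defaultTf=" + defaultTf(freq)
                    + " flatTf=" + flatTf(freq));
        }
    }
}
```

With the default, the eight-badger document gets a tf factor roughly 2.8 times that of a document mentioning badger once; with the override both get the same factor, so the other scoring factors (idf, length norms, boosts) decide the order. In Lucene itself the override should be installed on the searcher, e.g. searcher.setSimilarity(similarity), before running the query.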

   karl

On 25 Dec 2008, at 14:20, Israel Tsadok wrote:

> A recurring problem I have with Lucene results is when a document contains
> the same word over and over again. If for some reason I have a document
> containing "badger badger badger badger badger badger badger badger", it
> would appear high on the search results for "badger", even though it's
> usually a useless document.
> What I would like to do is ignore repeating words when counting the term
> frequency. At first, I thought I could achieve this by indexing with a
> TokenFilter that would skip repeated tokens, but then a search for e.g.
> "Rochelle Rochelle" would return no results.
>
> What I would like is to index all 8 "badger"s, but have the frequency of
> "badger" saved as 1. Is that even possible?
>
> Digging around in Lucene code, I found term frequency calculations
> in FreqProxTermsWriterPerField.addTerm() - is that where I need to look?
>
> Any help would be appreciated.
> Israel

