lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Israel Tsadok" <>
Subject Any way to ignore repeated terms in TF calculation?
Date Thu, 25 Dec 2008 13:20:27 GMT
A recurring problem I have with Lucene results is when a document contains
the same word over and over again. If for some reason I have a document
containing "badger badger badger badger badger badger badger badger", it
would appear high on the search results for "badger", even though it's
usually a useless document.
What I would like to do is ignore repeating words when counting the term
frequency. At first, I thought I could achieve this by indexing with a
TokenFilter that would skip repeated tokens, but then a search for e.g.
"Rochelle Rochelle" would return no results.

What I would like is to index all 8 "badger"s, but have the frequency of
"badger" saved as 1. Is that even possible?

Digging around in Lucene code, I found term frequency calculations
in FreqProxTermsWriterPerField.addTerm() - is that where I need to look?

Any help would be appreciated.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message