Hi everyone


We are using Lucene to search for possible duplicates in an address database. We create an index with a document for each person in the database. Each document has a field with one term for the first name, a field with one term for the last name and so on. I think in this setting it doesn’t make sense to let term frequency, inverse document frequency and friends influence the document score (or does it?). For this reason I’m thinking of overriding DefaultSimilarity to not take tf/idf into account when scoring.


Do you think that’s a reasonable thing to do? If so, how should I proceed? I’m looking for implementation details here: should I, for example, override the method that calculates the term frequency so that it just returns a constant? (Although, off the top of my head, I wouldn’t know what a sensible constant would be.)
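To make the idea concrete, here is a rough standalone sketch of what I have in mind. It is plain Java and does not use Lucene: ClassicLikeSimilarity, FlatSimilarity and score() are hypothetical names, the tf/idf formulas are my understanding of the classic defaults (square root of the frequency, and a log-based idf), and the scoring is simplified to a sum of tf * idf^2 over the matching fields. The constant 1 is only a guess, on the assumption that these factors are multiplied together.

```java
// Standalone sketch: none of these classes are Lucene classes.
class FlatScoringSketch {

    // Hypothetical stand-in for the two scoring hooks in question,
    // modeled on the classic tf/idf defaults (an assumption).
    static class ClassicLikeSimilarity {
        float tf(float freq) {
            return (float) Math.sqrt(freq);
        }

        float idf(int docFreq, int numDocs) {
            return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
        }
    }

    // The override being asked about: every matching term counts the same.
    static class FlatSimilarity extends ClassicLikeSimilarity {
        @Override
        float tf(float freq) {
            return freq > 0 ? 1.0f : 0.0f;
        }

        @Override
        float idf(int docFreq, int numDocs) {
            return 1.0f;
        }
    }

    // Simplified score: each matched field holds its term exactly once,
    // and contributes tf * idf^2 (norms, boosts and coord are left out).
    static float score(ClassicLikeSimilarity sim, int[] docFreqsOfMatchedTerms, int numDocs) {
        float sum = 0.0f;
        for (int docFreq : docFreqsOfMatchedTerms) {
            float idf = sim.idf(docFreq, numDocs);
            sum += sim.tf(1.0f) * idf * idf;
        }
        return sum;
    }

    public static void main(String[] args) {
        int numDocs = 100000;
        // Made-up document frequencies: a common first name and a rare last name.
        int[] matched = {5000, 3};

        System.out.println("classic-like: " + score(new ClassicLikeSimilarity(), matched, numDocs));
        System.out.println("flat:         " + score(new FlatSimilarity(), matched, numDocs));
    }
}
```

With the flat variant the score is just the number of matching fields, whereas with the classic-like variant a match on a rare last name outweighs a match on a common first name. I am not sure which of the two behaviours is actually better for finding duplicates.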


