lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Why would one not use RemoveDuplicatesTokenFilterFactory?
Date Fri, 24 May 2013 13:04:52 GMT
The primary purpose of this filter is to work in conjunction with 
KeywordRepeatFilterFactory and a stemmer. KeywordRepeatFilterFactory emits 
each token twice at the same position, once as a keyword and once for the 
stemmer. When the stemmer leaves a token unchanged, the two copies are 
identical, and this filter removes the redundant one. The goal is to index 
both the stemmed and unstemmed terms at the same position.
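
The chain described above can be sketched as a toy model. This is plain
Python, not Lucene or Solr code: the function names, the tuple layout, and
the "strip a trailing s" stemmer are all assumptions made for illustration.
It only shows why the duplicate filter is useful after a keyword repeat and
a stemmer.

```python
# Toy model of the analysis chain; NOT Lucene code.
# A token is (text, position, is_keyword). All names here are hypothetical.

def keyword_repeat(tokens):
    """Emit each token twice at the same position: keyword copy, then stemmable copy."""
    out = []
    for text, pos in tokens:
        out.append((text, pos, True))   # keyword copy, protected from stemming
        out.append((text, pos, False))  # copy that the stemmer may change
    return out

def toy_stem(tokens):
    """Hypothetical stemmer: strips a trailing 's' and skips keyword copies."""
    out = []
    for text, pos, is_keyword in tokens:
        if not is_keyword and len(text) > 3 and text.endswith("s"):
            text = text[:-1]
        out.append((text, pos, is_keyword))
    return out

def remove_duplicates(tokens):
    """Drop a token whose text was already seen at the same position."""
    seen = set()
    out = []
    for text, pos, is_keyword in tokens:
        if (text, pos) in seen:
            continue
        seen.add((text, pos))
        out.append((text, pos, is_keyword))
    return out

def analyze(words):
    """Run the three toy filters in order and return (text, position) pairs."""
    tokens = [(w, i) for i, w in enumerate(words)]
    chain = remove_duplicates(toy_stem(keyword_repeat(tokens)))
    return [(text, pos) for text, pos, _ in chain]

print(analyze(["dogs", "bark"]))
```

In this sketch "dogs" keeps both "dogs" and "dog" at position 0, while
"bark", which the toy stemmer does not change, is emitted only once.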

Whether your app is using the filter for that purpose remains to be seen.

Removing duplicates from the raw input token stream would change the term 
frequency, and with it relevance scoring.
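
A minimal sketch of that effect, again in plain Python rather than Lucene
code. The helper dedupe_whole_stream is hypothetical and deliberately
removes repeats across the whole stream. The Lucene filter itself is
documented as dropping only tokens that repeat the same text at the same
position, so repeats at different positions would normally survive it.

```python
from collections import Counter

def dedupe_whole_stream(tokens):
    """Hypothetical: keep only the first occurrence of each term, ignoring position."""
    seen = set()
    out = []
    for term in tokens:
        if term not in seen:
            seen.add(term)
            out.append(term)
    return out

raw = ["solr", "index", "solr", "query", "solr"]

before = Counter(raw)                      # term frequencies of the raw stream
after = Counter(dedupe_whole_stream(raw))  # every term now counted once

print(before["solr"], after["solr"])
```

Here the count for "solr" drops from 3 to 1, which is the kind of term
frequency change the reply warns about.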

-- Jack Krupansky

-----Original Message----- 
From: Dotan Cohen
Sent: Friday, May 24, 2013 3:03 AM
To: solr-user@lucene.apache.org
Subject: Why would one not use RemoveDuplicatesTokenFilterFactory?

I am looking through the schema of a Solr installation that I
inherited last year. The original dev, who is unavailable for comment,
has two types of text fields: one with
RemoveDuplicatesTokenFilterFactory and one without. These fields are
intended for full-text search.

Why would someone _not_ use RemoveDuplicatesTokenFilterFactory on a
field intended for full-text search? What are the drawbacks to using
it? This application is very write-heavy (hundreds of writes per
minute), if that matters. It was running on websolr.com at the time;
I have since moved it to Amazon Web Services.

Thanks.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com 

