lucene-general mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: Simplest way to check for an exact match on an tokenized/stored field?
Date Mon, 27 Oct 2008 17:18:59 GMT
Hi chaiguy1337,

On 10/26/2008 at 6:09 PM, chaiguy1337 wrote:
> Hi group. I have a Lucene index that contains a bunch of text documents,
> which are both tokenized (using the standard analyzer, not
> KeywordAnalyzer) and stored. Preferably without having to create a
> duplicate KeywordAnalyzer-tokenized field, what is the simplest (and/or
> most efficient) way to check for an existing exact match on that field?
> 
> Currently my best guess is to perform a TermQuery containing
> the entire text of the document to check, and then perform a
> second pass over each of the results checking the field for
> explicit equality.

The StandardAnalyzer can produce the same set of tokens for two non-identical texts, especially
if you are using stop words, so depending on how strictly you define "exact match", you may
have to re-index.
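
To illustrate the point about token collisions, here is a minimal sketch in plain Java. It does not use Lucene: the analyze() method is a simplified stand-in for an analyzer (lowercase, split on non-alphanumeric characters, drop stop words), and both the class name and the abbreviated stop word list are assumptions made for the example. The real StandardAnalyzer tokenizes differently in its details, but the effect is the same: case, punctuation, and stop words are lost.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TokenCollisionSketch {
    // Hypothetical, abbreviated stop word list used only for this illustration.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "is"));

    // Simplified stand-in for an analyzer (NOT Lucene's StandardAnalyzer):
    // lowercase the text, split on runs of non-alphanumerics, drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty() && !STOP_WORDS.contains(piece)) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String first = "The Quick, Brown Fox!";
        String second = "quick brown fox";

        // The raw strings differ, yet the token lists are identical.
        System.out.println("texts equal:  " + first.equals(second));
        System.out.println("tokens equal: " + analyze(first).equals(analyze(second)));
    }
}
```

Any check built only on the analyzed tokens would treat these two texts as the same document, which is why the definition of "exact match" matters here.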

What are you trying to do?  If you're searching for duplicates, it may make sense for you
to compute a digest of some form and store that for comparison purposes in another field.
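
A minimal sketch of the digest idea, using only the JDK. The class and method names are made up for the example, and SHA-256 is just one reasonable choice of digest. How you add the resulting hex string to the index depends on your Lucene version, so that part is not shown; the assumption is that it goes into a separate, untokenized field so a single TermQuery on that field finds candidates.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ExactMatchDigest {
    // Returns the SHA-256 digest of the text's UTF-8 bytes as lowercase hex.
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        String stored = "The Quick, Brown Fox!";
        String candidate = "quick brown fox";

        // Texts that differ in any character get different digests,
        // while identical texts always get the same one.
        System.out.println(sha256Hex(stored).equals(sha256Hex(candidate)));
        System.out.println(sha256Hex(stored).equals(sha256Hex("The Quick, Brown Fox!")));
    }
}
```

If you want to be strict, a matching digest can still be confirmed by comparing the stored field text directly, which costs one extra string comparison per hit.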

Steve

