lucene-general mailing list archives

From chaiguy1337 <>
Subject RE: Simplest way to check for an exact match on an tokenized/stored field?
Date Mon, 27 Oct 2008 17:27:02 GMT

Thanks for the reply, Steve. Yes, I am aware that the standard analyzer can
"match" two different texts. That's why my idea included a final pass
iterating over each of the (potentially multiple) returned results and
checking the actual content of the field explicitly against the target
document for equality. This is possible since the field is stored (as well
as tokenized).

Yes, the point of this is to find duplicates, or rather identify a duplicate
before it is stored.

Let me simplify the question to this: Which is more efficient:

1) perform a TermQuery involving the entire content of a document (usually
this will be small, but could theoretically be the entire contents of a text
file), then iterate over its hits and check each one manually for string
equality based on the stored field, or

2) compute a hash/digest for each stored document and store it as a keyword
field and use this to identify matches. I should still technically iterate
and verify the match since it is possible (though presumably much less
likely) for two documents to have the same hash.

Now that I think about it, the latter is probably better, even though it
involves storing additional data in the index. I get the impression it will
be more efficient.
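
A minimal sketch of option 2, assuming a plain in-memory map as a stand-in for the index: the map key plays the role of the keyword digest field and the list holds the stored texts. The class and method names (DuplicateCheck, digest, contains, addIfAbsent) are hypothetical and nothing here is Lucene API. The digest is deliberately weak (String.hashCode) so that a collision is easy to produce, which shows why the final equality check is still needed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an index with a stored text field and a
// keyword "digest" field. Not Lucene API.
final class DuplicateCheck {
    // digest -> all stored texts that share that digest
    private final Map<String, List<String>> byDigest = new HashMap<>();

    // Deliberately weak digest (assumption for illustration only);
    // a real setup would use a cryptographic hash.
    static String digest(String text) {
        return Integer.toHexString(text.hashCode());
    }

    // Step 1: look up candidates by digest.
    // Step 2: verify each candidate by exact string equality.
    boolean contains(String text) {
        List<String> candidates = byDigest.get(digest(text));
        if (candidates == null) {
            return false;
        }
        for (String stored : candidates) {
            if (stored.equals(text)) {
                return true;
            }
        }
        return false;
    }

    // Stores the text only if no exact duplicate exists; returns true if stored.
    boolean addIfAbsent(String text) {
        if (contains(text)) {
            return false;
        }
        byDigest.computeIfAbsent(digest(text), k -> new ArrayList<>()).add(text);
        return true;
    }
}
```

With this digest, "Aa" and "BB" collide, yet the equality pass keeps them apart: after adding "Aa", a lookup for "BB" finds one candidate and rejects it.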

On that note, are there any gotchas to watch out for in computing a string
hash? I presume it would also have to be represented as a string for Lucene
to index it properly.
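
For illustration, one way to get a digest that is already a string is to hash the text's bytes and hex-encode the result. This is only a sketch: the choice of SHA-256, the class name TextDigest, and the method name sha256Hex are assumptions, and it uses only the Java standard library (java.nio.charset.StandardCharsets requires Java 7 or later). The main gotcha it addresses is character encoding: the bytes are taken with an explicit charset so the same text always hashes to the same value regardless of the platform default.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Lucene.
final class TextDigest {
    // Returns the SHA-256 digest of the text as 64 lowercase hex characters.
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            // Explicit charset: the platform default may differ between machines.
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // Emit high and low nibble so every byte yields two characters.
                hex.append(Character.forDigit((b >> 4) & 0xF, 16));
                hex.append(Character.forDigit(b & 0xF, 16));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }
}
```

The digest is sensitive to every character, so texts that differ only in trailing whitespace or line endings produce different values. Whether such texts should count as duplicates has to be decided (and any normalization applied) before hashing.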


Steven A Rowe wrote:
> Hi chaiguy1337,
> On 10/26/2008 at 6:09 PM, chaiguy1337 wrote:
>> Hi group. I have a Lucene index that contains a bunch of text documents,
>> which are both tokenized (using the standard analyzer, not
>> KeywordAnalyzer) and stored. Preferably without having to create a
>> duplicate KeywordAnalyzer-tokenized field, what is the simplest (and/or
>> most efficient) way to check for an existing exact match on that field?
>> Currently my best guess is to perform a TermQuery containing
>> the entire text of the document to check, and then perform a
>> second pass over each of the results checking the field for
>> explicit equality.
> The StandardAnalyzer can produce the same set of tokens for two
> non-identical texts, especially if you are using stop words, so depending
> on how strictly you define "exact match", you may have to re-index.
> What are you trying to do?  If you're searching for duplicates, it may
> make sense for you to compute a digest of some form and store that for
> comparison purposes in another field.
> Steve
