lucene-java-user mailing list archives

From Chris Lamprecht <>
Subject Re: Ideas Needed - Finding Duplicate Documents
Date Sun, 12 Jun 2005 22:57:15 GMT
I'd have to see your indexing code to see if there are any obvious
performance gotchas there.  If you can run your indexer under a
profiler (OptimizeIt, JProbe, or just the free one built into java,
enabled with -Xprof), it will tell you which methods most of your CPU
time is spent in.  If you're using StandardAnalyzer, that may be the
culprit -- StandardAnalyzer is a fairly advanced grammar-based parser,
but it is pretty slow.  If you don't need its functionality, try a
simpler Analyzer (like WhitespaceAnalyzer or a subclass).

As for changing a document within an index -- you can, but there is no
"update" operation for documents; there's just delete and add (and then
optimize).  Delete only marks docs as deleted (so they don't come back
in search results); they aren't physically removed from the index
files until you optimize.

Also, it isn't fatal that your current index doesn't have MD5 info in
it.  It's pretty fast to compute MD5 at search time for each document
returned (much faster than the I/O-bound part -- actually retrieving
the docs from the Lucene index).  So you could try just doing all your
duplicate detection at search time.  If this is too slow, you could
consider caching the computed MD5 for your docs.
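
Here is a rough sketch of what I mean by search-time duplicate
detection with a cache.  It uses only the JDK (java.security.MessageDigest),
not Lucene itself; the class name SearchTimeDeduper, the dedupe method and
the idea of passing in parallel lists of doc ids and sentence text are
all made up for illustration, so adapt them to however you pull hits
out of your index.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not a Lucene class): drops exact-duplicate hits at search time.
final class SearchTimeDeduper {
    // Cache of doc id -> MD5 hex digest, so each document is hashed at most once.
    private final Map<Integer, String> md5Cache = new HashMap<>();

    // Hex-encoded MD5 of the text, hashed as UTF-8 bytes.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns the ids of hits whose text has not appeared earlier in the
    // same result list, preserving the original ranking order.
    // docIds and texts are parallel lists: texts.get(i) belongs to docIds.get(i).
    List<Integer> dedupe(List<Integer> docIds, List<String> texts) {
        Set<String> seen = new HashSet<>();
        List<Integer> unique = new ArrayList<>();
        for (int i = 0; i < docIds.size(); i++) {
            final Integer id = docIds.get(i);
            final String text = texts.get(i);
            // Only hash when this doc id is not already in the cache.
            String hash = md5Cache.computeIfAbsent(id, k -> md5Hex(text));
            if (seen.add(hash)) {
                unique.add(id);
            }
        }
        return unique;
    }
}
```

One caveat if you key the cache on Lucene's document numbers: those
can change when segments are merged (for example after an optimize),
so the cache would need to be thrown away whenever the index changes.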


On 6/12/05, Dave Kor <> wrote:
> Thanks for the quick reply, Chris.
> Yes, when I say "duplicate" sentences, they are exact copies of the same string.
> The MD5 hash is a good idea, I wish I had thought of it earlier as it would have
> saved me a lot of trouble. Right now it is not feasible to reindex again because
> indexing is a very slow and cpu intensive task for me. I'm adding
> part-of-speech, chunk, named entity and coreference information as I index,
> which means it takes 4 separate servers and 4-5 days of processing to create a
> new index. And as far as I know, you can't change the index once it's created.
> Am I correct?
> Any other ideas that don't require me to re-index the whole thing?
