lucene-solr-dev mailing list archives

From climbingrose <>
Subject Re: Implication of not calling closeSearcher() in DirectUpdateHandler2?
Date Tue, 25 Sep 2007 07:50:50 GMT

Thanks Walter, 

Unfortunately, some of our documents are near duplicates: they are mostly
identical (>75%) but usually not 100% identical. A hash code is very
sensitive to small changes, so it can't be used in our case.
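
To illustrate the sensitivity mentioned above, here is a minimal sketch of a
content-derived key. The class and method names (ContentKeyDemo, contentKey)
are hypothetical and not part of Solr, and MD5 is only an assumed stand-in for
"a sort of hashCode". Identical text always maps to the same key, but a
one-character edit produces an unrelated key, so such a key can only catch
exact duplicates.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Solr: derives an exact-duplicate key from text.
class ContentKeyDemo {

    /** Returns the hex MD5 digest of the text (assumed stand-in for a content hash). */
    static String contentKey(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "Solr is an enterprise search server built on Lucene";
        String nearDuplicate = original + "."; // differs by a single character

        // Same text, same key: an exact duplicate would be detected.
        System.out.println(contentKey(original).equals(contentKey(original)));
        // Near duplicate, different key: it would be indexed as a new document.
        System.out.println(contentKey(original).equals(contentKey(nearDuplicate)));
    }
}
```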

Walter Ferrara-2 wrote:
> Solr has unique keys, which do that "avoid duplicates" work for you, so
> you may try to make some kind of unique identifier out of the text you're
> going to index, and use that as the Solr <uniqueKey>.
> You could try to create a sort of hashCode or something like that from
> the text you are going to index, and use that as the uniqueKey of the
> schema - the next time you add the same text, you should
> get the same key, so Solr will not add it again, but just update it
> (or at least it will be a lot simpler to tell whether that document is
> already present in the index).
> Any other thoughts?
> --
> Walter
> climbingrose wrote:
>>>> You would get autowarming, etc., by default though - not what you want
>>>> from a searcher that is only used for deletions.
>> As a workaround, I manually initialise an LRUCache instance in the DUH2
>> constructor. It works but is not very elegant, because you can't view the
>> cache's statistics in the Solr admin...
>>>> What problem are you trying to solve that requires directly using or
>>>> modifying DUH2?
>> I'm doing near-duplicate detection on a fairly large number of
>> documents.
>> Each document to be added to Solr will be compared with sample documents
>> from all clusters in the index. I could, of course, dedupe documents on
>> the client side, but the performance would not be as good.
>> BTW, has anyone here done any serious near-duplicate detection with
>> Solr?
>> If so, what approaches did you use?
>> Thanks.
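
The comparison against cluster samples described in the quoted message could be
sketched roughly as follows. This is an assumption-laden illustration, not the
poster's implementation: the thread does not say which similarity measure was
used, so Jaccard similarity over word shingles is swapped in here, and the
names NearDuplicateCheck, shingles, jaccard and findCluster are hypothetical.
The 0.75 threshold borrows the ">75%" figure from the message, although the
message does not define how that percentage was measured.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Solr code: near-duplicate check against cluster samples.
class NearDuplicateCheck {

    /** Returns the set of overlapping k-word windows (shingles) in the text. */
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> result = new HashSet<>();
        if (words.length < k) {
            // Text shorter than one window: treat the whole text as one shingle.
            result.add(String.join(" ", words));
            return result;
        }
        for (int i = 0; i + k <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return result;
    }

    /** Jaccard similarity: size of the intersection divided by size of the union. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    /** Returns the index of the first sample at or above the threshold, or -1 if none. */
    static int findCluster(String doc, List<String> clusterSamples, int k, double threshold) {
        Set<String> docShingles = shingles(doc, k);
        for (int i = 0; i < clusterSamples.size(); i++) {
            if (jaccard(docShingles, shingles(clusterSamples.get(i), k)) >= threshold) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
                "completely unrelated text about cooking pasta",
                "the quick brown fox jumps over the lazy dog");
        String incoming = "the quick brown fox jumps over the lazy dog today";
        // Assumed settings: 2-word shingles and a 0.75 similarity threshold.
        System.out.println(findCluster(incoming, samples, 2, 0.75));
    }
}
```

A linear scan over one sample per cluster keeps the sketch short; where the
comparison runs (client side or inside the update handler) is a separate
question that the sketch does not address.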
