lucene-solr-dev mailing list archives

From climbingrose <climbingr...@gmail.com>
Subject Re: Implication of not calling closeSearcher() in DirectUpdateHandler2?
Date Tue, 25 Sep 2007 07:50:50 GMT

Thanks Walter, 

Unfortunately, some of our documents are near duplicates: they are mostly
identical (>75%) but usually not 100% identical. A hash code is very
sensitive to small changes in the text, so it can't be used in our case.
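
To illustrate the difference, here is a minimal sketch. It is not the code
running in DUH2, and the class and method names are made up for this example.
It assumes similarity is measured as the Jaccard overlap of two-word shingles,
which is only one of several possible measures, and it uses the 75% figure
above as the threshold. The two sample strings differ by a single appended
word, so their String.hashCode() values are unrelated, while their shingle
sets still overlap heavily.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Solr: compares two texts by word shingles.
class NearDuplicateSketch {

    // Collect every run of k consecutive lower-cased words (a "shingle").
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return result;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty sets are treated as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    // True when the shingle overlap reaches the given threshold (e.g. 0.75).
    static boolean isNearDuplicate(String x, String y, int k, double threshold) {
        return jaccard(shingles(x, k), shingles(y, k)) >= threshold;
    }

    public static void main(String[] args) {
        String original = "the quick brown fox jumps over the lazy dog near the river";
        String edited = original + " today";

        // An exact hash gives no hint that the two texts are related.
        System.out.println("hashCode original: " + original.hashCode());
        System.out.println("hashCode edited:   " + edited.hashCode());

        // The shingle overlap stays high after the small edit.
        System.out.println("jaccard: " + jaccard(shingles(original, 2), shingles(edited, 2)));
        System.out.println("near duplicate: " + isNearDuplicate(original, edited, 2, 0.75));
    }
}
```

Comparing every incoming document against the whole index this way would be
too slow, which is why I only compare against sample documents from each
cluster.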


Walter Ferrara-2 wrote:
> 
> Solr has unique keys, which do that "avoid duplicates" work for you, so
> you could try to make some kind of unique identifier out of the text you're
> going to index, and use that as the Solr <uniqueKey>.
> 
> You could try to create a sort of hashCode or something like that from
> the text you are going to index, and use that as the uniqueKey of the
> schema. The next time you add the same text, you should
> get the same key, so Solr will not add it again but just update it
> (or at least it will be a lot simpler to tell whether that document is
> already present in the index).
> 
> any other thoughts?
> --
> Walter
> 
> climbingrose wrote:
>>
>>>> You would get autowarming, etc, by default though - not what you want
>>>> from a searcher that is only used for deletions.
>>
>> As a workaround, I manually initialise an LRUCache instance in the DUH2
>> constructor. It works, but it's not very elegant because you can't view the
>> cache's statistics in the Solr admin...
>>
>>   
>>>> What problem are you trying to solve that requires directly using or
>>>> modifying DUH2?
>>>>       
>>
>> I'm doing near-duplicate detection on a fairly large number of documents.
>> Each document to be added to Solr is compared with sample documents
>> from all clusters in the index. I could, of course, dedupe documents on the
>> client side, but the performance would not be as good.
>>
>> BTW, has anyone here done any serious near-duplicate detection with Solr?
>> If yes, what approaches did you use?
>>
>> Thanks.
>>   
> 
> 

-- 
View this message in context: http://www.nabble.com/Implication-of-not-calling-closeSearcher%28%29-in-DirectUpdateHandler2--tf4508411.html#a12874713
Sent from the Solr - Dev mailing list archive at Nabble.com.

