Message-ID: <12874713.post@talk.nabble.com>
Date: Tue, 25 Sep 2007 00:50:50 -0700 (PDT)
From: climbingrose
To: solr-dev@lucene.apache.org
Subject: Re: Implication of not calling closeSearcher() in DirectUpdateHandler2?
In-Reply-To: <46F7DA3B.8020003@gmail.com>
References: <12857591.post@talk.nabble.com> <12861178.post@talk.nabble.com> <12861789.post@talk.nabble.com> <46F7DA3B.8020003@gmail.com>

Thanks Walter,

Unfortunately, some of our documents are near duplicates: they are mostly
identical (>75%) but usually not 100% identical. hashCode is very sensitive
to small changes, so it can't be used in our case.


Walter Ferrara-2 wrote:
>
> Solr has unique keys, which do that "avoid duplicates" work for you, so
> you may try to make some kind of unique identifier out of the text you're
> going to index, and use that as the Solr unique key.
>
> You could try to create a sort of hashCode or something like that from
> the text you are going to index, and use that as the uniqueKey of the
> schema. The next time you add the same text, you should get the same
> key, so Solr will not add it again but just update it (or at least it
> will be a lot simpler to tell whether that document is already present
> in the index).
>
> Any other thoughts?
> --
> Walter
>
> climbingrose wrote:
>>
>>>> You would get autowarming, etc, by default though - not what you want
>>>> from a searcher that is only used for deletions.
>>
>> As a workaround, I manually initialise an LRUCache instance in the DUH2
>> constructor. It works, but it is not very elegant because you can't view
>> the cache's statistics in Solr admin...
>>
>>>> What problem are you trying to solve that requires directly using or
>>>> modifying DUH2?
>>
>> I'm doing near-duplicate detection on a fairly large number of
>> documents. Each document to be added to Solr will be compared with
>> sample documents from all clusters in the index. I could, of course,
>> dedupe documents on the client side, but the performance will not be
>> as good.
>>
>> BTW, has anyone here done any serious near-duplicate detection with
>> Solr? If yes, what approaches did you use?
>>
>> Thanks.
>>
>

--
View this message in context: http://www.nabble.com/Implication-of-not-calling-closeSearcher%28%29-in-DirectUpdateHandler2--tf4508411.html#a12874713
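
To illustrate the point about hashCode being too sensitive for near duplicates, here is a minimal sketch. It is not the approach used in the thread and not a Solr API: the class NearDuplicateSketch, the choice of 3-word shingles, and the use of Jaccard similarity are all assumptions made for illustration. It compares the exact-match hash of two almost identical texts with a set-overlap score that degrades gradually.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: hypothetical class and method names, not part of Solr.
final class NearDuplicateSketch {

    // Collect the set of k-word shingles (overlapping word windows) of a text.
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return result;
    }

    // Jaccard similarity: |A intersect B| / |A union B|, a value in [0, 1].
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty shingle sets are treated as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        String original =
            "the quick brown fox jumps over the lazy dog near the quiet river bank today";
        String edited = original + " again"; // a one-word edit

        // An exact hash gives only a yes/no answer: equal or not equal.
        System.out.println("same hashCode: " + (original.hashCode() == edited.hashCode()));

        // A set-overlap score stays high when only a small part of the text changes.
        double similarity = jaccard(shingles(original, 3), shingles(edited, 3));
        System.out.println("jaccard similarity: " + similarity);
    }
}
```

With these two texts the shingle sets share 13 of 14 distinct shingles, so the similarity is 13/14 (about 0.93). A hash used as a unique key can only report exact equality, so it cannot express "mostly identical". Any cutoff for treating two documents as near duplicates would be a separate tuning decision.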