Message-ID: <88c6a67205061215577c9955ae@mail.gmail.com>
Date: Sun, 12 Jun 2005 17:57:15 -0500
From: Chris Lamprecht
To: java-user@lucene.apache.org,
    Dave Kor
Subject: Re: Ideas Needed - Finding Duplicate Documents
In-Reply-To: <1118612055.42acaa57af6bd@sms.ed.ac.uk>
References: <42ABFE88.8050206@axtelsoft.com> <1118587070.42ac48be751be@sms.ed.ac.uk> <88c6a67205061211465fc2bd23@mail.gmail.com> <1118612055.42acaa57af6bd@sms.ed.ac.uk>

I'd have to see your indexing code to tell whether there are any
obvious performance gotchas. If you can run your indexer under a
profiler (OptimizeIt, JProbe, or just the free one that comes with
java, via -Xprof), it will tell you in which methods most of your CPU
time is spent. If you're using StandardAnalyzer, that may be the
bottleneck -- StandardAnalyzer is a fairly advanced grammar-based
parser, but it is pretty slow. If you don't need its functionality,
try a simpler Analyzer (like WhitespaceAnalyzer or a subclass).

As for changing a document within an index -- there is no "update"
operation for documents; there's just delete and add (and then
optimize). Delete only marks docs as deleted (so they don't come back
in search results); they aren't physically removed from the index
files until you optimize.

Also, it isn't fatal that your current index doesn't have MD5 info in
it. It's pretty fast to compute MD5 at search time for each document
returned (much faster than the I/O-bound part, which is actually
retrieving the docs from the Lucene index). So you could try doing all
your duplicate detection at search time. If that is too slow, you
could consider caching the computed MD5 for your docs.

-chris

On 6/12/05, Dave Kor wrote:
> Thanks for the quick reply, Chris.
>
> Yes, when I say "duplicate" sentences, they are exact copies of the
> same string.
>
> The MD5 hash is a good idea; I wish I had thought of it earlier as it
> would have saved me a lot of trouble.
> Right now it is not feasible to reindex again because indexing is a
> very slow and CPU-intensive task for me. I'm adding part-of-speech,
> chunk, named entity and coreference information as I index, which
> means it takes 4 separate servers and 4-5 days of processing to
> create a new index. And as far as I know, you can't change the index
> once it's created. Am I correct?
>
> Any other ideas that don't require me to re-index the whole thing?
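
The search-time duplicate detection suggested in the reply above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name DuplicateFilter and its methods are hypothetical, and the list of strings stands in for the stored text you would pull out of each search hit. No Lucene calls are shown, only the standard java.security.MessageDigest API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: filters exact-duplicate texts
// out of a result list by comparing MD5 digests.
final class DuplicateFilter {

    // Returns the MD5 digest of the text's UTF-8 bytes as lowercase hex.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Keeps the first occurrence of each distinct text, preserving hit order.
    // hitTexts is assumed to hold the stored text of each hit, in rank order.
    static List<String> dedupe(List<String> hitTexts) {
        Set<String> seenDigests = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String text : hitTexts) {
            if (seenDigests.add(md5Hex(text))) {
                unique.add(text);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList(
                "the cat sat", "a dog barked", "the cat sat");
        // The second copy of "the cat sat" is dropped.
        System.out.println(dedupe(hits));
    }
}
```

For exact-copy duplicates a plain set of strings would work too; the digest is used here because it is a small fixed-size key, which makes it cheap to cache per document if computing it on every search turns out to be too slow.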