Date: Tue, 5 May 2009 20:38:54 -0400
Subject: Re: How to not overwrite a Document if it 'already exists'?
From: Michael McCandless
To: java-user@lucene.apache.org

On Tue, May 5, 2009 at 7:24 PM, Antony Bowesman wrote:
> Michael McCandless wrote:
>>
>> Lucene doesn't provide any way to do this, except opening a reader.
>>
>> Opening a reader is not "that" expensive if you use it for this
>> purpose.  EG neither norms nor FieldCache will be loaded if you just
>> enumerate the term docs.
>
> Thanks for that info.  These indexes will be large, in the 10s of millions.
> id field is unique and is 29 bytes.  I guess that's still a lot of data to
> trawl through to get to the term.

Have you tested how long it takes to look up docs from your id?

>> But, you can let Lucene do the same thing for you by just always using
>> updateDocument, which'll remove the old doc if it's present.
>
> That's precisely what I don't want to occur.  I have two forms of a
> Document, which represent mail items.  One 'full' version containing all
> index and stored data, which represents a searchable mail item and one
> 'base', which is simply a marker Document which represents a mail in a
> forwarded mail chain, with just a couple of stored fields containing the
> mail meta data.
>
> Under normal circumstances there are no problems as mails arrive in sequence
> and are never handled twice, but there is one case, during a reindex op,
> when the arrival of those mails can come out of sequence, i.e. a full mail
> is indexed first, but that mail is later processed as part of a forwarded
> mail chain of another mail.
>
> It is the second time that mail is handled as a base mail that I do not want
> it to overwrite the full version.
>
> Would it be technically difficult to support something like this in the
> IndexWriter API and if not, would it end up being more efficient that using
> a reader/terms to check this?

Couldn't you just give the base & full docs different ids?  Then you can
independently choose which one to update?

Mike
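
A minimal sketch of the "different ids" idea, assuming an in-memory map as a stand-in for the index (this is not Lucene code). The class name, the `full:` / `base:` prefixes and the helper methods are hypothetical; the only behaviour modelled is that an update keyed on an id term replaces whatever is stored under that exact id, which is the role `IndexWriter.updateDocument` plays in the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** In-memory stand-in for an index keyed by a unique id term (NOT Lucene itself). */
public class SeparateIdsSketch {

    // Hypothetical id prefixes; any scheme that keeps the two forms distinct would do.
    public static final String FULL = "full:";
    public static final String BASE = "base:";

    // Maps id term -> stored document body.
    private final Map<String, String> docsById = new LinkedHashMap<>();

    /**
     * Models update-by-term semantics: anything stored under this exact id is
     * replaced, and nothing stored under any other id is touched.
     */
    public void updateDocument(String id, String doc) {
        docsById.put(id, doc);
    }

    /** Indexes the searchable 'full' form under its own id namespace. */
    public void indexFull(String mailId, String doc) {
        updateDocument(FULL + mailId, doc);
    }

    /** Indexes the marker 'base' form under a different id namespace. */
    public void indexBase(String mailId, String doc) {
        updateDocument(BASE + mailId, doc);
    }

    public String get(String id) {
        return docsById.get(id);
    }

    public int size() {
        return docsById.size();
    }

    public static void main(String[] args) {
        SeparateIdsSketch index = new SeparateIdsSketch();
        // Out-of-sequence reindex: the full mail arrives first, the base marker later.
        index.indexFull("msg-42", "full searchable mail");
        index.indexBase("msg-42", "marker with mail meta data");
        // The later base update used a different id, so the full form is still present.
        System.out.println(index.get(FULL + "msg-42"));
        System.out.println(index.get(BASE + "msg-42"));
    }
}
```

With this scheme both forms can exist for the same mail at once, so any search that should see only one form would presumably need to filter on the prefix or on a separate field; the thread does not cover that part.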