From: Li Li
To: java-user@lucene.apache.org
Date: Thu, 16 Dec 2010 11:41:42 +0800
Subject: Re: Where does Lucene recognise it has encountered a new term for the first time?
In-Reply-To: <215672.4965.qm@web25907.mail.ukl.yahoo.com>

I don't fully understand your problem, but knowing when a new term occurs is
hard, because newly added documents are written to a new segment, so a term
that is new to that segment may already exist in an older one. I think you can
only do this in the final merge at the optimization stage. You can read the
code in SegmentMerger.mergeTermInfos(), which merges all the terms of the
segments being merged. Because terms are ordered by field name and then by
term, the merge needs very little memory.

Or, if you only need to know the new terms in the current segment while the
index is being built, FreqProxTermsWriterPerField.newTerm will be called when
a term occurs for the first time.

2010/12/16 Mike Cawson:
> I’m using Lucene to index database records and text documents.
>
> I want to provide efficient fuzzy queries over the data so I’m using a secondary
> Lucene index for all of the distinct terms encountered in the primary index.
>
> Each ‘document’ in the secondary index is a term from the primary index with
> fields for its q-grams, phonetic key(s) and synonyms.
>
> It’s easy to populate the secondary index after indexing all of the records and
> text documents using an IndexReader. However, to keep the secondary index up to
> date I need to recognise when new terms are encountered for the first time, but
> even looking deep into Lucene code and stepping through the indexing process
> hasn’t revealed where this occurs – I presume because it doesn’t happen in a
> single place but rather once in the in-memory term cache, once when the cache is
> flushed into a segment, and again when segments are optimised.
>
> Is this correct? Can anyone suggest how to maintain a secondary index of terms?
> Perhaps only when the main index is optimised?
>
> Thanks, Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
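
To illustrate the remark in the reply that terms ordered by field name and then by term can be merged with very little memory, here is a minimal sketch of a k-way merge over sorted per-segment term lists. It is not Lucene code and does not reproduce SegmentMerger.mergeTermInfos(); the class and method names (TermMergeSketch, Term, Cursor, mergeDistinct) are hypothetical, and plain in-memory lists stand in for the segments' term dictionaries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: these classes are hypothetical and are not part of Lucene.
class TermMergeSketch {

    // A term identified by field name and text, ordered by field name, then text.
    static final class Term implements Comparable<Term> {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(Term other) {
            int c = field.compareTo(other.field);
            return c != 0 ? c : text.compareTo(other.text);
        }

        @Override
        public String toString() {
            return field + ":" + text;
        }
    }

    // One sorted stream of terms (standing in for one segment) plus its current head.
    private static final class Cursor implements Comparable<Cursor> {
        final Iterator<Term> rest;
        Term head;

        Cursor(Iterator<Term> it) {
            this.rest = it;
            this.head = it.next();
        }

        // Moves to the next term; returns false when the stream is exhausted.
        boolean advance() {
            if (!rest.hasNext()) {
                return false;
            }
            head = rest.next();
            return true;
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    // Merges already-sorted per-segment term lists into one sorted list of
    // distinct terms. The queue holds only one head term per segment.
    static List<Term> mergeDistinct(List<List<Term>> segments) {
        PriorityQueue<Cursor> queue = new PriorityQueue<>();
        for (List<Term> segment : segments) {
            if (!segment.isEmpty()) {
                queue.add(new Cursor(segment.iterator()));
            }
        }
        List<Term> merged = new ArrayList<>();
        Term last = null;
        while (!queue.isEmpty()) {
            Cursor smallest = queue.poll();
            Term t = smallest.head;
            if (last == null || last.compareTo(t) != 0) {
                merged.add(t); // first time this term is seen across all segments
                last = t;
            }
            if (smallest.advance()) {
                queue.add(smallest);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Term> seg1 = Arrays.asList(
                new Term("body", "apple"), new Term("body", "lucene"), new Term("title", "index"));
        List<Term> seg2 = Arrays.asList(
                new Term("body", "lucene"), new Term("body", "term"), new Term("title", "index"));
        System.out.println(mergeDistinct(Arrays.asList(seg1, seg2)));
    }
}
```

The sketch collects the result in a list only to keep the example short; a real merge would presumably stream each distinct term out as it is found, which is the point at which a secondary term index could be updated.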