Subject: RE: Simplest way to check for an exact match on an tokenized/stored field?
Date: Mon, 27 Oct 2008 13:18:59 -0400
Message-ID: <9294E20AED46934EA459020706463F94EF40A5@SUEXCL-02.ad.syr.edu>
In-Reply-To: <20178452.post@talk.nabble.com>
From: Steven A Rowe
Reply-To: general@lucene.apache.org

Hi chaiguy1337,

On 10/26/2008 at 6:09 PM, chaiguy1337 wrote:
> Hi group. I have a Lucene index that contains a bunch of text documents,
> which are both tokenized (using the standard analyzer, not
> KeywordAnalyzer) and stored. Preferably without having to create a
> duplicate KeywordAnalyzer-tokenized field, what is the simplest (and/or
> most efficient) way to check for an existing exact match on that field?
>
> Currently my best guess is to perform a TermQuery containing
> the entire text of the document to check, and then perform a
> second pass over each of the results checking the field for
> explicit equality.

The StandardAnalyzer can produce the same set of tokens for two
non-identical texts, especially if you are using stop words, so
depending on how strictly you define "exact match", you may have to
re-index.

What are you trying to do? If you're searching for duplicates, it may
make sense for you to compute a digest of some form and store that for
comparison purposes in another field.

Steve
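
P.S. A minimal sketch of the digest idea, in case it helps. It assumes
SHA-256 over the UTF-8 bytes of the raw, un-analyzed text; the class
name ContentDigest and the method sha256Hex are made up for
illustration and are not part of Lucene. Any stable digest would do.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not a Lucene class): turns a document's exact text
// into a fixed-length string that can be stored and compared cheaply.
final class ContentDigest {

    /** Returns the lowercase hex SHA-256 digest of the text's UTF-8 bytes. */
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // %02x formats a byte as two unsigned hex digits
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every conforming Java platform is required to provide SHA-256
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        String a = "The quick brown fox";
        String b = "the quick, brown fox!";
        // Identical text always yields the same digest
        System.out.println(sha256Hex(a).equals(sha256Hex(a)));
        // Texts that differ only in case or punctuation may analyze to the
        // same tokens, but their digests differ
        System.out.println(sha256Hex(a).equals(sha256Hex(b)));
    }
}
```

The intended use would be to store the digest in an untokenized field
next to each document (say a field named "digest", again just an
assumed name), and to look up a candidate by running a TermQuery for
its digest on that field. That does add a field, but a short one, and
the comparison is then a single term lookup rather than a second pass
over stored text.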