Date: Wed, 12 Mar 2008 09:49:26 -0400
From: "Erick Erickson"
To: java-user@lucene.apache.org
Subject: Re: Unique Fields

So, you're tokenizing the title field? If so, I don't understand how you
expect this to work. Would the titles "this is one order" and "is one order
this" be considered identical? Would capitalization matter? Punctuation?

Throwing all the terms of a title into a tokenized field and expecting some
magic to keep duplicates out is beyond the scope of Lucene; you'll have to
roll some customized solution.

For instance, index your title UN_TOKENIZED in a duplicate field (after
applying whatever massaging you want re: punctuation, spaces, etc.). Use
TermDocs/TermEnum on that field to detect duplicates. You won't search on
this field....

Or create a hash of the title and index *that* in a separate field and
check against the hash with TermEnum/TermDocs.

Or.....

But no, there's no magic that makes Lucene DWIM (Do What I Mean)...
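To make the "massage, then hash" idea a bit more concrete, here is a rough sketch in plain Java, with no Lucene calls in it. Everything in it is an assumption for illustration: the class name TitleDedup, the normalization rules (lower-case, punctuation becomes a space, whitespace collapsed, word order kept), and the choice of SHA-1. The resulting key is what you would index in the separate UN_TOKENIZED field and look up with TermEnum/TermDocs; the in-memory set only covers keys added since the searcher was opened.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: turns a title into a single-term duplicate key. */
final class TitleDedup {

    // Keys added since the searcher was last opened. A reader opened
    // earlier cannot see those documents, so they are tracked here.
    private final Set<String> pendingKeys = new HashSet<>();

    /** Assumed rules: lower-case, punctuation to space, collapse whitespace. */
    static String normalize(String title) {
        String s = title.toLowerCase(Locale.ROOT);
        s = s.replaceAll("[^\\p{L}\\p{N}\\s]", " "); // punctuation -> space
        return s.trim().replaceAll("\\s+", " ");     // word order is preserved
    }

    /** Hex-encoded SHA-1 of the normalized title (40 characters). */
    static String key(String title) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(normalize(title).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /**
     * Returns true if this title's key has not been seen since the searcher
     * was opened. The caller still has to check the index itself.
     */
    boolean markIfNew(String title) {
        return pendingKeys.add(key(title));
    }

    /** Call after reopening the searcher; the index now covers these keys. */
    void searcherReopened() {
        pendingKeys.clear();
    }
}
```

With these rules "This is one order!" and "this is  one order" get the same key, while "is one order this" gets a different one. If you want word order ignored too, sort the words inside normalize before hashing.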
Best
Erick

On Wed, Mar 12, 2008 at 2:01 AM, Ion Badita wrote:

> The "problem" is that my unique field is a title, many terms per field.
> I want to make an index with titles and i don't want to have duplicates.
>
> John
>
> Erick Erickson wrote:
> > You can easily find whether a term is in the index with TermEnum/TermDocs
> > (I think TermEnum is all you really need).
> >
> > Except, you'll probably also have to keep an internal map of IDs added
> > since the searcher was opened and check against that too.
> >
> > Best
> > Erick
> >
> > On Tue, Mar 11, 2008 at 11:04 AM, Ion Badita
> > <ion.badita@searchcapital.net> wrote:
> >
> >> Hi,
> >>
> >> I want to create an index with one unique field.
> >> Before inserting a document i must be sure that "unique field" is unique.
> >>
> >> John