From: "Erick Erickson"
To: java-user@lucene.apache.org
Date: Wed, 21 Feb 2007 16:59:54 -0500
Subject: Re: Positions in SpanFirst

See below..

On 2/21/07, Antony Bowesman wrote:
>
> Hi Erick,
>
> > I'm not sure you can, since all the interfaces I use alter the increment
> > between successive terms, but I'll be the first to admit that there are many
> > nooks and crannies that I don't know about... But I suspect that a negative
> > increment is not supported intentionally....
>
> I read your other interesting post about omitting termvector info and this led
> me to find Analyzer.getPositionIncrementGap. The javadocs state
>
> "Invoked before indexing a Field instance if terms have already been added to
> that field..."
>
> so I thought that sounded good, but there does not seem to be a way to set it
> and most of the Analyzers just seem to use the base Analyzer method which
> returns 0, so I'm now confused as to what this actually does in practice.

What this does is allow you to put gaps between successive sets of terms
indexed in the same field. For instance...

    doc.add(new Field("field", "some stuff", Field.Store.NO, Field.Index.TOKENIZED));
    doc.add(new Field("field", "bunch hooey", Field.Store.NO, Field.Index.TOKENIZED));
    doc.add(new Field("field", "what is this", Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);

In this case, there would be the following positions, assuming that the
IncrementGap was 1000....

    some     0
    stuff    1
    bunch 1002
    hooey 1003
    what  2004
    is    2005
    this  2006

It was a little hard to get my head around. The purpose is to be able to
add several sets of terms to a single field in a document, but keep some
sense of grouping.

> > But I really doubt you want to do this due to the consequences. Consider in
> > your example the terms would have the following offsets
> > first 0
> > bit 1
> > second 0
> > part 1
> > third 0
> > section 1
> >
> > Now think about a proximity query "first section"~1. This would produce a
> > hit because you've changed the whole sense of what offsets mean. Is this
> > really a good thing?
>
> That's a good point. The field is used to index mail recipients and currently
> has a "starts with" search (non Lucene implementation). Unless I can set the
> position increment gap, it is only ever possible to search for the first
> indexed recipient with proximity queries.

This is confusing me. You can easily use proximity queries with the above
scenario. For instance, searching for "bunch hooey"~4 would generate a hit.
As would "bunch hooey"~10000. But "some this"~10 would not generate a hit.
Whether that does what you need is another question ...

So it's time to ask "what are you really trying to do?" In other words, what
behavior are you trying to mimic from the old code? It's not clear to me what
the behavior you need is. It'd help if you gave a concrete example of the raw
data, and what you want returned...

In your first example, using the above scheme, you'd get hits (using
SpanNear rather than SpanFirst) if you searched on "first bit" in a SpanNear
query with a slop of 2.
You'd also get a hit if you searched on "second part" in a SpanNear
with a slop of 2. Does this mimic the behavior you need?

NOTE: my "first bit" with slop shorthand above would actually be constructed
by instantiating a SpanNear query with two SpanTermQuerys in the constructor....

Best
Erick

> I'm trying to ensure the Lucene implementation provides at least the original
> functionality. If I can't achieve it I can just document the limitation. If I
> can, I may get false hits, but I still have the choice to filter the hits and
> weed out the false ones before being given to the client. It's not a
> showstopper, it would be good if it could be done.
>
> Thanks
> Antony
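
The position arithmetic in the reply above can be checked with a small plain-Java sketch. It does not call Lucene; it only mimics the bookkeeping described in the message. The class and method names (PositionGapSketch, positions, withinSlop) are made up for illustration, terms are assumed to be unique and whitespace-separated, and the slop check is a simplified in-order test for two terms, not Lucene's full sloppy-phrase matching. In Lucene itself the gap comes from Analyzer.getPositionIncrementGap(String fieldName), which an Analyzer subclass can override to return something other than the default 0.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: simulates term positions when
// several values are added to the same field with a position increment gap.
class PositionGapSketch {

    // Assigns a position to every whitespace-separated term. The gap is added
    // before each value once at least one term has already been indexed.
    // Assumes each term occurs only once (later duplicates would overwrite).
    static Map<String, Integer> positions(List<String> values, int gap) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int position = 0;
        boolean termsAdded = false;
        for (String value : values) {
            if (termsAdded) {
                position += gap;
            }
            for (String term : value.trim().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                result.put(term, position++);
                termsAdded = true;
            }
        }
        return result;
    }

    // Simplified proximity test: true when 'second' follows 'first' with at
    // most 'slop' positions between them. Out-of-order matches are ignored.
    static boolean withinSlop(Map<String, Integer> pos, String first, String second, int slop) {
        Integer a = pos.get(first);
        Integer b = pos.get(second);
        if (a == null || b == null) {
            return false;
        }
        int distance = b - a - 1;
        return distance >= 0 && distance <= slop;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("some stuff", "bunch hooey", "what is this");
        Map<String, Integer> pos = positions(values, 1000);
        // prints {some=0, stuff=1, bunch=1002, hooey=1003, what=2004, is=2005, this=2006}
        System.out.println(pos);
        System.out.println(withinSlop(pos, "bunch", "hooey", 4));  // true
        System.out.println(withinSlop(pos, "some", "this", 10));   // false
    }
}
```

With a gap of 0 the same helper puts "stuff" at 1 and "bunch" at 2, so a query spanning two separate values would match with no slop at all. A large gap is what keeps proximity matches inside a single value.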