Date: Wed, 13 Feb 2008 16:35:14 +0200
From: "Doron Cohen"
To: general@lucene.apache.org
Subject: Re: multiple instances of fields or attributes
In-Reply-To: <47B1FCAA.1010704@ice-sa.com>

See below...

On Tue, Feb 12, 2008 at 10:08 PM, André Warnier wrote:

> Doron Cohen wrote:
> > On Thu, Feb 7, 2008 at 6:03 PM, André Warnier wrote:
> >
> >> ...
> >> Does anyone have an example of how this works ?
> >> (or an explanation in plain French-speaker-friendly tutorial-like
> >> English ?)
> >
> > Do you mean "how to make it work for you" or "how does it work inside"?
> > The first option is easier to explain (though I know no French :))
> > When you create an IndexWriter you provide it an Analyzer.
> > That analyzer is used when a document is added to the index.
> > The analyzer.getPositionIncrementGap() specifies the position
> > gap between separate additions of the same field. By default it
> > returns 0 (which is not working well in your example).
> > To modify this you can override this method in "your" analyzer to
> > return a nonzero gap, for example 5. This is easy when subclassing
> > any existing analyzer.
> >
> > Doron
>
> Now I may be starting to get it (although we French-speaking guys are
> slow (but thorough)). Do you mean the following (add question mark at
> end) :
> - imagine that I would create a field "descriptors" for each of my
> documents
> - prior to adding a "phrase" to the "descriptors" field, I pass it
> through an Analyser, the Analyser breaks it down into words, and notes
> for each word the position in the phrase...

This is true. Just note that (1) "passing-through-the-analyzer" is usually
done for you by the IndexWriter, and (2) you are adding text (rather than
a phrase), and that text - depending on the field properties - is analyzed
into tokens.

> - then the Analyser feeds it into the index, where the individual words
> are stored, together with their relative position in the "phrase"...
> - so that, for instance (ignoring any stripping of stopwords), the
> phrase "the white cat jumped over the sleeping dog" is now stored in
> the "descriptors" index as "1:the 2:white 3:cat 4:jumped 5:over 6:the
> 7:sleeping 8:dog", the "n:" prefixes (so to speak) being the positions
> in the phrase/field..

Yes, though usually starting in position 0.

> - so that, if I later search for "white cat"~1 in "descriptors", it will
> find this document, because the "distance" between "white" and "cat" is
> 1 (or 0, depending how one counts) ..

Yes, though the default slop is 0, so "white jumped" would not match but
"white jumped"~1 will match.

> - now, if I (forcefully) specify a "PositionIncrementGap" of 10 to my
> Analyser, then for the second addition to the same "descriptors" field,
> it will start the numbering at 19 (?).

Yes

> - thus if for instance the second instance of "descriptors" is the
> phrase "the cow bit the cat", this will be indexed as "19:the 20:cow
> 21:bit 22:the 23:cat".
> - and when searching for "dog cow"~5, it would not find this document,
> because the gap between "8:dog" and "20:cow" is greater than 5 ?
>
> Is it something like that, or have I not got it at all ?

Yes it is.

> To generalise my question, what I would like to know is this : assuming
> I have two "descriptors" for the same document : "Electrical and
> Electronic Engineering" and "Engineering Studies".
> Is there a way to index this document (among others), and to later do a
> search which will find the documents which have a "descriptors"
> containing both "Electronic" and "Studies" in the same instance of
> "descriptors", thus not finding this one ?

Yes, you can do this by specifying a gap larger than any slop you intend
to search with, and then using either a sloppy phrase query (as above) or
span-near queries.

Luke is a tool that lets you search and inspect a Lucene index.
I think you will find it useful.

- Doron
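
The position numbering discussed in this thread can be reproduced with a
small standalone model. The Python sketch below does not use Lucene at all:
assign_positions and matches_in_order are hypothetical helper names, the
tokenizer is a plain lowercase whitespace split with no stop-word removal,
and the slop check only covers two terms appearing in query order. It is an
approximation for illustrating the arithmetic, not how Lucene scores sloppy
phrases internally.

```python
def assign_positions(values, gap):
    """Model token positions for several values added to the same field.

    Assumptions (not Lucene code): whitespace tokenization, positions
    start at 0, and `gap` is added before every value after the first.
    """
    postings = []  # list of (position, token)
    next_pos = 0
    for index, text in enumerate(values):
        if index > 0:
            next_pos += gap  # the position increment gap between values
        for token in text.lower().split():
            postings.append((next_pos, token))
            next_pos += 1
    return postings


def matches_in_order(postings, first, second, slop):
    """Simplified sloppy match: `second` follows `first` with at most
    `slop` positions in between."""
    first_positions = [p for p, t in postings if t == first]
    second_positions = [p for p, t in postings if t == second]
    return any(0 < b - a <= slop + 1
               for a in first_positions
               for b in second_positions)


if __name__ == "__main__":
    descriptors = ["Electrical and Electronic Engineering",
                   "Engineering Studies"]
    for gap in (0, 100):
        postings = assign_positions(descriptors, gap)
        hit = matches_in_order(postings, "electronic", "studies", 5)
        print(f"gap={gap}: 'electronic studies'~5 matches: {hit}")
```

In this model, with a gap of 10 the second value of the cat-and-dog example
starts at position 18, which is 19 in the one-based numbering used in the
quoted text, so "dog" and "cow" end up 12 positions apart and "dog cow"~5
does not match. With the default gap of 0 the same query would match across
the two values, which is the unwanted behaviour described in the question.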