Mailing-List: contact general-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: general@lucene.apache.org
Message-ID: <e05f0fd10802130635g5df94c1bibfc7fc8007c19053@mail.gmail.com>
Date: Wed, 13 Feb 2008 16:35:14 +0200
From: "Doron Cohen" <cdoronc@gmail.com>
To: general@lucene.apache.org
Subject: Re: multiple instances of fields or attributes
In-Reply-To: <47B1FCAA.1010704@ice-sa.com>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_14308_26373606.1202913314875"
References: <47A62EBD.9000704@ice-sa.com>
	 <43B50191-B6E4-48E8-A721-A6C56DEF3DD3@ehatchersolutions.com>
	 <47A76926.9040307@ice-sa.com>
	 <802011C0-E192-409C-9AD3-6FAC82F6909D@ehatchersolutions.com>
	 <47AB2BEE.6020200@ice-sa.com>
	 <e05f0fd10802121101y5e4244ffk8f1f8e2475ace539@mail.gmail.com>
	 <47B1FCAA.1010704@ice-sa.com>

------=_Part_14308_26373606.1202913314875
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

See below...

On Tue, Feb 12, 2008 at 10:08 PM, André Warnier <aw@ice-sa.com> wrote:

>
>
> Doron Cohen wrote:
> > On Thu, Feb 7, 2008 at 6:03 PM, André Warnier <aw@ice-sa.com> wrote:
> >
> >> ...
> >> Does anyone have an example of how this works ?
> >> (or an explanation in plain French-speaker-friendly tutorial-like
> >> English ?)
> >>
> >
> > Do you mean "how to make it work for you" or "how does it work inside"?
> > The first option is easier to explain (though I know no French :))
> > When you create an IndexWriter you provide it an Analyzer.
> > That analyzer is used when a document is added to the index.
> > The analyzer.getPositionIncrementGap() specifies the position
> > gap between separate additions of same field. By default it
> > returns 0 (which is not working well in your example). To modify this
> > you can override this method in "your" analyzer to return a nonzero gap,
> > for example 5. This is easy when subclassing any existing analyzer.
> >
> > Doron
> >
>
> Now I may be starting to get it (although we French-speaking guys are
> slow (but thorough)).  Do you mean the following (add question mark at
> end) :
> - imagine that I would create a field "descriptors" for each of my
> documents
> - prior to adding a "phrase" to the "descriptors" field, I pass it
> through an Analyser, the Analyser breaks it down into words, and notes
> for each word the position in the phrase...


This is true. Just note that (1) passing text through the analyzer is
usually done for you by the IndexWriter, and (2) you are adding text
(rather than a phrase), and that text, depending on the field properties,
is analyzed into tokens.

> - then the Analyser feeds it into the index, where the individual words
> are stored, together with their relative position in the "phrase"...
> - so that, for instance (ignoring any stripping of stopwords), the
> phrase "the white cat jumped over the sleeping dog"  is now stored in
> the "descriptors" index as "1:the 2:white 3:cat 4:jumped 5:over 6:the
> 7:sleeping 8:dog", the "n:" prefixes (so to speak) being the positions
> in the phrase/field..


Yes, though positions usually start at 0.

> - so that, if I later search for "white cat"~1 in "descriptors", it will
> find this document, because the "distance" between "white" and "cat" is
> 1 (or 0, depending how one counts) ..


Yes, though the default slop is 0, so "white jumped" would not match,
but "white jumped"~1 would.

> - now, if I (forcefully) specify a "PositionIncrementGap" of 10 to my
> Analyser, then for the second addition to the same "descriptors" field,
> it will start the numbering at 19 (?).


Yes

> - thus if for instance the second instance of "descriptors" is the
> phrase "the cow bit the cat", this will be indexed as "19:the 20:cow
> 21:bit 22:the 23:cat".
> - and when searching for "dog cow"~5, it would not find this document,
> because the gap between "8:dog" and "20:cow" is greater than 5 ?
>
> Is it something like that, or have I not got it at all ?


Yes it is.
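
The numbering in your example can be reproduced with a small toy model.
This is only a sketch in plain Java, not Lucene code: the class
PositionGapDemo and its method number() are made-up names, tokens are
split on whitespace only, and positions are counted from 1 to match
your example (Lucene itself usually starts at 0).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of position numbering across repeated additions of one field. */
class PositionGapDemo {

    /**
     * Numbers the whitespace-separated tokens of each value, starting at 1,
     * and skips `gap` extra positions between consecutive values.
     * The gap plays the role of analyzer.getPositionIncrementGap().
     */
    static List<String> number(int gap, String... values) {
        List<String> out = new ArrayList<>();
        int next = 1; // position the next token will receive
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                next += gap; // applied only between separate additions
            }
            for (String token : values[i].trim().split("\\s+")) {
                out.add(next + ":" + token);
                next++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(number(10,
                "the white cat jumped over the sleeping dog",
                "the cow bit the cat"));
    }
}
```

With a gap of 10 the first value ends at 8:dog and the second value
starts at 19:the, as in your description.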

> To generalise my question, what I would like to know is this : assuming
> I have two "descriptors" for the same document : "Electrical and
> Electronic Engineering" and "Engineering Studies".
> Is there a way to index this document (among others), and to later do a
> search which will find the documents which have a "descriptors"
> containing both "Electronic" and "Studies" in the same instance of
> "descriptors", thus not finding this one ?


Yes, you can do this by specifying a large enough gap and then searching
with either a sloppy phrase query (as above) or a span-near query, keeping
the slop smaller than the gap.
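
To see why the size of the gap matters, here is another plain-Java
sketch (again not Lucene code; SlopDemo and matchesInOrder() are
hypothetical names). It assumes a simplified model of a two-term sloppy
phrase: the terms must appear in query order, and the slop must cover
the number of positions between them. Real Lucene matching also handles
reordered terms and longer phrases, which this ignores.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified two-term, in-order model of a sloppy phrase match. */
class SlopDemo {

    /**
     * Returns true if `first` is followed by `second` with at most `slop`
     * positions in between, when `values` are indexed into one field with
     * `gap` extra positions between consecutive values (0-based positions).
     */
    static boolean matchesInOrder(String first, String second, int slop,
                                  int gap, String... values) {
        List<Integer> firstPos = new ArrayList<>();
        List<Integer> secondPos = new ArrayList<>();
        int next = 0;
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                next += gap;
            }
            for (String token : values[i].trim().split("\\s+")) {
                if (token.equalsIgnoreCase(first)) firstPos.add(next);
                if (token.equalsIgnoreCase(second)) secondPos.add(next);
                next++;
            }
        }
        for (int a : firstPos) {
            for (int b : secondPos) {
                if (b > a && b - a - 1 <= slop) return true;
            }
        }
        return false;
    }
}
```

In this model, with your two descriptors and a gap of 0,
matchesInOrder("Electronic", "Studies", 5, 0, ...) is true, which is the
unwanted cross-instance match. With a gap of 100 the same call is false,
while "Engineering Studies" still matches with a slop of 0 inside the
second instance.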

Luke is a tool that lets you search and inspect a Lucene index.
I think you will find it useful.

- Doron

------=_Part_14308_26373606.1202913314875--