lucene-general mailing list archives

From "Doron Cohen" <cdor...@gmail.com>
Subject Re: multiple instances of fields or attributes
Date Wed, 13 Feb 2008 14:35:14 GMT
See below...

On Tue, Feb 12, 2008 at 10:08 PM, André Warnier <aw@ice-sa.com> wrote:

>
>
> Doron Cohen wrote:
> > On Thu, Feb 7, 2008 at 6:03 PM, André Warnier <aw@ice-sa.com> wrote:
> >
> >> ...
> >> Does anyone have an example of how this works ?
> >> (or an explanation in plain French-speaker-friendly tutorial-like English ?)
> >>
> >
> > Do you mean "how to make it work for you" or "how does it work inside"?
> > The first option is easier to explain (though I know no French :))
> > When you create an IndexWriter you provide it with an Analyzer.
> > That analyzer is used when a document is added to the index.
> > The analyzer.getPositionIncrementGap() specifies the position
> > gap between separate additions of the same field. By default it
> > returns 0 (which does not work well in your example). To modify this
> > you can override this method in "your" analyzer to return a nonzero gap,
> > for example 5. This is easy when subclassing any existing analyzer.
> >
> > Doron
> >
>
> Now I may be starting to get it (although we French-speaking guys are
> slow (but thorough)).  Do you mean the following (add question mark at
> end) :
> - imagine that I would create a field "descriptors" for each of my documents
> - prior to adding a "phrase" to the "descriptors" field, I pass it
> through an Analyser, the Analyser breaks it down into words, and notes
> for each word the position in the phrase...


This is true. Just note that (1) "passing-through-the-analyzer" is usually
done for you by the IndexWriter, and (2) you are adding text (rather than a
phrase), and that text - depending on the field properties - is analyzed
into tokens.

> - then the Analyser feeds it into the index, where the individual words
> are stored, together with their relative position in the "phrase"...
> - so that, for instance (ignoring any stripping of stopwords), the
> phrase "the white cat jumped over the sleeping dog"  is now stored in
> the "descriptors" index as "1:the 2:white 3:cat 4:jumped 5:over 6:the
> 7:sleeping 8:dog", the "n:" prefixes (so to speak) being the positions
> in the phrase/field..


Yes, though positions usually start at 0.

> - so that, if I later search for "white cat"~1 in "descriptors", it will
> find this document, because the "distance" between "white" and "cat" is
> 1 (or 0, depending how one counts) ..


Yes, though the default slop is 0, so "white jumped" would not match
but "white jumped"~1 will match.
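
To make the counting concrete, here is a small plain-Java sketch. It is only
a toy model of the slop arithmetic, not Lucene's matching code: it assumes
whitespace tokens, a two-term phrase whose terms appear in query order, and
the class and method names (SlopSketch, minSlop) are made up for this mail.

```java
import java.util.Arrays;
import java.util.List;

class SlopSketch {
    // Smallest slop that lets the in-order phrase "first second" match:
    // the number of token positions lying between the two terms.
    // Returns -1 if "second" never occurs after "first".
    static int minSlop(List<String> tokens, String first, String second) {
        int best = -1;
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) continue;
            for (int j = i + 1; j < tokens.size(); j++) {
                if (tokens.get(j).equals(second)) {
                    int slop = j - i - 1;
                    if (best < 0 || slop < best) best = slop;
                    break; // the nearest following occurrence is the cheapest for this i
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "the", "white", "cat", "jumped", "over", "the", "sleeping", "dog");
        System.out.println(minSlop(tokens, "white", "cat"));    // prints 0
        System.out.println(minSlop(tokens, "white", "jumped")); // prints 1
    }
}
```

So "white cat" matches with the default slop of 0, while "white jumped" has
one token in between and needs at least ~1.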

> - now, if I (forcefully) specify a "PositionIncrementGap" of 10 to my
> Analyser, then for the second addition to the same "descriptors" field,
> it will start the numbering at 19 (?).


Yes

> - thus if for instance the second instance of "descriptors" is the
> phrase "the cow bit the cat", this will be indexed as "19:the 20:cow
> 21:bit 22:the 23:cat".
> - and when searching for "dog cow"~5, it would not find this document,
> because the gap between "8:dog" and "20:cow" is greater than 5 ?
>
> Is it something like that, or have I not got it at all ?


Yes it is.
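
The numbering in your example can be reproduced with a few lines of plain
Java. This is a simplified model of how positions are assigned, not the
actual IndexWriter code: it assumes whitespace tokenization, and the names
PositionGapSketch and assignPositions are hypothetical. The firstPosition
argument is 1 only to match your numbering.

```java
import java.util.ArrayList;
import java.util.List;

class PositionGapSketch {
    // Assigns a position to every token of every value of one field.
    // firstPosition: position given to the very first token.
    // gap: extra positions skipped between two values of the same field
    //      (the role played by the analyzer's position increment gap).
    static List<String> assignPositions(String[] values, int firstPosition, int gap) {
        List<String> out = new ArrayList<>();
        int next = firstPosition;
        for (int v = 0; v < values.length; v++) {
            if (v > 0) next += gap; // skip the gap before each later value
            for (String token : values[v].split("\\s+")) {
                out.add(next + ":" + token);
                next++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] values = {
            "the white cat jumped over the sleeping dog",
            "the cow bit the cat"
        };
        List<String> positions = assignPositions(values, 1, 10);
        System.out.println(positions.get(7)); // prints 8:dog
        System.out.println(positions.get(8)); // prints 19:the
        System.out.println(positions.get(9)); // prints 20:cow
    }
}
```

"dog" and "cow" end up 12 positions apart, so a slop of 5 cannot bridge them.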

> To generalise my question, what I would like to know is this : assuming
> I have two "descriptors" for the same document : "Electrical and
> Electronic Engineering" and "Engineering Studies".
> Is there a way to index this document (among others), and to later do a
> search which will find the documents which have a "descriptors"
> containing both "Electronic" and "Studies" in the same instance of
> "descriptors", thus not finding this one ?


Yes, you can do this by specifying a large enough gap and then using either
a sloppy phrase query (as above) or span-near queries, as long as the slop
stays smaller than the gap.
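
Why a large gap works can be seen in a rough sketch of an unordered
proximity check over your two descriptors. Again this is only a toy model in
plain Java, not how Lucene evaluates span-near or phrase queries; the gap of
100, the slop of 10 and all names (SameInstanceSketch, positions, near) are
assumptions picked for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SameInstanceSketch {
    // Positions (starting at 0) of `term` across all values of a field,
    // skipping `gap` extra positions between consecutive values.
    static List<Integer> positions(String[] values, String term, int gap) {
        List<Integer> found = new ArrayList<>();
        String wanted = term.toLowerCase();
        int next = 0;
        for (int v = 0; v < values.length; v++) {
            if (v > 0) next += gap;
            for (String token : values[v].toLowerCase().split("\\s+")) {
                if (token.equals(wanted)) found.add(next);
                next++;
            }
        }
        return found;
    }

    // True if some occurrence of a and some occurrence of b have at most
    // `slop` positions between them, in either order.
    static boolean near(String[] values, String a, String b, int slop, int gap) {
        for (int pa : positions(values, a, gap)) {
            for (int pb : positions(values, b, gap)) {
                if (pa != pb && Math.abs(pa - pb) - 1 <= slop) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] descriptors = {
            "Electrical and Electronic Engineering",
            "Engineering Studies"
        };
        // Gap 0: terms from different descriptors look close together.
        System.out.println(near(descriptors, "Electronic", "Studies", 10, 0));    // prints true
        // Gap 100 (larger than the slop): only same-descriptor pairs match.
        System.out.println(near(descriptors, "Electronic", "Studies", 10, 100));  // prints false
        System.out.println(near(descriptors, "Engineering", "Studies", 10, 100)); // prints true
    }
}
```

With the gap larger than any slop you intend to query with, a proximity
match can only come from terms inside the same instance of the field.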

Luke is a tool that lets you search and inspect a Lucene index.
I think you will find it useful.

- Doron
