lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Multiple Keywords/Keyphrases fields
Date Wed, 16 Feb 2005 08:23:31 GMT
On Wednesday 16 February 2005 06:49, Owen Densmore wrote:
> > From: Erik Hatcher <erik@ehatchersolutions.com>
> > Date: February 12, 2005 3:09:15 PM MST
> > To: "Lucene Users List" <lucene-user@jakarta.apache.org>
> > Subject: Re: Multiple Keywords/Keyphrases fields
> >
> >
> > The real question to answer is what types of queries you're planning 
> > on making.  Rather than look at it from indexing forward, consider it 
> > from searching backwards.
> >
> > How will users query using those keyword phrases?
> 
> Hi Erik.  Good point.
> 
> There are two uses we are making of the keyphrases:
> 
> 	- Graphical Navigation: A Flash graphical browser will allow users to 
> fly around in a space of documents, choosing what to be viewing: 
> Authors, Keyphrases and Textual terms.  In any of these cases, the 
> "closeness" of any of the fields will govern how close they will appear 
> graphically.  In the case of authors, we will weight collaboration .. 
> how often the authors work together.  In the case of Keyphrases, we 
> will want to use something like distance vectors like you show in the 
> book using the cosine measure.  Thus the keyphrases need to be separate 
> entities within the document .. it would be a bug for us if the terms 
> leaked across the separate kephrases within the document.
> 
> 	- Textual Search: In this case, we will have two ways to search the 
> keyphrases.  The first would be like the graphical navigation above 
> where searching for "complex system" should require the terms to be in 
> a single keyphrase.  The second way will be looser, where we may simply 
> pool the keyphrases with titles and abstract, and allow them all to be 
> searched together within the document.
> 
> Does this make sense?  So the question from the search standpoint is: 
> do multiple instances of a field act like there are barriers across the 
> instances, or are they somehow treated as a single instance somehow.  

Multiple field instances with the same name in a document are concatenated in
the index in the order in which they where added to the document.
For each instance of a field in the document, even when it has the same name, 
the analyzer is asked to provide a new tokenstream. 

This happens in org.apache.lucene.index.DocumentWriter.invertDocument(),
The last position offset in the field as indexed is maintained for this
purpose.

> In terms of the closeness calculation, for example, can we get separate 
> term vectors for each instance of the keyphrase field, or will we get a 
> single vector combining all the keyphrase terms within a single 
> document?

The positions in the TermVectors are treated in the same way.

To put a barrier between field instances with the same name
one can put a gap in the indexed term positions. This gap needs a larger
query proximity to match. AND like queries will match in the indexed field.

A gap is implemented by providing the a tokenstream from the analyzer
that has a position increment that equals the gap for the first token in the
stream.
For the first field instance with same name the gap is not needed.

Regards,
Paul Elschot

> 
> I hope this is clear!  Kinda hard to articulate.
> 
> Owen
> 
> > 	Erik
> >
> > On Feb 12, 2005, at 3:08 PM, Owen Densmore wrote:
> >
> >> I'm getting a bit more serious about the final form of our lucene 
> >> index.  Each document has DocNumber, Authors, Title, Abstract, and 
> >> Keywords.  By Keywords, I mean a comma separated list, each entry 
> >> having possibly many terms in a phrase like:
> >> 	temporal infomax, finite state automata, Markov chains,
> >> 	conditional entropy, neural information processing
> >>
> >> I presume I should be using a field "Keywords" which have many 
> >> "entries" or "instances" per document (one per comma separated 
> >> phrase).  But I'm not sure the right way to handle all this.  My 
> >> assumption is that I should analyze them individually, just as we do 
> >> for free text (the Abstract, for example), thus in the example above 
> >> having 5 entries of the nature
> >> 	doc.add(Field.Text("Keywords", "finite state automata"));
> >> etc, analyzing them because these are author-supplied strings with no 
> >> canonical form.
> >>
> >> For guidance, I looked in the archive and found the attached email, 
> >> but I didn't see the answer.  (I'm not concerned about the dups, I 
> >> presume that is equivalent to a boos of some sort) Does this seem 
> >> right?
> >>
> >> Thanks once again.
> >>
> >> Owen
> >>
> >>> From: lucene@nitwit.de <lucene@nitwit.de>
> >>> Subject: Multiple equal Fields?
> >>> Date: Tue, 17 Feb 2004 12:47:58 +0100
> >>>
> >>> Hi!
> >>> What happens if I do this:
> >>>
> >>> doc.add(Field.Text("foo", "bar"));
> >>> doc.add(Field.Text("foo", "blah"));
> >>>
> >>> Is there a field "foo" with value "blah" or are there two "foo"s 
> >>> (actually not
> >>> possible) or is there one "foo" with the values "bar" and "blah"?
> >>>
> >>> And what does happen in this case:
> >>>
> >>> doc.add(Field.Text("foo", "bar"));
> >>> doc.add(Field.Text("foo", "bar"));
> >>> doc.add(Field.Text("foo", "bar"));
> >>>
> >>> Does lucene store this only once?
> >>>
> >>> Timo
> >
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message