lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject RE: n-gram indexing
Date Mon, 18 Jul 2005 22:11:24 GMT

Your intuition is right, but i can't think of any reason why you need to
add the n-grams at indexing time -- or why using phrase queries would be a
bad thing in this case.  When you get a multiword query, construct the
n-grams of the query words as multiple phrase queries and search for a
BooleanQuery of all those phrases (with bigger boosts given to longer
phrases)

Using your example of "united states of america" generate a query that
looks something like...

     united states america
     "united states"^10 "states america"^10
     "united states america"^100

: Intution behind adding n-grams is to boost naturally occurring larger
: phrases versus using phrase queries. For example, if I am searching for
: "united states of america", I want the search results to return the
: documents ordered as follows
:
: Rank 1 - Documents containing all the words occurring together
: Rank 2 - Documents containing maximum number of words in the same
: sentence
: Rank 3 - Documents containing all the words but some might appear in the
: same sentence some may not
: Rank 4 - Documents containig atleast one or two words
:
: If we have a n-gram index, most probably document talking about "united
: states" gets preference over document containing "united" and "states"
: seperately. If I am correct, this can be achieved without using phrase
: queries. I am not sure if there is a better way to achieve the same
: effect.
:
: Thanks,
:
: Rajesh
:
:
: -----Original Message-----
: From: Andy Roberts [mailto:mail@andy-roberts.net]
: Sent: Monday, July 18, 2005 5:56 PM
: To: java-user@lucene.apache.org
: Subject: Re: n-gram indexing
:
: On Monday 18 Jul 2005 21:27, Rajesh Munavalli wrote:
: > At what point do I add n-grams? Does the order in which I add n-grams
: > affect exact phrase queries later? My questions are
: >
: > (1) Should I add all the 1-grams followed by 2-grams followed by
: > 3-grams..etc sentence by sentence OR
: >
: > (2) Add all the 1 grams of entire document first before starting
: > 2-grams for the entire document?
: >
: > What is the general accepted notion of adding n-grams of a document?
: >
: > thanks,
: >
: > Rajesh
:
: I can't see any real advantage of storing n-grams explicitly. Just index
: the document and use phrase queries. Order is significant with phrase
: queries if I recall correctly, although you can use SpanNearQueries to
: look for unordered ngrams, although I don't know why you would want to!
:
: Perhaps if you explain a little more about what you are trying to
: achieve more generally, we can confirm that you don't need to mess with
: explicit indexing of indexing.
:
: Andy
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: For additional commands, e-mail: java-user-help@lucene.apache.org
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: For additional commands, e-mail: java-user-help@lucene.apache.org
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message