lucene-java-user mailing list archives

From "Rajesh Munavalli" <raje...@dessci.com>
Subject RE: n-gram indexing
Date Tue, 19 Jul 2005 14:21:06 GMT
Let me explain a scenario where I would need to add the n-grams at
indexing time. 

Consider two documents:

Document 1:
"united states is .... United airlines operates in 50 states. United
states government....."

Document 2:
"united states is .... United airlines operates in 50 states. United
some other word states"

If you consider the tf-idf weights of the individual terms "united" and
"states", they would have exactly the same score in both documents. But
the phrase "united states" should carry a higher weight in Document 1,
where it occurs twice as a unit. That higher weight can be achieved
through bi-grams, which would include the term "united states". My guess
is that Lucene will retrieve both documents with equal scores.
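
To make the scenario concrete, here is a minimal Java sketch that counts
the unigrams and the bigram "united states" in the two documents above.
The tokenizer (lowercase, split on anything that is not a letter or
digit) is an assumption for illustration only and is not Lucene's
analyzer; the class and method names are hypothetical. The sketch only
counts raw term frequencies and does not compute Lucene's actual score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class BigramCountSketch {

    // The two example documents from the message, joined onto one line each.
    static final String DOC1 =
        "united states is .... United airlines operates in 50 states. "
        + "United states government.....";
    static final String DOC2 =
        "united states is .... United airlines operates in 50 states. "
        + "United some other word states";

    // Naive tokenizer (an assumption, not Lucene's analysis chain):
    // lowercase, then split on runs of non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Count how often the word sequence `gram` appears as consecutive tokens.
    static int countNgram(List<String> tokens, List<String> gram) {
        int count = 0;
        for (int i = 0; i + gram.size() <= tokens.size(); i++) {
            if (tokens.subList(i, i + gram.size()).equals(gram)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<String>> grams = Arrays.asList(
            Arrays.asList("united"),
            Arrays.asList("states"),
            Arrays.asList("united", "states"));
        for (List<String> gram : grams) {
            System.out.println(String.join(" ", gram)
                + ": doc1=" + countNgram(tokenize(DOC1), gram)
                + " doc2=" + countNgram(tokenize(DOC2), gram));
        }
    }
}
```

With this tokenizer, "united" and "states" each occur three times in
both documents, while the bigram "united states" occurs twice in
Document 1 and once in Document 2.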

Thanks,

Rajesh



-----Original Message-----
From: hossman@hal.rescomp.berkeley.edu
[mailto:hossman@hal.rescomp.berkeley.edu] On Behalf Of Chris Hostetter
Sent: Monday, July 18, 2005 5:11 PM
To: java-user@lucene.apache.org
Subject: RE: n-gram indexing


Your intuition is right, but I can't think of any reason why you need to
add the n-grams at indexing time -- or why using phrase queries would be
a bad thing in this case.  When you get a multiword query, construct the
n-grams of the query words as multiple phrase queries and search for a
BooleanQuery of all those phrases (with bigger boosts given to longer
phrases).

Using your example of "united states of america", generate a query that
looks something like...

     united states america
     "united states"^10 "states america"^10
     "united states america"^100

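The expansion above can be sketched in plain Java as a query-string
builder. This is only an illustration under stated assumptions: the
class and method names are hypothetical, stop words such as "of" are
assumed to have been removed already (as in the example), the terms are
assumed to contain no quotes or other query-syntax characters, and the
boost of 10 per extra word is inferred from the example rather than
being a recommended value. The output uses the term and "phrase"^boost
forms of Lucene's query parser syntax; building the BooleanQuery
programmatically would be the alternative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NgramQuerySketch {

    // Emit the bare terms first, then every contiguous n-gram (n >= 2)
    // as a quoted phrase whose boost grows tenfold with each extra word.
    static String buildQuery(List<String> terms) {
        List<String> clauses = new ArrayList<>(terms);
        long boost = 10; // boost for 2-word phrases, per the example
        for (int n = 2; n <= terms.size(); n++) {
            for (int i = 0; i + n <= terms.size(); i++) {
                String phrase = String.join(" ", terms.subList(i, i + n));
                clauses.add("\"" + phrase + "\"^" + boost);
            }
            boost *= 10; // longer phrases get a bigger boost
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        // "of" is treated as a stop word and dropped before expansion.
        System.out.println(
            buildQuery(Arrays.asList("united", "states", "america")));
    }
}
```

For the three terms above this yields the same clauses as the example
query, joined on a single line.
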
: The intuition behind adding n-grams is to boost naturally occurring
: larger phrases, versus using phrase queries. For example, if I am
: searching for "united states of america", I want the search results to
: return the documents ordered as follows:
:
: Rank 1 - Documents containing all the words occurring together
: Rank 2 - Documents containing the maximum number of words in the same
: sentence
: Rank 3 - Documents containing all the words, where some may appear in
: the same sentence and some may not
: Rank 4 - Documents containing at least one or two of the words
:
: If we have an n-gram index, a document talking about "united states"
: will most probably get preference over a document containing "united"
: and "states" separately. If I am correct, this can be achieved without
: using phrase queries. I am not sure if there is a better way to achieve
: the same effect.
:
: Thanks,
:
: Rajesh
:
:
: -----Original Message-----
: From: Andy Roberts [mailto:mail@andy-roberts.net]
: Sent: Monday, July 18, 2005 5:56 PM
: To: java-user@lucene.apache.org
: Subject: Re: n-gram indexing
:
: On Monday 18 Jul 2005 21:27, Rajesh Munavalli wrote:
: > At what point do I add n-grams? Does the order in which I add n-grams
: > affect exact phrase queries later? My questions are
: >
: > (1) Should I add all the 1-grams followed by 2-grams followed by
: > 3-grams, etc., sentence by sentence, OR
: >
: > (2) Add all the 1-grams of the entire document first before starting
: > the 2-grams for the entire document?
: >
: > What is the generally accepted way of adding the n-grams of a
: > document?
: >
: > thanks,
: >
: > Rajesh
:
: I can't see any real advantage of storing n-grams explicitly. Just
: index the document and use phrase queries. Order is significant with
: phrase queries, if I recall correctly; you can use SpanNearQueries to
: look for unordered n-grams, though I don't know why you would want to!
:
: Perhaps if you explain a little more about what you are trying to
: achieve more generally, we can confirm that you don't need to mess with
: explicit indexing of n-grams.
:
: Andy
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: For additional commands, e-mail: java-user-help@lucene.apache.org
:
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


