lucene-java-user mailing list archives

From "Rajesh Munavalli" <raje...@dessci.com>
Subject RE: n-gram indexing
Date Mon, 01 Aug 2005 16:15:02 GMT
Hi Chris,
         The method you suggested is definitely a good solution. However, there is one more
reason I would like to generate n-grams at indexing time. The examples below are a text
search equivalent of what I am trying to do for a different kind of data, but they
should convey the idea.

Example 1:
Phrase query:  "united ? of america"
Aim: "?" should match a single word.
Possible Outcome: "?" should match "states" in the phrase "united states of america".

Example 2:
Phrase query: "united * america"
Aim: "*" should match multiple words.
Possible Outcome: "*" can match one or more words in the phrase "united states of america".

         If I am correct, Lucene provides wildcard searches only within individual indexed
terms. I want to extend this capability to the phrase level.
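
To make the intended semantics of Examples 1 and 2 concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: it only checks whether a word-level pattern occurs in a piece of text, with "?" standing for exactly one word and "*" for one or more words. The class and method names (WildcardPhraseSketch, contains, matchesAt, tokenize) are made up for illustration, and lowercasing plus whitespace splitting is an assumed tokenization.

```java
import java.util.Locale;

class WildcardPhraseSketch {

    // Assumed tokenization: lowercase, then split on whitespace.
    static String[] tokenize(String s) {
        return s.toLowerCase(Locale.ROOT).trim().split("\\s+");
    }

    // True if pattern words pat[p..] match the text words starting exactly at doc[d].
    static boolean matchesAt(String[] pat, int p, String[] doc, int d) {
        if (p == pat.length) {
            return true;   // whole pattern consumed
        }
        if (d == doc.length) {
            return false;  // pattern words remain but the text is exhausted
        }
        if (pat[p].equals("*")) {
            // "*" consumes one or more words: try every possible length.
            for (int next = d + 1; next <= doc.length; next++) {
                if (matchesAt(pat, p + 1, doc, next)) {
                    return true;
                }
            }
            return false;
        }
        if (pat[p].equals("?") || pat[p].equals(doc[d])) {
            // "?" consumes exactly one word; a literal must equal the current word.
            return matchesAt(pat, p + 1, doc, d + 1);
        }
        return false;
    }

    // True if the pattern matches some contiguous run of words in the text.
    static boolean contains(String text, String pattern) {
        String[] doc = tokenize(text);
        String[] pat = tokenize(pattern);
        for (int start = 0; start < doc.length; start++) {
            if (matchesAt(pat, 0, doc, start)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "the united states of america";
        System.out.println(contains(text, "united ? of america"));
        System.out.println(contains(text, "united * america"));
        System.out.println(contains(text, "united ? america"));
    }
}
```

This scans the raw text, so it only illustrates the matching rules; doing the same against an index is the open question here.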
         
         There might be other ways to do this that I am not aware of. Let me know your thoughts
on this. I would really appreciate any suggestions you might have.

thanks,

Rajesh Munavalli

-----Original Message-----
From: hossman@hal.rescomp.berkeley.edu on behalf of Chris Hostetter
Sent: Fri 7/29/2005 6:11 PM
To: java-user@lucene.apache.org
Subject: RE: n-gram indexing
 
: Document 1:
: "united states is .... United airlines operates in 50 states. United
: states government....."
:
: Document 2:
: "united states is .... United airlines operates in 50 states. United
: some other word states"
:
: If you consider the tf-idf weight of the individual terms "united" and
: "states", they would have exactly the same score in both documents. But the
: term "united states" should have a higher weight in Document 1. The higher
: weight can be achieved through bi-grams, which would include the term
: "united states". My guess is that Lucene will retrieve both documents with
: equal score.
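
The counting argument in the quoted text can be checked with a small plain-Java sketch (no Lucene involved). It counts raw occurrences only, not actual tf-idf scores. The two strings are the quoted documents as written; the "...." elisions are simply dropped by the tokenizer, which is an assumption about how to treat them. The names BigramCountSketch, countTerm and countBigram are hypothetical.

```java
import java.util.Locale;

class BigramCountSketch {

    // Assumed tokenization: lowercase, split on anything that is not a letter or digit.
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    // Number of times a single word occurs in the text.
    static int countTerm(String text, String term) {
        int count = 0;
        for (String token : tokenize(text)) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Number of times "first" is immediately followed by "second".
    static int countBigram(String text, String first, String second) {
        String[] tokens = tokenize(text);
        int count = 0;
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (tokens[i].equals(first) && tokens[i + 1].equals(second)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String doc1 = "united states is .... United airlines operates in 50 states. "
                + "United states government.....";
        String doc2 = "united states is .... United airlines operates in 50 states. "
                + "United some other word states";
        for (String doc : new String[] {doc1, doc2}) {
            System.out.println("united=" + countTerm(doc, "united")
                    + " states=" + countTerm(doc, "states")
                    + " \"united states\"=" + countBigram(doc, "united", "states"));
        }
    }
}
```

Both documents contain "united" and "states" three times each, but the bigram "united states" occurs twice in Document 1 and once in Document 2, which is the difference the single-term counts cannot see.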

I believe your intuition is correct, but did you try what i suggested
below? .. i don't believe you have to index all of the possible n-grams,
just construct a boolean query containing the n-grams, and boost the
larger n-grams (by some factor only experimenting will tell you).

you should also try out SpanNearQuery ... I believe it scores things
higher if they appear closer together -- but i'm not confident of my
memory in that case.

Seriously, try what i suggest below ... experimentation is the best way to
answer questions like this...

: -----Original Message-----
: Sent: Monday, July 18, 2005 5:11 PM
: To: java-user@lucene.apache.org
: Subject: RE: n-gram indexing
:
:
: Your intuition is right, but i can't think of any reason why you need to
: add the n-grams at indexing time -- or why using phrase queries would be
: a bad thing in this case.  When you get a multiword query, construct the
: n-grams of the query words as multiple phrase queries and search for a
: BooleanQuery of all those phrases (with bigger boosts given to longer
: phrases)
:
: Using your example of "united states of america" generate a query that
: looks something like...
:
:      united states america
:      "united states"^10 "states america"^10
:      "united states america"^100
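
As a rough illustration of the query construction described above, the following plain-Java sketch turns a list of query words into a query string of that shape. It only builds text; it does not call Lucene. The boost of 10 per extra word simply mirrors the numbers in the example and is an assumption (the right factor is something to find by experiment), and the names NGramQuerySketch and build are hypothetical. Stop words such as "of" are assumed to have been removed already, as in the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NGramQuerySketch {

    // Builds: single words unboosted, then every word n-gram (n >= 2) as a
    // quoted phrase boosted by 10^(n-1). The boost factor is an assumption.
    static String build(String input) {
        String[] words = input.trim().split("\\s+");
        List<String> clauses = new ArrayList<>();
        for (int n = 1; n <= words.length; n++) {
            long boost = 1;
            for (int i = 1; i < n; i++) {
                boost *= 10;  // overflows for very long queries; fine for a sketch
            }
            for (int start = 0; start + n <= words.length; start++) {
                String gram = String.join(" ", Arrays.copyOfRange(words, start, start + n));
                clauses.add(n == 1 ? gram : "\"" + gram + "\"^" + boost);
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        // prints: united states america "united states"^10 "states america"^10 "united states america"^100
        System.out.println(build("united states america"));
    }
}
```

The resulting string would still have to be handed to a query parser, or the same loop could add clauses to a BooleanQuery directly.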


-Hoss

