lucene-java-user mailing list archives

From Karl Wettin <karl.wet...@kodapan.se>
Subject Re: Best practices in boosting by proximity?
Date Sat, 04 May 2013 18:41:55 GMT
The simplest solution is to use slop in PhraseQuery, SpanNearQuery, etc. Also consider
combining in-order and unordered variants (see #isInOrder()) with different query boosts.

Even though a sloppy query already scores higher the closer the terms are, in some cases
(usually when combined with other subqueries) it can still make sense to build a BooleanQuery
that contains the same query several times, giving a greater boost to the clause with the smaller slop.
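
As a rough sketch of that idea, the snippet below only assembles a query string in the
classic query parser syntax ("phrase"~slop^boost); it does not call Lucene itself. The class
name, the slop tiers and the boost values are made up for illustration and would need tuning
against a real index. It also assumes the parser's default operator is OR, so the phrase
clauses without a "+" stay optional, and that the terms contain no query syntax characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: required terms plus optional sloppy phrases with tiered boosts. */
class TieredSlopQuery {

    /**
     * @param terms      search terms, assumed to be free of query syntax characters
     * @param slopBoosts slop -> boost pairs (illustrative values, not recommendations)
     */
    static String build(List<String> terms, Map<Integer, Integer> slopBoosts) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms) {
            clauses.add("+" + term); // every term is required
        }
        String phrase = "\"" + String.join(" ", terms) + "\"";
        for (Map.Entry<Integer, Integer> tier : slopBoosts.entrySet()) {
            // optional clause of the form "phrase"~slop^boost
            clauses.add(phrase + "~" + tier.getKey() + "^" + tier.getValue());
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        Map<Integer, Integer> slopBoosts = new LinkedHashMap<>();
        slopBoosts.put(2, 4);  // tight match, larger boost
        slopBoosts.put(20, 1); // loose match, smaller boost
        List<String> terms = Arrays.asList("lannisters", "always", "pay", "their", "debts");
        System.out.println(build(terms, slopBoosts));
    }
}
```

A document that matches the tight tier also matches the loose one, so it collects both
boosts, while a document with scattered terms only satisfies the required clauses.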

You could also consider using shingles (even in combination with the above) to match documents
where the terms are adjacent or very close to each other. Generally it's hard to define a best
practice. It depends on the corpora your index represents, your queries and your needs.
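
To make the shingle idea concrete, here is a minimal stand-alone sketch of what word shingles
are. It is plain Java and deliberately does not use Lucene's ShingleFilter; the class and
method names are hypothetical. In a real index the shingles would be produced by the analyzer
at index and query time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of word shingles (token n-grams). */
class ShingleSketch {

    /** Returns every run of `size` consecutive tokens, joined by a single space. */
    static List<String> shingles(List<String> tokens, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("lannisters", "always", "pay", "their", "debts");
        // prints [lannisters always, always pay, pay their, their debts]
        System.out.println(shingles(tokens, 2));
    }
}
```

For M terms this yields M-1 two-word shingles, which are the same pairs as the M-1 phrase
queries proposed in the question; indexing them as single tokens trades a larger index for
cheaper matching of adjacent terms.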

Given your question it looks like you're using the query parser. Try something like "your
proximity query"~20, but keep in mind the cost of a large slop.


		karl 

On 4 May 2013 at 19:46, Gili Nachum wrote:

> Hi. *I would like for hits that contain the search terms in proximity to
> each other to be ranked higher than hits in which the terms are scattered
> across the doc.
> Wondering if there's a best practice to achieve that?*
> I also want all hits to contain all of the search terms (implicit
> AND):
> 
> *Example:* when users search for: "lannisters always pay their debts", the
> 4 matching results should be ranked as follows (for simplicity, assume
> equal field norms and TF/IDF in all hits):
> 1. "It is known that *Lannisters always pay their debts*"
> 2. "... Lannisters ... they sometimes *pay their debts* ... always with you"
> 3. "*Lannisters always* win ... debts ... pay tax ... their nature"
> 4. "Lannisters ... always ... pay ... their ... debts"
> 
> The first result has all 5 terms in proximity to each other.
> The second has 3 terms in proximity.
> The third has 2 terms in proximity.
> The fourth has none of the terms in proximity to each other.
> 
> My current AND query that ignores proximity is: +lannisters +always +pay
> +their +debts
> So if there are M terms, I was thinking that I could add M-1 SHOULD phrase
> queries to the original query:
> "lannisters always" "always pay" "pay their" "their debts".
> 
> What are the pros and cons? Are there alternatives to consider?
> Any Lucene class that helps achieve this?
> 
> Thx!

