lucene-java-user mailing list archives

From "Preetam Rao" <>
Subject matching sub phrases in user entered query...
Date Mon, 14 Jul 2008 17:14:26 GMT

Sorry if you receive this mail a second time; I am having some trouble with my
mailbox.

Is there a query in Lucene which matches sub phrases?

For example, suppose the document text is "new york existing homes *3 bed 2
bath* homes 3 miles from city center 2 rooms" and the user enters "Brooklyn
homes with *3 bed* rooms and swimming pools". I would like to recognize the
fact that the document contains a sub phrase of the user query and give it a
higher score than a document which contains all the terms but not in phrases,
for example "new york *2 bed 3 bath*" (should not match). Also note that I do
not want the *3* and *2* in "*3* miles" and "*2* rooms" to affect the query
when I have a sub phrase match.

This is something of a middle ground between a pure 'boolean OR' query and an
'exact phrase query'. Here, the more sub phrases that match and the longer
they are, the higher the score compared to individual term matches.
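
To make the scoring I have in mind concrete, here is a minimal sketch in plain
Java. It does not use any Lucene API; the class name SubPhraseScorer, the
whitespace tokenization and the "run length squared" weighting are all
assumptions chosen only for illustration. It walks the query left to right,
finds the longest run of query terms that occurs contiguously in the document,
and rewards longer runs more than the same number of single-term matches.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: scores a document by the
// contiguous query sub phrases it contains.
final class SubPhraseScorer {

    // Assumption: lower-cased whitespace tokenization is good enough here.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Length of the longest run query[qStart..] that appears contiguously
    // somewhere in the document (0 if the first term is absent).
    static int longestRun(List<String> query, int qStart, List<String> doc) {
        int best = 0;
        for (int d = 0; d < doc.size(); d++) {
            int len = 0;
            while (qStart + len < query.size()
                    && d + len < doc.size()
                    && query.get(qStart + len).equals(doc.get(d + len))) {
                len++;
            }
            best = Math.max(best, len);
        }
        return best;
    }

    // Greedy left-to-right scan: each matched run of length n adds n * n,
    // so one 2-term phrase (4) outweighs two isolated terms (1 + 1).
    static int score(String queryText, String docText) {
        List<String> query = tokenize(queryText);
        List<String> doc = tokenize(docText);
        int score = 0;
        int q = 0;
        while (q < query.size()) {
            int run = longestRun(query, q, doc);
            if (run == 0) {
                q++;            // term not in the document at all
                continue;
            }
            score += run * run; // assumed weighting, purely illustrative
            q += run;
        }
        return score;
    }
}
```

With the example above, the first document scores 1 ("homes") + 4 ("3 bed") +
1 ("rooms") = 6, while "new york 2 bed 3 bath" scores only 1 + 1 = 2 because
"3" and "bed" never occur next to each other in query order. Note that this
sketch still counts isolated terms, so it ranks the second document lower
rather than excluding it.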

I was redirected to the Shingle filter, which is a token filter that spits out
n-grams. But it does not seem to be the best solution, since one does not know
in advance what n in the n-grams should be. Also, it means one has to get all
these bigrams and then construct a boolean OR query, which is not very
efficient.
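
To illustrate the concern about not knowing n in advance, here is a small
sketch in plain Java (again no Lucene API; QueryShingles and subPhrases are
hypothetical names) that enumerates every contiguous sub phrase of a query
from a minimum size up to the full query length. Each result would have to
become one clause of the boolean OR query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: lists all contiguous
// sub phrases of a query with at least minSize terms.
final class QueryShingles {

    static List<String> subPhrases(String queryText, int minSize) {
        String[] tokens = queryText.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            // end is exclusive; the shortest phrase has minSize terms
            for (int end = start + minSize; end <= tokens.length; end++) {
                out.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            }
        }
        return out;
    }
}
```

For the nine-term query in my example, subPhrases(query, 2) yields 36
candidate phrases (8 + 7 + ... + 1), so the number of clauses grows
quadratically with the query length.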

Has anyone tried any alternative approaches? I would appreciate your thoughts. The
use case is for user entered queries when one does interpret or parse them,
but to ensure that the better sub phrases user enters, the better results he

