lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Starts With x and Ends With x Queries
Date Sat, 05 Feb 2005 02:37:44 GMT

: Also keep in mind that QueryParser only allows a trailing asterisk,
: creating a PrefixQuery.  However, if you use a WildcardQuery directly,
: you can use an asterisk as the starting character (at the risk of
: performance).

On the issue of "ends with" wildcard queries, I wanted to throw out and
idea that i've seen used to deal with matches like this in other systems.
I've never acctually tried this with Lucene, but I've seen it used
effectively with other systems where the goal is to "sort" strings by the
least significant (ie: right most) characters first.  I think it could
apply nicely to people who have compelling needs for efficent 'ends with'
queries.



Imagine you have a field call name, which you can already do efficient
prefix matching on using the PrefixQuery class.  Your docs and query may
look something like this...

   D1> name:"Adam Smith" age:13 state:CA ...
   D2> name:"Joe Bob" age:42 state:WA ...
   D3> name:"John Adams" age:35 state:NV ...
   D3> name:"Sue Smith" age:33 state:CA ...

...and your queries may look something like...

   Query q1 = new PrefixQuery(new Term("name","J*"));
   Query q2 = new PrefixQuery(new Term("name","Sue*"));

If you want to start doing suffix queries (ie: all names ending with
"s", or all names ending with "Smith") one approach would be to use
WildcarQuery, which as Erik mentioned, will allow you to use a quey Term
that starts with a "*". ie...

   Query q3 = new WildcardQuery(new Term("name","*s"));
   Query q4 = new WildcardQuery(new Term("name","*Smith"));

(NOTE: Erik says you can do this, but the docs for WildcardQuery say you
can't I'll assume the docs are wrong and Erik is correct.)

The problem is that this is horrendously inefficient.  In order to find
the docs that contain Terms which match your suffix, WildcardQuery must
first identify what all of those Terms are, by iterating over every Term
in your index to see if they match the suffix.  This is much slower then a
PrefixQuery, or even a WildcardQuery that has just 1 initial character
before a "*" (ie: "s*foobar"), because it can then seek to directly to the
first Term that starts with that character, and also stop iterating as
soon as it encounters a Term that no longer begins with that character.

Which leads me to my point: if you denormalize your data so that you store
both the Term you want, and the *reverse* of the term you want, then a
Suffix query is just a Prefix query on a reversed field -- by sacrificing
space, you can get all the speed efficiencies of a PrefixQuery when doing
a SuffixQuery...

   D1> name:"Adam Smith" rname:"htimS madA" age:13 state:CA ...
   D2> name:"Joe Bob" rname:"boB oeJ" age:42 state:WA ...
   D3> name:"John Adams" rname:"smadA nhoJ" age:35 state:NV ...
   D3> name:"Sue Smith" rname:"htimS euS" age:33 state:CA ...

   Query q1 = new PrefixQuery(new Term("name","J*"));
   Query q2 = new PrefixQuery(new Term("name","Sue*"));
   Query q3 = new PrefixQuery(new Term("rname","s*"));
   Query q4 = new PrefixQuery(new Term("rname","htimS*"));


(If anyone sees a flaw in my theory, please chime in)


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message