lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Oliver Christ" <ochr...@ebscohost.com>
Subject RE: Alternative for WildcardQuery with leading *
Date Fri, 07 Dec 2012 12:13:30 GMT
If I remember correctly it was Baeza-Yates or someone in his group at U Santiago who came up
with the rotated term indexing. 

Indexing "abc", you explicitly mark end of string and index all rotations using a data structure
which supports prefix search (such as a trie):

abc$
bc$a
c$ab
$abc

This index directly supports all prefix, suffix, circumfix, and infix wildcard searches (a*,
*c, a*c, *b*) by transforming the query:

a* 	-> $a		(all terms starting with a)
a*c 	-> c$a	(all terms starting with a and ending in c)
*c	-> c$		(all terms ending with c)
*b*	-> b		(all terms containing b)

You then simply run a prefix search for the transformed query in the trie of rotated strings.
Quite similar to suffix arrays, but adds circumfix support (a*c) which, if I remember correctly,
are not directly supported by standard suffix arrays. One can mark up start and end of the
indexed term if the same trie should also be used for exact match ($abc$).

Cheers, Oli


-----Original Message-----
From: Uwe Schindler [mailto:uwe@thetaphi.de] 
Sent: Friday, December 07, 2012 4:34 AM
To: java-user@lucene.apache.org
Subject: RE: Alternative for WildcardQuery with leading *

In general, you seem to need decomposing...: vacancyplan -> tokenized to -> vacancyplan,
vacancy, plan. Wildcards are in general not really a replacement for correct text analysis
on the indexing side. Unfortunately, decomposing is a hard task, but there are dictionary-based
algorithms for e.g. German.

About the question with wildcards: If you really need a wildcard  in front, the common use
case is to also index all terms in backwards order. E.g. Solr does this with ReverseWildCardTokenFilter,
which indexes all terms forward and also in reverse order with some extra "marker-prefix"
on the reverse terms (to be able to differentiate between forwards and backwards terms in
the same field - of course, you could also index the reverse terms into an extra field). The
query parser then automatically produces a simple PrefixQuery with a transformed query term
(so it queries the magic extra reverse terms with a prefix).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Clemens Wyss DEV [mailto:clemensdev@mysign.ch]
> Sent: Friday, December 07, 2012 10:16 AM
> To: java-user@lucene.apache.org
> Subject: Alternative for WildcardQuery with leading *
> 
> In order to provide suggestions our query also includes a 
> "WildcardQuery with a leading *", which, of course, has a HUGE 
> performance impact :-(
> 
> E.g.
> Say we have indexed "vacancyplan", then if a user typed "plan" he 
> should also be offered "vacancyplan" ...
> 
> How can this feature be implemented without WcQwl* ?
> 
> Any help appreciated
> - Clemens
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Mime
View raw message