lucene-solr-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: query with stemming, prefix and fuzzy?
Date Thu, 29 Jan 2009 17:39:48 GMT
Truncation queries and stemming are difficult partners. You likely have 
to accept a compromise. You can try using multiple fields like you are, 
you can try indexing the full term at the same position as the stemmed 
term, or you can accept the weirdness that comes from matching on a 
stemmed form (potentially very confusing for a user).
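
To make the "same position" option concrete, here is a minimal sketch in plain Python rather than Lucene code. The `toy_stem` and `analyze` names are hypothetical, and the suffix stripping is a deliberately naive stand-in for a real stemmer. The only point is that the original and stemmed forms are emitted at the same position, which is roughly what a position increment of 0 gives you in an analysis chain.

```python
# Toy illustration only: naive suffix stripping, not a real stemmer.
def toy_stem(token):
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Return (position, term) pairs; original and stemmed forms share a position."""
    postings = []
    for position, token in enumerate(text.lower().split()):
        postings.append((position, token))
        stem = toy_stem(token)
        if stem != token:
            # Same position as the original, like a position increment of 0.
            postings.append((position, stem))
    return postings


for position, term in analyze("Houses near lakes"):
    print(position, term)
```

With both forms in the index, a prefix or fuzzy query can match the unstemmed term, while a stemmed query still matches the stem, and phrase positions stay aligned.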

In any case though, a query parser that supports FuzzyQuery should not be 
analyzing it. What parser are you using? If it is analyzing the fuzzy 
syntax, it likely doesn't support it.
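
As a rough illustration of the difference, the sketch below is plain Python, not any real Lucene or Solr parser. The `FUZZY` pattern, `naive_analyze`, `parse_clause`, and the 0.5 default similarity are all assumptions made for the example. It shows what happens when the raw clause goes straight to an analyzer, compared with a parser that peels off the `~0.6` operator first.

```python
import re

# Hypothetical pattern for "term~similarity"; real parser grammars are richer.
FUZZY = re.compile(r"^(?P<term>[^~\s]+)~(?P<sim>\d*\.?\d+)?$")


def naive_analyze(text):
    """Mimic an analyzer that splits on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def parse_clause(clause, default_similarity=0.5):
    """Peel off the fuzzy operator first, so it never reaches the analyzer."""
    match = FUZZY.match(clause)
    if match:
        sim = float(match.group("sim")) if match.group("sim") else default_similarity
        return ("fuzzy", match.group("term").lower(), sim)
    return ("terms", naive_analyze(clause))


print(naive_analyze("house~0.6"))  # the analyzer sees "house", "0" and "6"
print(parse_clause("house~0.6"))   # the parser keeps the term and similarity apart
```

The first call reproduces the stray "0" and "6" tokens described in the quoted message. The second keeps the similarity out of the token stream entirely.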

Fuzzy queries are slow - especially if they match a lot of terms. A 
BooleanQuery is created with a clause for each term, and then an edit 
distance is calculated to filter out what doesn't match.

The prefix length determines how many terms are enumerated - with the 
default of 0, every term is enumerated, I think. And an edit distance is 
calculated to filter them out. That's really slow - a longer prefix will 
significantly cut down the number of terms that need to be enumerated.

Think of mark~0.6 - with a 0 prefix I will enumerate every term and 
check the edit distance. With a 2 prefix I will only enumerate the terms 
that start with ma, and calculate an edit distance. One might be just a 
bit faster.
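
A small sketch of that difference, in plain Python over a made-up sorted term list rather than a real index. The `fuzzy_candidates` helper and the simplified `similarity` score are assumptions for illustration, not Lucene's actual formula or API. It counts how many terms get enumerated for `mark~0.6` with prefix lengths of 0 and 2.

```python
from bisect import bisect_left


def edit_distance(a, b):
    """Classic Levenshtein distance, keeping only two rows of the table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a, b):
    # Simplified score (assumption): 1 - distance / length of the shorter string.
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return 1.0 - edit_distance(a, b) / shortest


def fuzzy_candidates(sorted_terms, query, min_similarity, prefix_length):
    """Return (matching terms, number of terms enumerated)."""
    prefix = query[:prefix_length]
    start = bisect_left(sorted_terms, prefix)  # seek to the first term >= prefix
    matches, enumerated = [], 0
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can share the prefix
        enumerated += 1
        if similarity(term, query) >= min_similarity:
            matches.append(term)
    return matches, enumerated


terms = sorted(["apple", "house", "make", "mare", "mark", "marks", "mask", "park", "zebra"])
print(fuzzy_candidates(terms, "mark", 0.6, 0))
print(fuzzy_candidates(terms, "mark", 0.6, 2))
```

On this toy list, prefix length 0 enumerates all 9 terms and prefix length 2 enumerates only the 5 that start with "ma". The trade-off is that "park" no longer matches once a prefix is required.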

The latest trunk build of Lucene will let us switch FuzzyQuery to use a 
constant score mode - this will eliminate the BooleanQuery and should 
perform much better on a large index. Solr already uses a constant score 
mode for Prefix and Wildcard queries.

How big is your index? If it's not that big, it may be odd that you're 
seeing things that slow (the number of unique terms in the index will play a 
large role).

- Mark

Gert Brinkmann wrote:
> Hello,
>
> I am trying to get Solr to properly work. I have set up a Solr test
> server (using jetty as mentioned in the tutorial). Also I had to modify
> the schema.xml so that I have different fields for different languages
> (with their own stemmers) that occur in the content management system
> that I am indexing. So far everything does work fine including snippet
> highlighting.
>
> But now I am having some problems with two things:
>
> A) fuzzy search
>
> When trying to do a fuzzy search the analyzers seem to break up a search
> string like "house~0.6" into "house", "0" and "6" so that e.g. a single
> "6" is highlighted, too. So I tried to use an additional raw-field
> without any stemming and just a lower case and white space analyzer.
> This seems to work fine. But fuzzy query is very slow and takes 100% CPU
> for several seconds with only one query at a time.
>
> What can I do to speed up the fuzzy query? I have e.g. found a Lucene
> parameter prefixLength but no corresponding Solr option. Does this exist?
> Are there some other options to pay attention to?
>
>
> B) combine stemming, prefix and fuzzy search
>
> Is there a way to combine all three of these query types in one query?
> Especially stemming and prefixing? I think it would be problematic as a
> "house*" would be analyzed to "house" with the usual analyzers that are
> required for stemming?
>
> Do I need different query type fields and combine them with a boolean
> OR in the query? Something like
>
>   data:house OR data_fuzzy:house~0.6 OR data_prefix:house*
>
> This feels to be a little bit circuitous. Is there a way to use
> "house*~.6" including correct stemming?
>
> Thank you,
> Gert
>   

