lucene-solr-user mailing list archives

From Gert Brinkmann <g...@netcologne.de>
Subject Re: query with stemming, prefix and fuzzy?
Date Fri, 30 Jan 2009 15:07:42 GMT

Thanks, Mark, for your answer,

Mark Miller wrote:
> Truncation queries and stemming are difficult partners. You likely have
> to accept compromise. You can try using multiple fields like you are,

I already have multiple fields, one per language, to be able to use
different stemmers. Wouldn't this become too much?

> you can try indexing the full term at the same position as the stemmed
> term,

What does "at the same position" mean, and how could I do this?
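
For reference, a minimal sketch of what "at the same position" could mean:
the analyzer emits the stemmed form as an extra token with a position
increment of 0, so it is stacked on top of the original term. This is plain
Python, not Solr or Lucene code; `toy_stem` and `stacked_tokens` are
hypothetical names, and the suffix list is a crude stand-in for a real
German stemmer.

```python
from typing import List, Tuple


def toy_stem(term: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ern", "er", "en", "e", "s"):
        # Only strip if at least three characters remain.
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def stacked_tokens(text: str) -> List[Tuple[str, int]]:
    """Return (term, position) pairs for whitespace-separated, lowercased words.

    The stemmed form, when it differs from the original, is emitted at the
    same position as the original (i.e. with a position increment of 0).
    """
    tokens: List[Tuple[str, int]] = []
    for position, word in enumerate(text.lower().split()):
        tokens.append((word, position))
        stem = toy_stem(word)
        if stem != word:
            tokens.append((stem, position))
    return tokens


if __name__ == "__main__":
    for term, position in stacked_tokens("Kinder spielen"):
        print(position, term)
```

With both forms indexed at one position, a prefix or fuzzy query can match
the full term, a stemmed query can match the stem, and phrase queries still
see the words at their original positions.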

> or you can accept the weirdness that comes from matching on a
> stemmed form (potentially very confusing for a user).

Currently I am thinking about dropping the stemming and using only
prefix search. But highlighting does not work with a prefix query such as
"house*", which is a problem for me. The hint to use "house?*" instead
does not work here.

> In any case though, a queryparser that supports fuzzyquery should not be
> analyzing it. What parser are you using? If it is analyzing the fuzzy
> syntax, it doesn't likely support it.

I am using the following definitions (testing them with and without stemming):
>     <fieldType name="text_de_de" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <!-- in this example, we will only use synonyms at query time
>         <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>         -->
>         <!-- Case insensitive stop word removal.
>              enablePositionIncrements=true ensures that a 'gap' is left to
>              allow for accurate phrase queries.
>         -->
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="stopwords_de_de.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <!-- <filter class="solr.SnowballPorterFilterFactory" language="German"/> -->
>         <!-- <filter class="solr.ISOLatin1AccentFilterFactory"/> -->
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de_de.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de_de.txt"/>
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <!-- <filter class="solr.SnowballPorterFilterFactory" language="German"/> -->
>         <!-- <filter class="solr.ISOLatin1AccentFilterFactory"/> -->
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>     </fieldType>

And, well, the parser? Where is the parser specified? Do you mean the
request handler "qt" (that will be "standard", as I have not set it yet)?


> The prefix length determines how many terms are enumerated - with the

Can the prefix length be set in Solr? I could not find such an option.
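
To illustrate why the prefix length matters, here is a minimal sketch in
plain Python, not Lucene's actual FuzzyQuery implementation: only terms that
share the first `prefix_length` characters with the query term are
enumerated and compared at all. The function names, the sample term list,
and the plain edit-distance cutoff (`max_edits`) are assumptions for
illustration; as far as I know, Lucene at this time used a minimum
similarity score instead of a fixed number of edits.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        prev = cur
    return prev[-1]


def fuzzy_candidates(
    query: str, terms: List[str], prefix_length: int, max_edits: int = 2
) -> Tuple[List[str], List[str]]:
    """Return (enumerated, matches).

    `enumerated` are the index terms that had to be compared, i.e. those
    sharing the required prefix; `matches` are the ones within `max_edits`.
    """
    prefix = query[:prefix_length]
    enumerated = [t for t in terms if t.startswith(prefix)]
    matches = [t for t in enumerated if edit_distance(query, t) <= max_edits]
    return enumerated, matches


if __name__ == "__main__":
    index_terms = ["haus", "maus", "hausen", "hund", "laus"]  # made-up terms
    for n in (0, 1):
        enumerated, matches = fuzzy_candidates("haus", index_terms, n)
        print(n, len(enumerated), matches)
```

With a prefix length of 0 every term in the index is a candidate; a longer
prefix shrinks the enumerated set at the cost of missing terms that differ
in their first characters (here "maus" and "laus").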

> The latest trunk build on Lucene will let us switch fuzzy query to use a
> constant score mode - this will eliminate the booleanquery and should
> perform much better on a large index. Solr already uses a constant score
> mode for Prefix and Wildcard queries.

Much better performance is always good. When will this feature be
available in Solr?

> How big is your index? If it's not that big, it may be odd that you're
> seeing things that slow (number of unique terms in the index will play a
> large role).

Well, the index currently contains about 5000 documents. These are
HTML pages, some of them concatenated with PDFs/DOCs (downloads
linked from the HTML page) converted to text. The index data is about
11 MB (optimized). So I think this is a rather small index.

Greetings,
Gert
