Message-ID: <498341CA.2020506@gmail.com>
Date: Fri, 30 Jan 2009 13:07:06 -0500
From: Mark Miller <markrmiller@gmail.com>
User-Agent: Thunderbird 2.0.0.19 (X11/20090105)
MIME-Version: 1.0
To: solr-user@lucene.apache.org
Subject: Re: query with stemming, prefix and fuzzy?
References: <497F3AD7.5070300@netcologne.de> <4981E9E4.3020100@gmail.com>
 <498317BE.5000206@netcologne.de>
In-Reply-To: <498317BE.5000206@netcologne.de>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

Gert Brinkmann wrote:
> Thanks, Mark, for your answer,
>
> Mark Miller wrote:
>   
>> Truncation queries and stemming are difficult partners. You likely have
>> to accept compromise. You can try using multiple fields like you are,
>>     
>
> I already have multiple fields, one per language, to be able to use
> different stemmers. Wouldn't this become too much?
>   
Possibly. Especially if you are using norms with all of those fields. 
Depends on your index though.
>   
>> you can try indexing the full term at the same position as the stemmed
>> term,
>>     
>
> what does this mean "at the same position" and how could I do this?
>   
Write a custom filter. Normally, for every term, its position is 
incremented by 1 as the terms are broken out in tokenization. You can 
change this and index terms at the same position using your own filter. 
There are ramifications, because you are adding more terms to your 
index, but it allows you to index multiple forms of a term at the same 
position (so that phrase queries still work as expected).
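
To make the idea concrete, here is a rough sketch in plain Python. It is a toy model of position increments, not the Lucene TokenFilter API; `toy_stem`, `stack_original_and_stem` and `absolute_positions` are names made up for this example. Each token carries a position increment, and emitting the stemmed form with an increment of 0 stacks it on the same position as the original term.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (term, position increment)


def toy_stem(term: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a trailing 's'."""
    return term[:-1] if term.endswith("s") and len(term) > 3 else term


def stack_original_and_stem(terms: List[str]) -> List[Token]:
    """Emit each original term with increment 1, then its stem with increment 0."""
    out: List[Token] = []
    for term in terms:
        out.append((term, 1))      # normal case: advance one position
        stem = toy_stem(term)
        if stem != term:
            out.append((stem, 0))  # increment 0: same position as the original
    return out


def absolute_positions(tokens: List[Token]) -> List[Tuple[str, int]]:
    """Turn position increments into absolute positions, starting at 0."""
    pos = -1
    result = []
    for term, inc in tokens:
        pos += inc
        result.append((term, pos))
    return result


tokens = stack_original_and_stem(["red", "houses", "burn"])
positions = absolute_positions(tokens)
print(positions)
```

Both "houses" and "house" end up at the same position, so a phrase query for "red house burn" and one for "red houses burn" would line up against the same slots. The cost is the extra terms in the index.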
>   
>> or you can accept the weirdness that comes from matching on a
>> stemmed form (potentially very confusing for a user).
>>     
>
> Currently I think about dropping the stemming and only use
> prefix-search. But as highlighting does not work with a prefix "house*"
> this is a problem for me. The hint to use "house?*" instead does not
> work here.
>   
That's because wildcard queries are also not highlightable right now. I 
actually have something of a solution to this that I'll work on soon 
(the groundwork for it is either already in Lucene or ready to go in). No 
guarantee on when, or if, it will be accepted into Solr, though.
>   
>> In any case though, a query parser that supports FuzzyQuery should not be
>> analyzing it. What parser are you using? If it is analyzing the fuzzy
>> syntax, it likely doesn't support it.
>>     
>
> I am using the following definitions (testing it with and without stemming):
>   
>>     <fieldType name="text_de_de" class="solr.TextField" positionIncrementGap="100">
>>       <analyzer type="index">
>>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>         <!-- in this example, we will only use synonyms at query time
>>         <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>         -->
>>         <!-- Case insensitive stop word removal.
>>              enablePositionIncrements=true ensures that a 'gap' is left to
>>              allow for accurate phrase queries.
>>         -->
>>         <filter class="solr.StopFilterFactory"
>>                 ignoreCase="true"
>>                 words="stopwords_de_de.txt"
>>                 enablePositionIncrements="true"
>>                 />
>>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>         <filter class="solr.LowerCaseFilterFactory"/>
>>         <!-- <filter class="solr.SnowballPorterFilterFactory" language="German"/> -->
>>         <!-- <filter class="solr.ISOLatin1AccentFilterFactory"/> -->
>>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>       </analyzer>
>>       <analyzer type="query">
>>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de_de.txt" ignoreCase="true" expand="true"/>
>>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de_de.txt"/>
>>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>         <filter class="solr.LowerCaseFilterFactory"/>
>>         <!-- <filter class="solr.SnowballPorterFilterFactory" language="German"/> -->
>>         <!-- <filter class="solr.ISOLatin1AccentFilterFactory"/> -->
>>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>       </analyzer>
>>     </fieldType>
>>     
>
> and, well, the parser? Where is the parser specified? Do you mean the
> request handler "qt" (that will be "standard", as I do not set it yet)?
>   
That's odd. I'll have to look at this more closely to be of help.
>
>   
>> The prefix length determines how many terms are enumerated - with the
>>     
>
> Can the prefix length be set in Solr? I could not find such an option.
>   
I don't think there is an option in Solr. Patches welcome, of course. It 
would be a nice one to have: with the default prefix length of 0, the 
query has to enumerate every term in the field, which scales *very* poorly.
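
A rough way to see why the prefix length matters, again as a toy Python sketch and not Lucene's actual FuzzyQuery code. The term list, `fuzzy_candidates` and the fixed `max_edits` cutoff are assumptions made for illustration (Lucene's fuzzy query at this point works from a similarity threshold rather than a fixed edit count). Only terms sharing the required prefix have to be scanned and compared, so a longer prefix means fewer terms enumerated, at the price of missing matches that differ in the first characters.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def fuzzy_candidates(term: str, index_terms: List[str],
                     prefix_length: int, max_edits: int = 2
                     ) -> Tuple[List[str], List[str]]:
    """Return (terms scanned, terms matched) for a given required prefix length."""
    prefix = term[:prefix_length]
    scanned = [t for t in index_terms if t.startswith(prefix)]
    matched = [t for t in scanned if edit_distance(term, t) <= max_edits]
    return scanned, matched


# Hypothetical set of unique terms in a field.
index_terms = ["house", "houses", "horse", "mouse", "hose", "tree", "haus"]

scanned0, matched0 = fuzzy_candidates("house", index_terms, prefix_length=0)
scanned2, matched2 = fuzzy_candidates("house", index_terms, prefix_length=2)
print(len(scanned0), matched0)
print(len(scanned2), matched2)
```

With a prefix length of 0 every unique term is scanned. With a prefix length of 2 only the terms starting with "ho" are, so "mouse" and "haus" are no longer found.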
>   
>> The latest trunk build on Lucene will let us switch fuzzy query to use a
>> constant score mode - this will eliminate the booleanquery and should
>> perform much better on a large index. Solr already uses a constant score
>> mode for Prefix and Wildcard queries.
>>     
>
> much better performance is always good. When will this feature be
> available in Solr?
>   
Soon I hope. Since wildcard and prefix are already constant score, it 
only makes sense to make fuzzy query that way as well.
>   
>> How big is your index? If it's not that big, it may be odd that you're
>> seeing things that slow (number of unique terms in the index will play a
>> large role).
>>     
>
> Well, the index currently contains about 5000 documents. These are
> HTML-pages, some of them are concatenated with PDF/DOCs (Downloads
> linked from the HTML-page) converted to text. The index data is about
> 11MB (optimized). So I think this is just a small index.
>   
Yeah, sounds small. It's odd you would see such slow performance. It 
depends though. You may still have a *lot* of unique terms in there.