lucene-solr-user mailing list archives

From Jonathan Rochkind <rochk...@jhu.edu>
Subject Re: WordDelimiter filter, expanding to multiple words, unexpected results
Date Wed, 03 Sep 2014 15:48:18 GMT
Thanks Erick and Diego. Yes, I noticed in my last message I'm not 
actually using defaults, not sure why I chose non-defaults originally.

I still need to find time to make a smaller isolation/reproduction case, 
I'm getting confusing results that suggest some other part of my field 
def may be pertinent.

I'll come back when I've done that (hopefully next week), and include 
the _parsed_ from &debug=query then. Thanks!

Jonathan
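
[Editor's sketch, not part of the original message.] For readers following along, this is roughly how the &debug=query request Erick suggests could be built. The host, core name ("collection1"), and field name ("text") are assumptions for illustration only; substitute your own. In the response, the parsed query appears in the debug section (the "parsedquery" entry).

```python
from urllib.parse import urlencode


def build_debug_query_url(base_url, core, field, term):
    """Build a Solr select URL that asks for the parsed query via debug=query.

    base_url, core, and field are placeholders chosen for this example.
    """
    params = {
        "q": f"{field}:{term}",  # the query whose parsing we want to inspect
        "debug": "query",        # ask Solr to include query-parsing debug output
        "wt": "json",            # JSON response is easier to read in an email
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical local Solr instance and core name.
url = build_debug_query_url("http://localhost:8983/solr", "collection1", "text", "MacBook")
print(url)
```

Running the same request against the index-time and query-time configurations, and comparing the parsed query with the admin/analysis output, should show whether the catenated token is actually being searched.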


On 9/2/14 4:26 PM, Erick Erickson wrote:
> What happens if you append &debug=query to your query? IOW, what does the
> _parsed_ query look like?
>
> Also note that the defaults for WDFF are _not_ identical. catenateWords and
> catenateNumbers are 1 in the
> index portion and 0 in the query section. Still, this shouldn't be a
> problem all other things being equal.
>
> Best,
> Erick
>
>
> On Tue, Sep 2, 2014 at 12:43 PM, Jonathan Rochkind <rochkind@jhu.edu> wrote:
>
>> On 9/2/14 1:51 PM, Erick Erickson wrote:
>>
>>> bq: In my actual index, query "MacBook" is matching ONLY "mac book", and
>>> not "macbook"
>>>
>>> I suspect your query parameters for WordDelimiterFilterFactory don't have
>>> catenate words set.
>>>
>>> What do you see when you enter these in both the index and query portions
>>> of the admin/analysis page?
>>>
>>
>> Thanks Erick!
>>
>> Our WordDelimiterFilterFactory does have catenate words set, in both index
>> and query phases (is that right?):
>>
>> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> catenateAll="0" splitOnCaseChange="1"/>
>>
>> It's hard to cut and paste the results of the analysis page into email (or
>> anywhere!), I'll give you screenshots, sorry -- and I'll give them for our
>> whole real world app complex field definition. I'll also paste in our
>> entire field definition below. But I realize my next step is probably
>> creating a simpler isolation/reproduction case (unless you have a magic
>> answer from this!).
>>
>> Again, the problem is that "MacBook" seems to be only matching on indexed
>> "macbook" and not indexed "mac book".
>>
>>
>> "MacBook" query analysis:
>> https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png
>>
>> "MacBook" index analysis:
>> https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png
>>
>> "mac book" index analysis:
>> https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png
>>
>>
>> Our entire actual field definition:
>>
>>    <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
>> autoGeneratePhraseQueries="true">
>>        <analyzer>
>>         <!-- the rulefiles thing is to keep ICUTokenizerFactory from
>>              stripping punctuation,
>>              so our synonym filter involving C++ etc can still work.
>>              From: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3C51965E70.6070409@elyograg.org%3E
>>              the rbbi file is in our local ./conf, copied from lucene
>>              source tree -->
>>         <tokenizer class="solr.ICUTokenizerFactory"
>> rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>>
>>         <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt"
>> ignoreCase="true"/>
>>
>>          <filter class="solr.WordDelimiterFilterFactory"
>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>
>>
>>          <!-- folding needs to be after WordDelimiter, so WordDelimiter
>>               can do its thing with full cases and such -->
>>          <filter class="solr.ICUFoldingFilterFactory" />
>>
>>
>>          <!-- ICUFolding already includes lowercasing, no
>>               need for separate lowercasing step
>>          <filter class="solr.LowerCaseFilterFactory"/>
>>          -->
>>
>>          <filter class="solr.SnowballPorterFilterFactory"
>> language="English" protected="protwords.txt"/>
>>          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>        </analyzer>
>>      </fieldType>
>>
>>
>>
>>
>>
>
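
[Editor's sketch, not part of the original thread.] To make the discussion of catenateWords and splitOnCaseChange concrete, here is a deliberately simplified Python approximation of what those two options do to a token. It is not Lucene's WordDelimiterFilter: it ignores token positions, number handling, synonyms, folding beyond lowercasing, and stemming, and the function names are made up for this example.

```python
import re


def split_word_parts(token):
    """Split on non-alphanumerics and on lower-to-upper case changes."""
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        # Zero-width split where a lowercase letter is followed by an uppercase one.
        parts.extend(p for p in re.split(r"(?<=[a-z])(?=[A-Z])", chunk) if p)
    return parts


def analyze(text, catenate_words=True):
    """Return lowercased word parts, plus the catenated form when enabled."""
    tokens = []
    for raw in text.split():
        parts = split_word_parts(raw)
        tokens.extend(p.lower() for p in parts)
        # Rough stand-in for catenateWords="1": also emit the joined parts.
        if catenate_words and len(parts) > 1:
            tokens.append("".join(parts).lower())
    return tokens


for sample in ("MacBook", "mac book", "macbook"):
    print(sample, "->", analyze(sample))
```

Under these simplifying assumptions, "MacBook" yields mac, book, and macbook, so it shares tokens with both "mac book" and "macbook". Whether a real query matches both then depends on how the query parser combines those tokens (for example with autoGeneratePhraseQueries="true"), which is what the &debug=query output would show.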
