lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: WordDelimiter filter, expanding to multiple words, unexpected results
Date Tue, 02 Sep 2014 20:26:20 GMT
What happens if you append &debug=query to your query? IOW, what does the
_parsed_ query look like?

Also note that the defaults for WDFF are _not_ identical between index and
query: catenateWords and catenateNumbers are 1 in the index portion and 0 in
the query portion. Still, this shouldn't be a problem, all other things
being equal.
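
To make the index/query difference concrete, here is a rough sketch of a
field type with separate analyzers. The name text_split_example and the
whitespace tokenizer are placeholders picked for illustration, not anything
from your config; only the catenateWords/catenateNumbers values matter:

```xml
<!-- Sketch only: "text_split_example" is a hypothetical name and the
     tokenizer is a stand-in. The point is the differing WDFF attributes. -->
<fieldType name="text_split_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- index side: catenation on, so runs of word parts are also emitted joined -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- query side: catenation off, only the split parts are produced -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A field type with a single <analyzer> element and no type attribute uses the
same chain for both index and query.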

Best,
Erick


On Tue, Sep 2, 2014 at 12:43 PM, Jonathan Rochkind <rochkind@jhu.edu> wrote:

> On 9/2/14 1:51 PM, Erick Erickson wrote:
>
>> bq: In my actual index, query "MacBook" is matching ONLY "mac book", and
>> not "macbook"
>>
>> I suspect your query parameters for WordDelimiterFilterFactory don't
>> have catenateWords set.
>>
>> What do you see when you enter these in both the index and query portions
>> of the admin/analysis page?
>>
>
> Thanks Erick!
>
> Our WordDelimiterFilterFactory does have catenate words set, in both index
> and query phases (is that right?):
>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
>
> It's hard to cut and paste the results of the analysis page into email (or
> anywhere!), so I'll give you screenshots instead, sorry. They are for the
> complex field definition from our whole real-world app. I'll also paste in
> our entire field definition below. But I realize my next step is probably
> creating a simpler isolation/reproduction case (unless you have a magic
> answer from this!).
>
> Again, the problem is that "MacBook" seems to be only matching on indexed
> "macbook" and not indexed "mac book".
>
>
> "MacBook" query analysis:
> https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png
>
> "MacBook" index analysis:
> https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png
>
> "mac book" index analysis:
> https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png
>
>
> Our entire actual field definition:
>
>   <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
> autoGeneratePhraseQueries="true">
>       <analyzer>
>        <!-- the rulefiles thing is to keep ICUTokenizerFactory from stripping punctuation,
>             so our synonym filter involving C++ etc can still work.
>             From: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3C51965E70.6070409@elyograg.org%3E
>             the rbbi file is in our local ./conf, copied from lucene source tree -->
>        <tokenizer class="solr.ICUTokenizerFactory"
> rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>
>        <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt"
> ignoreCase="true"/>
>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>
>
>         <!-- folding needs to be after WordDelimiter, so WordDelimiter
>              can do its thing with full cases and such -->
>         <filter class="solr.ICUFoldingFilterFactory" />
>
>
>         <!-- ICUFolding already includes lowercasing, no
>              need for separate lowercasing step
>         <filter class="solr.LowerCaseFilterFactory"/>
>         -->
>
>         <filter class="solr.SnowballPorterFilterFactory"
> language="English" protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>     </fieldType>
>
>
>
>
>
