lucene-solr-user mailing list archives

From Dmitry Kan <dmitry....@gmail.com>
Subject Re: searching camel cased terms with phrase queries
Date Thu, 08 Nov 2012 19:07:16 GMT
Thanks, Jack. This filter should help with user input that lacks clear
lexical boundaries, i.e. by breaking would-be compound words into sub-words
on the query side. It still requires mining a dictionary, but that is
doable with some "simple" camel case term frequency analysis.
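
As a rough illustration of what such a camel case term frequency analysis
could look like, here is a minimal Python sketch. It is not Solr code: the
function name mine_subwords, the regular expression and the thresholds are
assumptions chosen for the example (min_len=4 merely mirrors
minSubwordSize="4" in the schema below). It splits camel cased terms at case
changes, counts the lowercased parts, and keeps the parts seen at least
min_count times as dictionary candidates.

```python
import re
from collections import Counter

# Hypothetical splitter: an upper-case run not followed by a lower-case
# letter (e.g. "TV"), or an optionally capitalised lower-case run ("Smart").
CAMEL_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def mine_subwords(terms, min_len=4, min_count=2):
    """Return lowercased camel case parts that occur at least min_count times."""
    counts = Counter()
    for term in terms:
        parts = CAMEL_PART.findall(term)
        if len(parts) < 2:
            continue  # no case change inside the term, nothing to learn from it
        counts.update(p.lower() for p in parts if len(p) >= min_len)
    return sorted(word for word, n in counts.items() if n >= min_count)


# Made-up sample terms, for illustration only.
sample = ["SmartTV", "SmartPhone", "PriceWaterHouseCoopers", "WaterHouse", "smarttv"]
print(mine_subwords(sample))  # ['house', 'smart', 'water']
```

The resulting words could then be written to the dictionary file, assuming
the usual one-word-per-line format.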

But would it really help to match against the indexed data?

I tried it with Solr 4.0.0-BETA (hopefully not too different from the stable
4.0 release in this respect):

The text field in the schema (a slightly modified "text_general" type: WDF
and DCWTF added, with LCF placed in between them; english-common-nouns.txt
is from
http://www.typo3-media.com/fileadmin/files/wordlists/english-common-nouns.txt
with the word 'rice' removed so that the example below makes more sense):


    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="0" catenateWords="1"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
                dictionary="wordlists/english-common-nouns.txt" minWordSize="5"
                minSubwordSize="4" maxSubwordSize="15" onlyLongestMatch="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
                synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- this filter can remove any duplicate tokens that appear at the
             same position - sometimes possible with WordDelimiterFilter in
             conjunction with stemming. -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="0" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
                dictionary="wordlists/english-common-nouns.txt" minWordSize="5"
                minSubwordSize="4" maxSubwordSize="15" onlyLongestMatch="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <!-- this filter can remove any duplicate tokens that appear at the
             same position - sometimes possible with WordDelimiterFilter in
             conjunction with stemming. -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>



index:

product for PricewaterhouseCoopers company is this!

query:

"product for Pricewaterhousecoopers company is this!"

I believe there is no match here, judging by the terms and their positions
on the analysis page. Is this some misconfiguration? I included DCWTF on the
query side as well, as opposed to e.g. the approach described at
http://www.typo3-media.com/blog/solr-noun-expansion.html, in order to
account for compound words that users type without lexical boundaries.
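
To make the "terms and their positions" argument concrete, here is a minimal
Python sketch of the position check a phrase query performs. It is a
simplified model, not Lucene code: the function names are made up, slop is
ignored, and the query is assumed to have exactly one term per position. The
token positions are taken from the "SmartTV for me" analysis output quoted
further down in this thread.

```python
def index_positions(tokens):
    """Map each lowercased term to the set of positions it occupies."""
    index = {}
    for position, term in tokens:
        index.setdefault(term.lower(), set()).add(position)
    return index


def phrase_matches(doc_tokens, query_terms):
    """True if the query terms occur at consecutive positions in the document."""
    if not query_terms:
        return False
    index = index_positions(doc_tokens)
    terms = [t.lower() for t in query_terms]
    return any(
        all(start + i in index.get(term, ()) for i, term in enumerate(terms))
        for start in index.get(terms[0], ())
    )


# Index-side tokens for "SmartTV for me" with catenateWords=0.
without_catenation = [(1, "SmartTV"), (1, "Smart"), (2, "TV"), (3, "for"), (4, "me")]
# With catenateWords=1 the catenated "SmartTV" is added at position 2.
with_catenation = without_catenation + [(2, "SmartTV")]

query = ["smarttv", "for", "me"]
print(phrase_matches(without_catenation, query))  # False
print(phrase_matches(with_catenation, query))     # True
```

In this model the phrase only matches when some occurrence of the first
query term is followed by the remaining terms at the next positions, which
is why the extra catenated token at position 2 makes the difference.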

-- Dmitry


On Thu, Nov 8, 2012 at 5:04 PM, Jack Krupansky <jack@basetechnology.com> wrote:

> I forgot to mention DictionaryCompoundWordTokenFilterFactory. It does
> require you to create a dictionary of terms, as opposed to using the terms
> that have been encountered in the index.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Jack Krupansky
> Sent: Wednesday, November 07, 2012 8:14 AM
> To: solr-user@lucene.apache.org
> Subject: Re: searching camel cased terms with phrase queries
>
>
> This is one of those areas of Solr where you can refine and make
> improvements, as you have done, but never actually reach 100% satisfaction.
> And, in some cases, as here, you have a choice of settings and no single
> combination covers all cases.
>
> In this case, you really need compound-term recognition - detecting that
> two or more terms have been juxtaposed with no lexical boundary. Google
> has it, and I'm sure some Solr users have implemented it on their own, but
> it isn't in Solr proper, yet.
>
> WDF provides a partial approximation, by generating extra, compound terms
> at index time. That works well when ALL of the terms are written together,
> but not when only a subset are written together without lexical
> boundaries, as in your final example.
>
> So, you COULD go the full Google route with a lot of additional effort, or
> accept that you offer only a reasonable approximation. Your choice.
>
> So, pick the approximation which seems "best" and accept that it doesn't
> handle the other cases.
>
> BTW, the proper name is "PricewaterhouseCoopers".
>
> -- Jack Krupansky
>
> -----Original Message----- From: Dmitry Kan
> Sent: Wednesday, November 07, 2012 1:58 AM
> To: solr-user@lucene.apache.org
> Subject: searching camel cased terms with phrase queries
>
> Hello list,
>
> There have apparently been a number of threads about handling camel cased
> words in the past (
> http://search-lucene.com/?q=camel+case&fc_project=Lucene&fc_project=Solr
> ).
> Our case is somewhat different from them.
>
> ===================
> Configuration & example
> ===================
>
> To illustrate the issue, let me give you a real example from our data.
> Suppose there is a term in the original text: SmartTV.
>
> If a user types "SmartTV" or "smart tv", we want both to hit the original
> term SmartTV. In order to achieve this, the following filter is used in
> our Solr 3.4 schema:
>
> index side:
>
>              <filter class="solr.WordDelimiterFilterFactory"
>                generateWordParts="1"
>                generateNumberParts="0"
>                catenateWords="0"
>                catenateNumbers="0"
>                catenateAll="0"
>                preserveOriginal="1"
>                spiltOnCaseChange="1"
>              />
>
> query side:
>
>              <filter class="solr.WordDelimiterFilterFactory"
>                generateWordParts="1"
>                generateNumberParts="0"
>                catenateWords="0"
>                catenateNumbers="0"
>                catenateAll="0"
>                preserveOriginal="1"
>                spiltOnCaseChange="1"
>              />
>
> (no differences)
>
> Copying from the analysis page, the index will contain the following terms
> and their positions:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset  type
> 1         SmartTV    0            7          <ALPHANUM>
> 1         Smart      0            5          <ALPHANUM>
> 2         TV         5            7          <ALPHANUM>
>
> (StandardTokenizerFactory and StandardFilterFactory precede this filter,
> but since they have no effect in this case, their output is omitted.)
>
> On the query side, the query "smart tv" gets processed as follows:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset  type
> 1         smart      0            5          <ALPHANUM>
> 2         tv         6            8          <ALPHANUM>
>
> So there is a match (the LowerCaseFilterFactory is of course configured to
> follow the WordDelimiterFilterFactory to unify case for matching), and the
> user can happily fire off the queries 'smart tv', 'smarttv' and 'SmartTV'.
>
> ===================================================
> More complex example that doesn't work with the above configuration
> ===================================================
>
> Problems start to occur, if a user types "smarttv for me" against the text
> "SmartTV for me". Here are the index and query analysis excerpts:
>
> index:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset  type
> 1         SmartTV    0            7          <ALPHANUM>
> 1         Smart      0            5          <ALPHANUM>
> 2         TV         5            7          <ALPHANUM>
> 3         for        8            11         <ALPHANUM>
> 4         me         12           14         <ALPHANUM>
>
> query:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset  type
> 1         smarttv    0            7          <ALPHANUM>
> 2         for        8            11         <ALPHANUM>
> 3         me         12           14         <ALPHANUM>
>
> Since smarttv was written in lower case in the user query, no split on
> case change is triggered, and we believe there is no match because the
> term positions differ: 'for' is at the 3rd position in the index but at
> the 2nd position in the query, so 'smarttv' and 'for' are not adjacent in
> the index, as the phrase query requires.
>
>
> =========================
> Config change to fix the problem
> =========================
>
>
> Here catenateWords=1 on the indexing side comes to the rescue, which
> changes things to:
>
> index:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=1,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset  type
> 1         SmartTV    0            7          <ALPHANUM>
> 1         Smart      0            5          <ALPHANUM>
> 2         TV         5            7          <ALPHANUM>
> 2         SmartTV    0            7          <ALPHANUM>
> 3         for        8            11         <ALPHANUM>
> 4         me         12           14         <ALPHANUM>
>
> query (copying again for comparison purposes):
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset  type
> 1         smarttv    0            7          <ALPHANUM>
> 2         for        8            11         <ALPHANUM>
> 3         me         12           14         <ALPHANUM>
>
> Now there should be a match, because the terms 'smarttv', 'for' and 'me'
> are adjacent in the index (ignoring the case differences, as
> LowerCaseFilterFactory unifies them for us), and that is what the phrase
> query "smarttv for me" requires.
>
> ====================
> Problem we couldn't solve
> ====================
>
> As we saw above, catenateWords merges the maximal run of compound term
> parts into one term and aligns the resulting concatenated term with the
> last term part. An illustration with artificial camel casing:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=1,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text               startOffset  endOffset  type
> 1         PriceWaterHouseCoopers  0            22         <ALPHANUM>
> 1         Price                   0            5          <ALPHANUM>
> 2         Water                   5            10         <ALPHANUM>
> 3         House                   10           15         <ALPHANUM>
> 4         Coopers                 15           22         <ALPHANUM>
> 4         PriceWaterHouseCoopers  0            22         <ALPHANUM>
>
> The following text and query will not match each other: text='product for
> PriceWaterHouseCoopers company', query="product for PricewaterHouseCoopers
> company":
>
> index:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=1,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text               startOffset  endOffset  type
> 1         product                 0            7          <ALPHANUM>
> 2         for                     8            11         <ALPHANUM>
> 3         PriceWaterHouseCoopers  12           34         <ALPHANUM>
> 3         Price                   12           17         <ALPHANUM>
> 4         Water                   17           22         <ALPHANUM>
> 5         House                   22           27         <ALPHANUM>
> 6         Coopers                 27           34         <ALPHANUM>
> 6         PriceWaterHouseCoopers  12           34         <ALPHANUM>
> 7         company                 35           42         <ALPHANUM>
>
> query:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text               startOffset  endOffset  type
> 1         product                 1            8          <ALPHANUM>
> 2         for                     9            12         <ALPHANUM>
> 3         PricewaterHouseCoopers  13           35         <ALPHANUM>
> 3         Pricewater              13           23         <ALPHANUM>
> 4         House                   23           28         <ALPHANUM>
> 5         Coopers                 28           35         <ALPHANUM>
> 6         company                 36           43         <ALPHANUM>
>
> Is there any way to make them match?
>
> Thanks for reading this far.
>
> -dmitry
>



-- 
Regards,

Dmitry Kan
