lucene-solr-user mailing list archives

From "Benjamin Higgins" <bhiggi...@seattletimes.com>
Subject Problem with camelCase but not casing in general
Date Mon, 07 Jan 2008 22:15:10 GMT
Hi all, I am using a mostly out-of-the-box install of Solr to search
through our code repositories.  I've run into a funny problem where
searches for camelCased text return no results unless the casing
matches exactly.

For example, a query for "getElementById" returns 364 results, but
"getelementbyid" returns 0.

The problem doesn't affect all casings, though.  For example, "function"
and "Function" both return the same number of results, as do
"FUNCTION" and "FUNCtion" (6,278 with my docs).  However, "funcTION"
returns only a few results, and only where the word is actually split up
in the document (e.g. "func tion")!

So it seems that something may be splitting words into separate tokens
wherever the case changes in the middle of them!
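
To make my guess concrete, here is a rough Python sketch of the behavior I think I'm seeing. This is not Solr's code; the function name split_on_case_change is made up for illustration, and I'm assuming the split happens only where a lowercase letter is followed by an uppercase one, with lowercasing applied afterwards.

```python
import re

def split_on_case_change(token):
    """Hypothetical model: break a token at each lowercase-to-uppercase
    boundary, then lowercase every resulting part."""
    # Insert a space at every position preceded by a-z and followed by A-Z.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token)
    return [part.lower() for part in spaced.split()]

for query in ["function", "Function", "FUNCTION", "FUNCtion",
              "funcTION", "getElementById", "getelementbyid"]:
    print(query, "->", split_on_case_change(query))
```

Under that assumption, "funcTION" becomes the two tokens "func" and "tion", while "function", "Function", "FUNCTION" and "FUNCtion" all stay a single token, which fits the result counts above.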

How can I get this to stop?

Thanks!

Ben


Here's the definition for the text field type in my schema.xml:

    <fieldType name="text" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
                synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

