lucene-solr-user mailing list archives

From Shawn Heisey <s...@elyograg.org>
Subject Stripping leading/trailing punctuation with SOLR-1653
Date Tue, 31 Aug 2010 14:23:45 GMT
  I am trying to use PatternReplaceCharFilterFactory (SOLR-1653) to 
strip leading and trailing punctuation from terms.  It's not working.  
This was previously discussed here as part of something I was trying 
with WordDelimiterFilterFactory, but I think it needs its own thread now.

I seem to be having two problems, based on what I can see.  The first 
problem is that analysis shows the PatternReplaceCharFilterFactory 
applied in a different order than I have configured it - it's going 
first.  The other problem is that it's eating all my text, leaving any 
fields of that type (which is most of my index!) completely empty.  A 
screenshot showing the issue:

http://www.elyograg.org/punct_analysis.png

Here's my entire fieldType definition. The same thing happens when I 
replace the pattern with a very basic "([0-9]*)(.*)([0-9]*)" and use 
"9dummy" as the input value.

<fieldType name="text" class="solr.TextField" sortMissingLast="true"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
           pattern="(\p{Punct}*)(.*)(\p{Punct}*)"
           replaceWith="$2"
         />
    <filter class="solr.WordDelimiterFilterFactory"
           splitOnCaseChange="1"
           splitOnNumerics="1"
           stemEnglishPossessive="1"
           generateWordParts="1"
           generateNumberParts="1"
           catenateWords="1"
           catenateNumbers="1"
           catenateAll="1"
           preserveOriginal="1"
         />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
           words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
           pattern="(\p{Punct}*)(.*)(\p{Punct}*)"
           replaceWith="$2"
         />
    <filter class="solr.WordDelimiterFilterFactory"
           splitOnCaseChange="1"
           splitOnNumerics="1"
           stemEnglishPossessive="1"
           generateWordParts="1"
           generateNumberParts="1"
           catenateWords="0"
           catenateNumbers="0"
           catenateAll="0"
           preserveOriginal="1"
         />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
           words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
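
For comparison, here is a minimal sketch that runs the same two patterns 
through plain java.util.regex, outside of Solr. The class and method 
names (StripCheck, strip) are made up for illustration, and it assumes 
the replacement behaves like Matcher.replaceAll, which may not be how 
the char filter works internally.

```java
import java.util.regex.Pattern;

public class StripCheck {
    // Hypothetical helper: replace every match of the pattern with its
    // second capture group, the same "$2" replacement used in the config.
    static String strip(String pattern, String input) {
        return Pattern.compile(pattern).matcher(input).replaceAll("$2");
    }

    public static void main(String[] args) {
        String basic = "([0-9]*)(.*)([0-9]*)";
        String punct = "(\\p{Punct}*)(.*)(\\p{Punct}*)";

        // The leading group is stripped as intended.
        System.out.println(strip(basic, "9dummy"));   // prints "dummy"

        // The greedy (.*) runs to the end of the input, so the third
        // group always matches the empty string and nothing trailing
        // is removed.
        System.out.println(strip(basic, "dummy9"));   // prints "dummy9"
        System.out.println(strip(punct, "(hello)"));  // prints "hello)"
    }
}
```

In plain Java the text is not emptied, so the empty fields do not seem 
to come from the regex alone. The sketch does show a separate weakness 
in the pattern: the greedy middle group keeps trailing characters from 
ever being stripped.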

Am I doing something wrong, or is this a bug?

Thanks,
Shawn