lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Timothy Potter <thelabd...@gmail.com>
Subject Having an issue with the solr.PatternReplaceCharFilterFactory not replacing characters correctly
Date Sun, 24 Jun 2012 00:11:40 GMT
Using 3.5 (also tried trunk), I have the following charFilter defined
on my fieldType (just extended text_general to keep things simple):

<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(\w)\1{2,}+"
            replaceWith="$1$1"/>

The intent of this charFilter is to match any characters that are
repeated in a string more than twice and collapse down to a max of
two, i.e.

fooobarrrr  =>  foobarr

Using the analysis form, I end up with: fba

Here is the full <fieldType> definition (just the one addition of the
leading <charFilter>):

    <fieldType name="text_general" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory"
           pattern="(\w)\1{2,}+"
           replaceWith="$1$1"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

It seems like my regex and replacement strategy should work ... to
prove it, I wrote a little Regex.java class in which I borrowed some
from the PatternReplaceCharFilter class ... when I execute the
following with my little hack, I get the expected results:

[~/dev]$ java Regex "(\\w)\\1{2,}+" fooobarrrr "\$1\$1"
result: foobarr

Is this a known issue or does anyone know how to work-around this? If
not, I'll open a JIRA but wanted to check here first.

Cheers,
Tim

>>>> Regex.java  <<<<

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Regex {
    public static void main(String[] args) throws Exception {
        String toCompile = args[0];
        Pattern p = Pattern.compile(toCompile);
        System.out.println("result: "+processPattern(p, args[1], args[2]));
    }

  // borrowed from PatternReplaceCharFilter.java
  private static CharSequence processPattern(Pattern pattern,
CharSequence input, String replacement) {
    final Matcher m = pattern.matcher(input);

    final StringBuffer cumulativeOutput = new StringBuffer();
    int cumulative = 0;
    int lastMatchEnd = 0;
    while (m.find()) {
      final int groupSize = m.end() - m.start();
      final int skippedSize = m.start() - lastMatchEnd;
      lastMatchEnd = m.end();
      final int lengthBeforeReplacement = cumulativeOutput.length() +
skippedSize;
      m.appendReplacement(cumulativeOutput, replacement);
      final int replacementSize = cumulativeOutput.length() -
lengthBeforeReplacement;
      if (groupSize != replacementSize) {
        if (replacementSize < groupSize) {
          cumulative += groupSize - replacementSize;
          int atIndex = lengthBeforeReplacement + replacementSize;
          //System.err.println(atIndex + "!" + cumulative);
          //addOffCorrectMap(atIndex, cumulative);
        }
      }
    }
    m.appendTail(cumulativeOutput);
    return cumulativeOutput;
  }
}

Mime
View raw message