lucene-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-3413) CombiningFilter to recombine tokens into a single token for sorting
Date Sat, 03 Sep 2011 19:05:10 GMT
CombiningFilter to recombine tokens into a single token for sorting
-------------------------------------------------------------------

                 Key: LUCENE-3413
                 URL: https://issues.apache.org/jira/browse/LUCENE-3413
             Project: Lucene - Java
          Issue Type: New Feature
          Components: modules/analysis
    Affects Versions: 2.9.3
            Reporter: Chris A. Mattmann
            Priority: Minor


I whipped up this CombiningFilter for the following use case:

I've got a bunch of titles (e.g., of books), such as:

The Grapes of Wrath
Tommy Tommerson saves the World
Top of the World
The Tales of Beedle the Bard
Born Free

etc.

I want to sort these titles on a string field whose analysis includes stopword removal (e.g.,
to drop "The") and synonym filtering (e.g., for grouping). I created an analysis chain in Solr
for this, based on *alphaOnlySort*, which looks like this:

{code:xml}
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
   <analyzer>
        <!-- KeywordTokenizer does no actual tokenizing, so the entire
             input string is preserved as a single token
          -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- The LowerCase TokenFilter does what you expect, which can be
             useful when you want your sorting to be case insensitive
          -->
        <filter class="solr.LowerCaseFilterFactory" />
        <!-- The TrimFilter removes any leading or trailing whitespace -->
        <filter class="solr.TrimFilterFactory" />
        <!-- The PatternReplaceFilter gives you the flexibility to use
             Java regular expressions to replace any sequence of characters
             matching a pattern with an arbitrary replacement string, 
             which may include back references to portions of the original
             string matched by the pattern.
             
             See the Java Regular Expression documentation for more
             information on pattern and replacement string syntax.
             
             http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
          -->
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-z])" replacement="" replace="all"
        />
   </analyzer>
</fieldType>

{code}

The issue with alphaOnlySort is that it doesn't support stopword removal or synonyms, because
those filters operate on individual tokens, whereas KeywordTokenizer emits the entire input
string as a single token (it does no tokenization). I needed a filter that would let me switch
the alphaOnlySort analysis chain from KeywordTokenizer to WhitespaceTokenizer, and then
a way to recombine the tokens at the end. So, take "The Grapes of Wrath". I needed a way for
it to get turned into:

{noformat}
grapes of wrath
{noformat}

And then to combine those tokens into a single token:

{noformat}
grapesofwrath
{noformat}

The attached CombiningFilter takes care of that. I'm guessing it isn't very efficient
(I used a StringBuffer), but I'm open to suggestions on how to make it better.
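
For illustration only, here is a minimal plain-Java sketch of the tokenize, filter, and recombine idea described above. It is not the attached CombiningFilter and uses no Lucene or Solr classes. The class name SortKeySketch, the method sortKey, and the tiny stopword list are all hypothetical, chosen just for this example. It uses StringBuilder, the unsynchronized counterpart of StringBuffer, which is usually the cheaper choice when only one thread touches the buffer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SortKeySketch {
    // Hypothetical stopword list for this example only; a real Solr setup
    // would normally load its stopwords from a configured file.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an"));

    // Splits on whitespace, lowercases each token, drops stopwords, strips
    // anything outside a-z, and concatenates what is left into one string.
    public static String sortKey(String title) {
        StringBuilder combined = new StringBuilder();
        for (String token : title.trim().split("\\s+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (lower.isEmpty() || STOPWORDS.contains(lower)) {
                continue; // skip empty input and stopwords
            }
            combined.append(lower.replaceAll("[^a-z]", ""));
        }
        return combined.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortKey("The Grapes of Wrath")); // prints "grapesofwrath"
    }
}
```

In a real token filter the same accumulation would happen across successive tokens of the stream, with the single combined token emitted once the input is exhausted; how the attached filter handles that is not shown here.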

One other thing: this analyzer apparently works fine for analysis (i.e., it produces the
desired tokens), but when sorting in Solr I'm getting null sort tokens. I need to figure
out why.

Here ya go!



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

