lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3413) CombiningFilter to recombine tokens into a single token for sorting
Date Sun, 04 Sep 2011 07:00:10 GMT


Simon Willnauer commented on LUCENE-3413:

bq. I'll prepare two patches. One for Lucene that implements your suggestions. And another
for Solr (containing the super trivial factory to instantiate this).
you can do it in one patch :)

> CombiningFilter to recombine tokens into a single token for sorting
> -------------------------------------------------------------------
>                 Key: LUCENE-3413
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 2.9.3
>            Reporter: Chris A. Mattmann
>            Priority: Minor
>         Attachments: LUCENE-3413.Mattmann.090311.2.patch, LUCENE-3413.Mattmann.090311.patch.txt
> I whipped up this CombiningFilter for the following use case:
> I've got a bunch of titles (e.g., of books), such as:
> The Grapes of Wrath
> Tommy Tommerson saves the World
> Top of the World
> The Tales of Beedle the Bard
> Born Free
> etc.
> I want to sort these titles using a String field that includes stopword analysis (e.g.,
to remove "The"), and synonym filtering (e.g., for grouping), etc. I created an analysis chain
in Solr for this that was based off of *alphaOnlySort*, which looks like this:
> {code:xml}
> <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
>    <analyzer>
>         <!-- KeywordTokenizer does no actual tokenizing, so the entire
>              input string is preserved as a single token
>           -->
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <!-- The LowerCase TokenFilter does what you expect, which can be
>              useful when you want your sorting to be case insensitive
>           -->
>         <filter class="solr.LowerCaseFilterFactory" />
>         <!-- The TrimFilter removes any leading or trailing whitespace -->
>         <filter class="solr.TrimFilterFactory" />
>         <!-- The PatternReplaceFilter gives you the flexibility to use
>              Java Regular expression to replace any sequence of characters
>              matching a pattern with an arbitrary replacement string, 
>              which may include back references to portions of the original
>              string matched by the pattern.
>              See the Java Regular Expression documentation for more
>              information on pattern and replacement string syntax.
>           -->
>         <filter class="solr.PatternReplaceFilterFactory"
>                 pattern="([^a-z])" replacement="" replace="all"
>         />
>    </analyzer>
> </fieldType>
> {code}
> The issue with alphaOnlySort is that it doesn't support stopword removal or synonyms, because
those filters operate on individual tokens, not on the single full-string token that the
KeywordTokenizer produces (it does no tokenization). I needed a filter that would let me switch
alphaOnlySort's analysis chain from KeywordTokenizer to WhitespaceTokenizer, plus
a way to recombine the tokens at the end. So, take "The Grapes of Wrath". I needed a way for
it to get turned into:
> {noformat}
> grapes of wrath
> {noformat}
> And then to combine those tokens into a single token:
> {noformat}
> grapesofwrath
> {noformat}
> The attached CombiningFilter takes care of that. It doesn't do it super efficiently I'm
guessing (since I used a StringBuffer), but I'm open to suggestions on how to make it better.
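
For illustration only, the recombination step described above can be sketched in plain Java
without any Lucene types. This is not the attached CombiningFilter: the class name
SortKeySketch, the method combine, and the one-word stopword list are all assumptions made
up for this example, and synonym handling is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, NOT the attached CombiningFilter: plain Java, no Lucene types.
final class SortKeySketch {

    // Assumed stopword list; a real chain would take this from the stop filter's config.
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the"));

    static String combine(String title) {
        StringBuilder key = new StringBuilder();
        // Split on whitespace, standing in for WhitespaceTokenizer.
        for (String token : title.trim().split("\\s+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            // Per-token stopword removal, which only works because the title was split first.
            if (STOPWORDS.contains(lower)) {
                continue;
            }
            // Same character class as the alphaOnlySort pattern: drop everything outside a-z.
            key.append(lower.replaceAll("[^a-z]", ""));
        }
        // The surviving tokens, concatenated into one sort key.
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(combine("The Grapes of Wrath"));          // prints grapesofwrath
        System.out.println(combine("The Tales of Beedle the Bard")); // prints talesofbeedlebard
    }
}
```

The sketch uses StringBuilder rather than StringBuffer; StringBuilder is unsynchronized, so
it is the usual choice when a buffer is only touched by one thread, which is assumed here.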

> One other thing: this analyzer apparently works fine for analysis (i.e., it produces
the desired tokens); however, when sorting in Solr I'm getting null sort tokens. I still need
to figure out why.
> Here ya go!

This message is automatically generated by JIRA.