lucene-solr-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: [Simplified my question] How to enhance solr.StandardTokenizerFactory? (was: Why is Standard Tokenizer not separating at this comma?)
Date Wed, 24 May 2017 21:05:10 GMT
Hi Robert,

Two possibilities come to mind:

1. Use a char filter factory (runs before the tokenizer) to convert commas between digits
to spaces, e.g. PatternReplaceCharFilterFactory <https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.PatternReplaceCharFilterFactory>.
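2. Use WordDelimiterFilterFactory <https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter>,
a token filter that runs after the tokenizer and can split a token like `a-6,000123` at its
non-alphanumeric characters.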
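
For option 1, here is a rough sketch of the kind of regex that could go into the char filter's `pattern` and `replacement` attributes. It is plain Java, not Solr code, and the pattern, class name, and method name are my own illustration rather than anything tested against a Solr 3.6 schema. It only shows what the input would look like by the time it reaches the tokenizer.

```java
import java.util.regex.Pattern;

public class CommaBetweenDigits {
    // Hypothetical pattern (an assumption, not taken from the Solr docs):
    // a digit, then a comma that is immediately followed by another digit.
    // The lookahead leaves the second digit unconsumed, so a run such as
    // "1,2,3" has both of its commas replaced.
    static final Pattern COMMA_BETWEEN_DIGITS = Pattern.compile("(\\d),(?=\\d)");

    // Replace each digit-comma-digit comma with a space, keeping the digits.
    static String normalize(String input) {
        return COMMA_BETWEEN_DIGITS.matcher(input).replaceAll("$1 ");
    }

    public static void main(String[] args) {
        // No comma sits between two digits here, so the text is unchanged.
        System.out.println(normalize("bob,a-z,000123,xyz"));
        // The comma in "6,000" becomes a space.
        System.out.println(normalize("bob,a-6,000123,xyz"));
    }
}
```

With this pattern the second example becomes `bob,a-6 000123,xyz` before tokenization, so the comma no longer glues `6` and `000123` together. Whether the resulting tokens are exactly what you want still depends on the rest of the analyzer chain, so check it on the Analysis page of the admin UI.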

--
Steve
www.lucidworks.com

> On May 24, 2017, at 4:19 PM, Robert Hume <rhume55@gmail.com> wrote:
> 
> Hi,
> 
> Following up on my last email question ... I've learned more and I
> simplified my question ...
> 
> I have a Solr 3.6 deployment.  Currently I'm using
> solr.StandardTokenizerFactory to parse tokens during indexing.
> 
> Here are two example streams that demonstrate my issue:
> 
> Example 1: `bob,a-z,000123,xyz` produces tokens ... `|bob|a-z|000123|xyz|`
> ... which is good.
> 
> Example 2: `bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|`
> ... which is not good because users can't search by "000123".
> 
> It seems StandardTokenizerFactory treats the "6,000" differently (like it's
> currency or a product number, maybe?) so it doesn't tokenize at the comma.
> 
> QUESTION: How can I enhance StandardTokenizer to do everything it's doing
> now plus produce a couple of additional tokens like this ...
> 
> `bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|a-6|000123|`
> 
> ... so users can search by "000123"?
> 
> Thanks!
> Rob

