lucene-solr-user mailing list archives

From Shawn Heisey <apa...@elyograg.org>
Subject Re: Solr Reference Guide issue for simplified tokenizers
Date Sun, 15 Apr 2018 18:08:15 GMT
On 4/15/2018 5:42 AM, Nikolay Khitrin wrote:
> Given example is
> <analyzer> <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/></analyzer>
> but Lucene's RegExp constructor consumes raw unicode characters instead
> of \t\r\n form, so correct configuration is
> <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ &#x9;&#xA;&#xD;]+"/>

Looks like you're right about that example not working.  I also tried it 
with double backslashes, which would be required if the string appeared 
in actual Java code, and that did not work either.  Your suggested 
replacement DOES work: the XML parser decodes the character references, 
so the literal tab, line feed, and carriage return characters are what 
get passed to the constructor for the tokenizer.

I cannot make any sense out of the Lucene RegExp javadoc.  I think it 
needs some full string examples to illustrate what it is trying to say.

I don't think this is a good example for this particular tokenizer, even 
if it's changed to your replacement that does work.  For what the 
example is TRYING to do, WhitespaceTokenizerFactory is a better choice.  
It will match more whitespace characters than spaces, tabs, and newlines.
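
For reference, a minimal sketch of that alternative.  It assumes the 
tokenizer's default settings are acceptable for the field and has not 
been tuned for any particular schema:

<analyzer>
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>

There is no pattern attribute involved, so the escaping problem never 
comes up.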

Here's a better example for SimplePatternSplitTokenizerFactory.  It 
splits on semicolons and removes leading/trailing whitespace from each 
token:

<analyzer>
   <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern=";"/>
   <filter class="solr.TrimFilterFactory"/>
</analyzer>

Thanks,
Shawn

