lucene-solr-user mailing list archives

From Webster Homer <webster.ho...@milliporesigma.com>
Subject RE: Strange regex behavior in solr.PatternReplaceCharFilterFactory
Date Fri, 27 Sep 2019 14:24:28 GMT
I forgot to mention that I'm using Solr 7.2. I also found that if I use the long form
\p{Letter} instead of \p{L}, Solr fails to load the collection when I reload it after
updating the schema. I think that Solr's regex support is not standard Java 8.
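
As far as I know, PatternReplaceCharFilterFactory compiles its pattern with java.util.regex.Pattern, so the load failure can be checked outside Solr. The sketch below is only that check; the class name PropertyNameCheck and the helper compiles are made up for illustration. Java's Pattern documents \p{L} and \p{IsL} for the letter category, and to my knowledge the long name \p{Letter} is not among the property names it accepts, which would explain the failed reload without Solr's regex support being non-standard.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PropertyNameCheck {
    /** Returns true if java.util.regex accepts the given pattern string. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // Raised for unknown property names, among other syntax errors.
            return false;
        }
    }

    public static void main(String[] args) {
        // Short form, "Is"-prefixed form, and the long form tried in the schema.
        String[] candidates = {"\\p{L}", "\\p{IsL}", "\\p{Letter}"};
        for (String candidate : candidates) {
            System.out.println(candidate + " compiles: " + compiles(candidate));
        }
    }
}
```

If the long form is rejected here as well, the schema has to use \p{L} or \p{IsL}.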

-----Original Message-----
From: Webster Homer <webster.homer@milliporesigma.com>
Sent: Friday, September 27, 2019 9:09 AM
To: solr-user@lucene.apache.org
Subject: Strange regex behavior in solr.PatternReplaceCharFilterFactory

I am developing a new version of a fieldtype that we’ve been using for several years. This
fieldtype is used as part of our autocomplete code. The original version handled standard
ASCII characters well, but I wanted it to handle any Unicode letter: not just A-Za-z
but Greek and Chinese as well. The analysis chain is supposed to remove any character that
is not a letter, digit or space.
I settled on this fieldType. The main change from the old version is that I moved the character
removal from a PatternReplaceFilterFactory call to a PatternReplaceCharFilterFactory. The
problem I’m seeing is in how the two filter factories handle this regex:
([^\p{L}\p{M}\p{Digit} ])
Here is the fieldtype
   <fieldType name="autocomplete_edge_v2" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
         <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
         <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\.,;:-_])" replacement=" "/>
         <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
         <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
      </analyzer>
      <analyzer type="query">
         <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
         <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\.,;:-_])" replacement=" "/>
         <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
         <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
      </analyzer>
   </fieldType>

The problem I’m seeing is that this call:
         <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />

strips out letters that match A-Z. It leaves digits, lowercase letters and Chinese characters.
I tested my regex with a couple of online regex testers and it works there. It seems that only
solr.PatternReplaceCharFilterFactory has this behavior. Here is what I see in the Analyzer.
Using this test term: 12水3-23-ER1:abc
After the PatternReplaceCharFilterFactory (PRCF) I see this: 12水323 1 abc
The “ER” is removed. I think this is a bug, or am I doing something wrong?
I used this link as the source for my regex: https://www.regular-expressions.info/unicode.html
It seems that Solr treats \p{L} as matching only lowercase ASCII letters, while handling
other Unicode characters correctly. For letters in the A-Z range it behaves as if the regex
were \p{Ll}. I tried explicitly adding \p{Lu}, and it made no difference: capital letters
were still stripped.
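
One thing worth ruling out before calling this a bug: inside a Java character class, the `:-_` in the first pattern `([\.,;:-_])` is read as a range from ':' (U+003A) to '_' (U+005F), and that range contains A-Z. The sketch below replays only the two regex replacements with plain java.util.regex. It does not use the Solr classes, it skips the MappingCharFilter step, and the class, method and constant names are made up for illustration. Under those assumptions, the uppercase letters are turned into spaces by the first filter, not stripped by the \p{L} pattern. Moving the hyphen to the end of the class is one way to make it a literal.

```java
import java.util.regex.Pattern;

public class CharFilterReplay {
    // Pattern of the first PatternReplaceCharFilterFactory, exactly as in the fieldType.
    static final Pattern PUNCT_AS_WRITTEN = Pattern.compile("([\\.,;:-_])");
    // Hypothetical variant: hyphen moved last, so it is a literal and not a range operator.
    static final Pattern PUNCT_HYPHEN_LAST = Pattern.compile("([\\.,;:_-])");
    // Pattern of the second PatternReplaceCharFilterFactory.
    static final Pattern NOT_LETTER_DIGIT_SPACE =
            Pattern.compile("([^\\p{L}\\p{M}\\p{Digit} ])");

    /** Applies the punctuation-to-space step, then the strip step, in that order. */
    static String replay(Pattern punct, String input) {
        String spaced = punct.matcher(input).replaceAll(" ");
        return NOT_LETTER_DIGIT_SPACE.matcher(spaced).replaceAll("");
    }

    public static void main(String[] args) {
        // \u6c34 is the Chinese character from the test term, escaped to avoid
        // source-encoding problems: 12<U+6C34>3-23-ER1:abc
        String term = "12\u6c343-23-ER1:abc";

        // 'E' falls inside the range ':'..'_', so the first pattern matches it.
        System.out.println("[:-_] matches E: " + Pattern.matches("[:-_]", "E"));

        // Brackets make the spaces in the results visible.
        System.out.println("as written:  [" + replay(PUNCT_AS_WRITTEN, term) + "]");
        System.out.println("hyphen last: [" + replay(PUNCT_HYPHEN_LAST, term) + "]");
    }
}
```

With the pattern as written, "ER" and ":" become spaces and the hyphens survive until the second step removes them. That matches the Analyzer output above, apart from a doubled space that may have been collapsed on display. With the hyphen last, the term comes out as "12水3 23 ER1 abc".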

This message and any attachment are confidential and may be privileged or otherwise protected
from disclosure. If you are not the intended recipient, you must not copy this message or
attachment or disclose the contents to any other person. If you have received this transmission
in error, please notify the sender immediately and delete the message and any attachment from
your system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not accept liability
for any omissions or errors in this message which may arise as a result of E-Mail-transmission
or for damages resulting from any unauthorized changes of the content of this message and
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not
guarantee that this message is free of viruses and does not accept liability for any damages
caused by any virus transmitted therewith. Click http://www.merckgroup.com/disclaimer to access
the German, French, Spanish and Portuguese versions of this disclaimer.
