lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Strange regex behavior in solr.PatternReplaceCharFilterFactory
Date Fri, 27 Sep 2019 19:46:47 GMT
Solr’s pattern replace _is_ Java’s. See PatternReplaceCharFilter. You’ll see:

private final Pattern pattern;

and later:
final Matcher m = pattern.matcher(input);

That said, there’s some manipulation after that, so there’s always room for issues. But I’d try a plain Java program with your regex to verify, rather than relying on online testers.
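
A minimal sketch of such a standalone check, assuming the two patterns and the test term quoted further down in this thread. The class and method names (CharFilterRegexCheck, replacePunct, stripNonLetters) are made up for illustration, and plain String replacement only approximates the char filters, since it skips Solr's offset handling. One thing it makes visible: inside a Java character class, :-_ is a range from ':' (U+003A) to '_' (U+005F), which covers A-Z, so the first pattern may be the one touching the capitals.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not part of Solr.
class CharFilterRegexCheck {
    // First PatternReplaceCharFilterFactory pattern from the quoted fieldType.
    static final Pattern PUNCT = Pattern.compile("([\\.,;:-_])");
    // Second pattern: anything that is not a letter, mark, ASCII digit or space.
    static final Pattern NON_LETTER = Pattern.compile("([^\\p{L}\\p{M}\\p{Digit} ])");

    // Mirrors replacement=" " on the first char filter.
    public static String replacePunct(String input) {
        return PUNCT.matcher(input).replaceAll(" ");
    }

    // Mirrors replacement="" on the second char filter.
    public static String stripNonLetters(String input) {
        return NON_LETTER.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Test term from the thread; \u6c34 is the Chinese character in it.
        String term = "12\u6c343-23-ER1:abc";

        // Second pattern on its own.
        System.out.println("second only : [" + stripNonLetters(term) + "]");

        // Both patterns in the order the analyzer applies them.
        String afterFirst = replacePunct(term);
        System.out.println("after first : [" + afterFirst + "]");
        System.out.println("after second: [" + stripNonLetters(afterFirst) + "]");
    }
}
```

Run on its own, the second pattern leaves "ER" in place; the capitals only turn into spaces once the first pattern has run. Escaping the hyphen or moving it to the end of the class (for example [.,;:_-]) makes it a literal instead of a range.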

Best,
Erick

> On Sep 27, 2019, at 2:24 PM, Jörn Franke <jornfranke@gmail.com> wrote:
> 
> Check the log files on the collection reload.
> About your regex: check it with a tester that targets Java regexes. There can be subtle differences between Java, JavaScript, PHP, etc.
> It could also be that your original text is not UTF-8 encoded, but uses a Windows encoding or similar.
> Also check whether you have special characters in the text (line breaks, tabs, etc.).
> 
>> On 27.09.2019 at 16:42, Webster Homer <webster.homer@milliporesigma.com> wrote:
>> 
>> I forgot to mention that I'm using Solr 7.2. I also found that if I use the long form \p{Letter} instead of \p{L}, Solr will not load the collection when I reload it after updating the schema. I think that Solr's regex support is not standard Java 8.
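
The failed reload would also be consistent with plain java.util.regex behavior: Java writes general categories as \p{L} or \p{IsL}, and the long name needs the Is prefix (\p{IsLetter}). A small sketch to check this outside Solr; the class and method names are hypothetical:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper; reports whether java.util.regex accepts a pattern.
class PropertyNameCheck {
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // Thrown for unknown property names, among other syntax errors.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {"\\p{L}", "\\p{IsLetter}", "\\p{Letter}"};
        for (String regex : candidates) {
            System.out.println(regex + " compiles: " + compiles(regex));
        }
    }
}
```

If the long form is rejected here as well, the reload failure comes from Java's pattern syntax rather than from anything Solr-specific.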
>> 
>> -----Original Message-----
>> From: Webster Homer <webster.homer@milliporesigma.com>
>> Sent: Friday, September 27, 2019 9:09 AM
>> To: solr-user@lucene.apache.org
>> Subject: Strange regex behavior in solr.PatternReplaceCharFilterFactory
>> 
>> I am developing a new version of a fieldtype that we’ve been using for several years. This fieldtype is used as part of our autocomplete code. The original version handled standard ASCII characters well, but I wanted it to handle any Unicode letter: not just A-Za-z but Greek and Chinese as well. The analysis chain is supposed to remove any character that is not a letter, digit or space.
>> I settled on this fieldType. The main change from the old version is that I moved the character removal from a PatternReplaceFilterFactory call to a PatternReplaceCharFilterFactory. The problem I’m seeing is in how the two filter factories handle this regex:
>> ([^\p{L}\p{M}\p{Digit} ])
>> Here is the fieldtype
>> <fieldType name="autocomplete_edge_v2" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\.,;:-_])" replacement=" "/>
>>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
>>     <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\.,;:-_])" replacement=" "/>
>>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
>>     <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
>>   </analyzer>
>> </fieldType>
>> 
>> The problem I’m seeing is that the call:
>>        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>> 
>> strips out letters in the range A-Z. It leaves digits, lowercase letters and Chinese characters. I tested my regex with a couple of online regex testers and it works there. It seems that only solr.PatternReplaceCharFilterFactory has this behavior. Here is what I see in the Analyzer using this test term: 12水3-23-ER1:abc
>> After the PRCF I see this: 12水323 1 abc
>> The “ER” is removed. I think this is a bug, or am I doing something wrong?
>> I used this link as the source for my regex: https://www.regular-expressions.info/unicode.html
>> It seems that Solr is treating \p{L} as matching only lowercase ASCII letters, while handling other Unicode characters correctly. For letters in the A-Z range it behaves as if the regex were \p{Ll}. I tried explicitly adding \p{Lu} and it made no difference: capital letters were still stripped.
>> 
>> This message and any attachment are confidential and may be privileged or otherwise
protected from disclosure. If you are not the intended recipient, you must not copy this message
or attachment or disclose the contents to any other person. If you have received this transmission
in error, please notify the sender immediately and delete the message and any attachment from
your system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not accept liability
for any omissions or errors in this message which may arise as a result of E-Mail-transmission
or for damages resulting from any unauthorized changes of the content of this message and
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not
guarantee that this message is free of viruses and does not accept liability for any damages
caused by any virus transmitted therewith. Click http://www.merckgroup.com/disclaimer to access
the German, French, Spanish and Portuguese versions of this disclaimer.

