lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: RegexReplaceProcessorFactory pattern to detect multiple \n
Date Wed, 06 Mar 2019 03:37:07 GMT
Hi Paul,

Further to my previous email, in which there was an extra "}" in the
configuration, I have now switched to the configuration below, based on your
suggestion.

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">[ \t]*\r?\n</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>

However, the result still contains more than 2 <br>. In fact, the result
has become worse, as you can see from the comparison below.

Example 1: The sentence for which the previous regex pattern worked
correctly. With the latest pattern, the output has changed from 2 <br> to
5 <br>, which is wrong.
*Original content in EML file:*
Dear Sir,


I am terminating
*Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
*Previous Index content: *    Dear Sir,  <br><br>I am terminating
*Current Index content*:   Dear Sir, <br><br><br><br><br> I
am terminating
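
Example 1 can be reproduced outside Solr with plain java.util.regex. The
sketch below is an illustration only: it assumes the processor applies each
pattern with replace-all semantics and a literal replacement, and the class
and method names are made up for this note. It shows that step 1 emits one
<br> per newline (five here), and that (<br><br>){3,} only matches a run of
six or more <br>, so a run of five passes through step 2 unchanged. A per-tag
pattern such as (<br>){3,} is one possible alternative that does collapse it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of Solr.
final class BrCollapseSketch {

    // Replace every match of regex with the replacement taken literally
    // (assumed to mirror literalReplacement=true).
    static String replaceLiteral(String input, String regex, String replacement) {
        return Pattern.compile(regex)
                .matcher(input)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    // Step 1 from the configuration: each line ending, plus the blanks
    // and tabs directly before it, becomes one <br>.
    static String step1(String s) {
        return replaceLiteral(s, "[ \\t]*\\r?\\n", "<br>");
    }

    // Step 2 exactly as configured: the group is a PAIR of <br>,
    // so {3,} needs at least six consecutive <br> to match.
    static String step2AsConfigured(String s) {
        return replaceLiteral(s, "(<br><br>){3,}", "<br><br>");
    }

    // Alternative step 2 (an assumption, not from the thread's config):
    // the group is a single <br>, so three or more tags collapse to two.
    static String step2PerTag(String s) {
        return replaceLiteral(s, "(<br>){3,}", "<br><br>");
    }

    public static void main(String[] args) {
        String raw = "Dear Sir,  \n\n \n \n\n I am terminating";
        String afterStep1 = step1(raw);
        // Dear Sir,<br><br><br><br><br> I am terminating
        System.out.println(afterStep1);
        // Unchanged: five <br> is fewer than three pairs.
        System.out.println(step2AsConfigured(afterStep1));
        // Dear Sir,<br><br> I am terminating
        System.out.println(step2PerTag(afterStep1));
    }
}
```

In solrconfig.xml the per-tag variant would be written with escaped angle
brackets, i.e. (&lt;br&gt;){3,}, in the same way as the existing pattern.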

Example 2: A sentence for which the regex pattern works only partially
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content in EML file:*

*exalted*

*Psalm 89:17*


3 Choa Chu Kang Avenue 4
*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
Chu Kang Avenue 4, Singapore
*Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
<br><br>3 Choa Chu Kang Avenue 4, Singapore
*Current Index content*: <br><br><br>   Psalm 89:17<br><br>
 <br><br>  3
Choa Chu Kang Avenue 4, Singapore

Example 3: A sentence for which the regex pattern works only partially
(as you can see, instead of 2 <br>, there are 4 <br>). With the latest
configuration, there are now 5 <br>
*Original content in EML file:*

http://www.concorded.com/








On Tue, Dec 18, 2018 at 10:07 AM
*Original content:* http://www.concorded.com/   \n\n   \n\n \n \n\n \n\n
\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at
10:07 AM
*Previous Index content: *http://www.concorded.com/   <br><br>  <br><br>On
Tue, Dec 18, 2018 at 10:07 AM
*Current Index content:* http://www.concorded.com/<br><br>  <br><br><br>
On Tue, Dec 18, 2018 at 10:07 AM
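
The blanks left between the <br> runs in Examples 2 and 3 would be consistent
with characters that neither [ \t] nor Java's default \s matches, for example
a no-break space (U+00A0). That is only a hypothesis and is not confirmed
anywhere in this thread. The sketch below, with made-up class and method
names and a made-up input containing two no-break spaces, shows how to list
the code points in a value and how \s and \h (horizontal whitespace, Java 8
or later) behave on such input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper; not part of Solr.
final class WhitespaceProbe {

    // Without UNICODE_CHARACTER_CLASS, Java's \s does not match U+00A0.
    static final Pattern JAVA_S = Pattern.compile("\\s");
    // \h matches horizontal whitespace, including U+00A0.
    static final Pattern JAVA_H = Pattern.compile("\\h");

    // Pattern from earlier in the thread, and an \h-based variant
    // (the variant is an assumption made for this illustration).
    static final String WITH_S = "(\\s*\\n){2,}";
    static final String WITH_H = "(?:\\h*\\r?\\n){2,}";

    // Lists every character outside printable ASCII (0x21..0x7E) as U+XXXX.
    static String probe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x21 || cp > 0x7E) {
                sb.append(String.format("U+%04X ", cp));
            }
        });
        return sb.toString().trim();
    }

    // Replaces every match of regex with a literal <br><br>.
    static String collapse(String text, String regex) {
        return Pattern.compile(regex)
                .matcher(text)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Made-up input: two no-break spaces sit between the newline pairs.
        String raw = "Psalm 89:17   \n\n\u00A0\u00A0 \n\n  3 Choa";
        System.out.println(probe(raw));
        // Two separate <br><br> pairs remain, split by the no-break spaces.
        System.out.println(collapse(raw, WITH_S));
        // Psalm 89:17<br><br>  3 Choa
        System.out.println(collapse(raw, WITH_H));
    }
}
```

If probe() reports only U+0020, U+0009, U+000D and U+000A for the real
content field, this hypothesis can be ruled out.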


Regards,
Edwin

On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
wrote:

> Hi Paul,
>
> Thank you for the reply.
>
> I have tried to add the following configuration according to your
> suggestion:
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">[ \t]*\r?\n}</str>
>    <str name="replacement">&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
>
> However, none of the \n characters are removed this time round.
> Are the order and the pattern correct?
>
> Regards,
> Edwin
>
> On Tue, 5 Mar 2019 at 19:54, <paul.dodd@ub.unibe.ch> wrote:
>
>> Hi Edwin
>>
>>
>>
>> Try for the first pattern/replacement
>>
>>
>>
>> <str name="pattern">[ \t]*\r?\n</str>
>>
>> <str name="replacement">&lt;br&gt;</str>
>>
>>
>>
>> Now all line endings and preceding whitespace characters should be
>> changed to ‘<br>’.
>>
>>
>>
>> The second pattern replacement should replace 3 or more ‘<br>’ sequences
>> to 2 ‘<br>’ sequences:
>>
>>
>>
>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>
>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>
>>
>>
>> Hope this approach works. Sorry for not replying earlier and best regards,
>>
>> Paul
>>
>>
>>
>>
>>
>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> Windows 10
>>
>>
>>
>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> Sent: Tuesday, 5 March 2019 03:35
>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>>
>>
>> Hi,
>>
>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
>>
>> Regards,
>> Edwin
>>
>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > Does anyone else have other suggestions, or has anyone faced the same
>> > problem?
>> >
>> > Regards,
>> > Edwin
>> >
>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> > wrote:
>> >
>> >> Hi Paul,
>> >>
>> >> If I execute the second step first, I only get a single <br> where
>> >> there were previously 2 <br>.
>> >> Where we originally got 4 <br>, there are now 2 <br> with a space in
>> >> between.
>> >>
>> >> This just changes the 2 <br> into a single <br>, since the second step
>> >> replaces the match with a single <br>.
>> >> But it has not solved the underlying problem yet.
>> >>
>> >> Regards,
>> >> Edwin
>> >>
>> >>
>> >> On Wed, 20 Feb 2019 at 16:41, <paul.dodd@ub.unibe.ch> wrote:
>> >>
>> >>> If the second step is executed first, then you will get the unwanted
>> >>> 4 <br>
>> >>>
>> >>>
>> >>>
>> >>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> >>> Windows 10
>> >>>
>> >>>
>> >>>
>> >>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> >>> Sent: Wednesday, 20 February 2019 09:29
>> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>
>> >>>
>> >>>
>> >>> Hi Jörn ,
>> >>>
>> >>> Do you mean the regex is not correct?
>> >>>
>> >>> We are already using two RegexReplaceProcessorFactory steps, like the
>> >>> ones shown below. The output that we get is still the same.
>> >>>
>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>      <str name="fieldName">content</str>
>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>      <bool name="literalReplacement">true</bool>
>> >>> </processor>
>> >>>
>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>      <str name="fieldName">content</str>
>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>> >>>      <str name="replacement">&lt;br&gt;</str>
>> >>>      <bool name="literalReplacement">true</bool>
>> >>> </processor>
>> >>>
>> >>> Regards,
>> >>> Edwin
>> >>>
>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfranke@gmail.com> wrote:
>> >>>
>> >>> > Then you need two RegexReplaceProcessorFactory steps.
>> >>> >
>> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> >>> > >
>> >>> > > Hi,
>> >>> > >
>> >>> > > Thanks for the reply.
>> >>> > >
>> >>> > > Do you know of any online regex tool that works correctly for Java
>> >>> > > regex? I tried to find some, but they are not working properly.
>> >>> > >
>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and a
>> >>> > > single \n with a single <br>.
>> >>> > >
>> >>> > > Regards,
>> >>> > > Edwin
>> >>> > >
>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com> wrote:
>> >>> > >>
>> >>> > >> Solr uses Java regex matching, so I doubt there is a bug; it would
>> >>> > >> then be in the JDK. Try out your solution in an online regex tool
>> >>> > >> that supports Java regex.
>> >>> > >>
>> >>> > >> I believe you want to have 2 regex processor factories:
>> >>> > >> one that deals with a single \n and one that deals with more than
>> >>> > >> one \n.
>> >>> > >>
>> >>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> >>> > >>>
>> >>> > >>> Hi,
>> >>> > >>>
>> >>> > >>> We have tried the following pattern, ([ \t]*\r?\n){2,}, and
>> >>> > >>> configuration:
>> >>> > >>>
>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>> > >>>  <str name="fieldName">content</str>
>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>> > >>>  <bool name="literalReplacement">true</bool>
>> >>> > >>> </processor>
>> >>> > >>>
>> >>> > >>> However, the issue is still occurring.
>> >>> > >>>
>> >>> > >>> Is anyone else able to help?
>> >>> > >>>
>> >>> > >>> Regards,
>> >>> > >>> Edwin
>> >>> > >>>
>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> >>> > >>> wrote:
>> >>> > >>>
>> >>> > >>>> Hi,
>> >>> > >>>>
>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>> >>> > >>>>
>> >>> > >>>> Regards,
>> >>> > >>>> Edwin
>> >>> > >>>>
>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> >>> > >>>> wrote:
>> >>> > >>>>
>> >>> > >>>>> Hi,
>> >>> > >>>>>
>> >>> > >>>>> Should we report this as a bug in Solr?
>> >>> > >>>>>
>> >>> > >>>>> Regards,
>> >>> > >>>>> Edwin
>> >>> > >>>>>
>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> >>> > >>>>> wrote:
>> >>> > >>>>>
>> >>> > >>>>>> Hi Paul,
>> >>> > >>>>>>
>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it
>> >>> > >>>>>> on https://regex101.com/, it gives us the correct result for all
>> >>> > >>>>>> the examples (i.e. all of them only have <br><br>, and not more
>> >>> > >>>>>> than that, unlike what we are getting in Solr in our earlier
>> >>> > >>>>>> examples).
>> >>> > >>>>>>
>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
>> >>> > >>>>>>
>> >>> > >>>>>> Regards,
>> >>> > >>>>>> Edwin
>> >>> > >>>>>>
>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> >>> > >>>>>> wrote:
>> >>> > >>>>>>
>> >>> > >>>>>>> Hi Paul,
>> >>> > >>>>>>>
>> >>> > >>>>>>> We have tried it with the space preceding the \n, i.e.
>> >>> > >>>>>>> <str name="pattern">(\s*\n){2,}</str>, with the following
>> >>> > >>>>>>> configuration:
>> >>> > >>>>>>>
>> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>> > >>>>>>>  <str name="fieldName">content</str>
>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>> >>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>> > >>>>>>> </processor>
>> >>> > >>>>>>>
>> >>> > >>>>>>> However, we are still getting exactly the same results as in the
>> >>> > >>>>>>> earlier Examples 1, 2 and 3.
>> >>> > >>>>>>>
>> >>> > >>>>>>> As for your point 2, that the data may contain other (non-printing)
>> >>> > >>>>>>> characters than \n, we have found that there are no non-printing
>> >>> > >>>>>>> characters. It is just a new line with a space. You can refer to
>> >>> > >>>>>>> the original content in the same examples below.
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>> Example 1: The sentence that the above
regex pattern is
>> working
>> >>> > >>>>>>> correctly
>> >>> > >>>>>>> *Original content in EML file:*
>> >>> > >>>>>>> Dear Sir,
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>> I am terminating
>> >>> > >>>>>>> *Original content:*    Dear Sir, 
\n\n \n \n\n I am
>> terminating
>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I
am terminating
>> >>> > >>>>>>>
>> >>> > >>>>>>> Example 2: The sentence that the above
regex pattern is
>> >>> partially
>> >>> > >>>>>>> working (as you can see, instead of
2 <br>, there are 4
>> <br>)
>> >>> > >>>>>>> *Original content in EML file:*
>> >>> > >>>>>>>
>> >>> > >>>>>>> *exalted*
>> >>> > >>>>>>>
>> >>> > >>>>>>> *Psalm 89:17*
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>> >>> > >>>>>>> *Original content:* exalted  \n \n\n
  Psalm 89:17   \n\n
>> >>>  \n\n  3
>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm
89:17   <br><br>
>> >>> <br><br>3
>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>>
>> >>> > >>>>>>> Example 3: The sentence that the above
regex pattern is
>> >>> partially
>> >>> > >>>>>>> working (as you can see, instead of
2 <br>, there are 4
>> <br>)
>> >>> > >>>>>>> *Original content in EML file:*
>> >>> > >>>>>>>
>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>> >>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
>>  \n\n
>> >>> >  \n\n
>> >>> > >> \n
>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n
\n\n \n\n \n\n\n \n\n\n
>> On
>> >>> Tue,
>> >>> > >> Dec 18,
>> >>> > >>>>>>> 2018 at 10:07 AM
>> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>  <br><br>
>> >>> > >>>>>>> <br><br>On Tue, Dec 18,
2018 at 10:07 AM
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>> Appreciate any other ideas or suggestions
that you may have.
>> >>> > >>>>>>>
>> >>> > >>>>>>> Thank you.
>> >>> > >>>>>>>
>> >>> > >>>>>>> Regards,
>> >>> > >>>>>>> Edwin
>> >>> > >>>>>>>
>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch> wrote:
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Hi Edwin
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong; the space should precede the \n,
>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>> >>> > >>>>>>>> 2.  Perhaps the data contains other (non-printing) characters
>> >>> > >>>>>>>> than \n?
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> >>> > >>>>>>>> Windows 10
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> >>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>> >>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Hi Paul,
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> We have tried the suggested regex pattern as follows:
>> >>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>> > >>>>>>>>  <str name="fieldName">content</str>
>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>> >>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>> > >>>>>>>> </processor>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> But we still have exactly the same problem in Examples 1, 2 and 3
>> >>> > >>>>>>>> below.
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Example 1: The sentence that the
above regex pattern is
>> >>> working
>> >>> > >>>>>>>> correctly
>> >>> > >>>>>>>> *Original content:*    Dear Sir,
 \n\n \n \n\n I am
>> >>> terminating
>> >>> > >>>>>>>> *Index content: *    Dear Sir,
 <br><br>I am terminating
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Example 2: The sentence that the
above regex pattern is
>> >>> partially
>> >>> > >>>>>>>> working
>> >>> > >>>>>>>> (as you can see, instead of 2
<br>, there are 4 <br>)
>> >>> > >>>>>>>> *Original content:* exalted  \n
\n\n   Psalm 89:17   \n\n
>> >>>  \n\n
>> >>> > 3
>> >>> > >>>>>>>> Choa
>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm
89:17   <br><br>
>> >>> > <br><br>3
>> >>> > >>>>>>>> Choa
>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Example 3: The sentence that the
above regex pattern is
>> >>> partially
>> >>> > >>>>>>>> working
>> >>> > >>>>>>>> (as you can see, instead of 2
<br>, there are 4 <br>)
>> >>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
>>  \n\n
>> >>> >  \n\n
>> >>> > >>>>>>>> \n \n\n
>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n
\n\n \n\n \n\n\n \n\n\n  On
>> >>> Tue, Dec
>> >>> > >> 18,
>> >>> > >>>>>>>> 2018
>> >>> > >>>>>>>> at 10:07 AM
>> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>  <br><br>
>> >>> > >>>>>>>> <br><br>On
>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Any further suggestions?
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Thank you.
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Regards,
>> >>> > >>>>>>>> Edwin
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch> wrote:
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> To avoid «\n+\s*» matching too many \n and then failing on the
>> >>> > >>>>>>>>> {2,} part, you could try
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> If you also want to match CRLF, then
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> >>> > >>>>>>>>> Windows 10
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> >>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>> >>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Hi Paul,
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Thanks for your reply.
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> When I use this pattern:
>> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>> >>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>> > >>>>>>>>> </processor>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> It works for some sentences within the same content and not for
>> >>> > >>>>>>>>> others. Please see below for one sentence that works and others
>> >>> > >>>>>>>>> that do not (they work only partially):
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Example 1: The sentence that
the above regex pattern is
>> >>> working
>> >>> > >>>>>>>> correctly
>> >>> > >>>>>>>>> *Original content:*    Dear
Sir,  \n\n \n \n\n I am
>> >>> terminating
>> >>> > >>>>>>>>> *Index content: *    Dear
Sir,  <br><br>I am terminating
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Example 2: The sentence that
the above regex pattern is
>> >>> partially
>> >>> > >>>>>>>> working
>> >>> > >>>>>>>>> (as you can see, instead of
2 <br>, there are 4 <br>)
>> >>> > >>>>>>>>> *Original content:* exalted
 \n \n\n   Psalm 89:17   \n\n
>> >>> >  \n\n  3
>> >>> > >>>>>>>> Choa
>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>>>> *Index content: *exalted 
<br><br>Psalm 89:17   <br><br>
>> >>> > <br><br>3
>> >>> > >>>>>>>> Choa
>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Example 3: The sentence that
the above regex pattern is
>> >>> partially
>> >>> > >>>>>>>> working
>> >>> > >>>>>>>>> (as you can see, instead of
2 <br>, there are 4 <br>)
>> >>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
>>  \n\n
>> >>> > >> \n\n
>> >>> > >>>>>>>> \n
>> >>> > >>>>>>>>> \n\n
>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n
\n\n \n\n \n\n\n \n\n\n  On
>> >>> Tue,
>> >>> > Dec
>> >>> > >>>>>>>> 18, 2018
>> >>> > >>>>>>>>> at 10:07 AM
>> >>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>> >>>  <br><br>
>> >>> > >>>>>>>> <br><br>On
>> >>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07
AM
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> We would appreciate your help in seeing what is wrong.
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Thank you.
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Regards,
>> >>> > >>>>>>>>> Edwin
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not working. I assume
>> >>> > >>>>>>>>>> nothing is replaced? Perhaps the pattern should be
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> ??
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> >>> > >>>>>>>>>> Windows 10
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> >>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>> >>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> Hi,
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove two
>> >>> > >>>>>>>>>> or more \n with any number of spaces between them (e.g. \n\n,
>> >>> > >>>>>>>>>> \n \n, \n \n \n \n), and replace them with two <br>.
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> I use the following regex pattern, and it works when I test it on
>> >>> > >>>>>>>>>> regex101.com. But it does not work when I put it inside the
>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>> >>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>> >>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>> > >>>>>>>>>> </processor>
>> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> To explain my regex pattern further: \s* instructs the regex to
>> >>> > >>>>>>>>>> match any \n that has whitespace after it, and {2,} instructs the
>> >>> > >>>>>>>>>> regex to match 2 or more occurrences of that pattern (\n).
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how I should do it.
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> Regards,
>> >>> > >>>>>>>>>> Edwin
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>
>> >>> > >>
>> >>> >
>> >>>
>> >>
>>
>
