From solr-user-return-146680-archive-asf-public=cust-asf.ponee.io@lucene.apache.org  Wed Mar  6 03:37:36 2019
Return-Path: <solr-user-return-146680-archive-asf-public=cust-asf.ponee.io@lucene.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by mx-eu-01.ponee.io (Postfix) with SMTP id AF354180648
	for <archive-asf-public@cust-asf.ponee.io>; Wed,  6 Mar 2019 04:37:35 +0100 (CET)
Received: (qmail 24152 invoked by uid 500); 6 Mar 2019 03:37:32 -0000
Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:solr-user-help@lucene.apache.org>
List-Unsubscribe: <mailto:solr-user-unsubscribe@lucene.apache.org>
List-Post: <mailto:solr-user@lucene.apache.org>
List-Id: <solr-user.lucene.apache.org>
Reply-To: solr-user@lucene.apache.org
Delivered-To: mailing list solr-user@lucene.apache.org
Received: (qmail 24140 invoked by uid 99); 6 Mar 2019 03:37:31 -0000
Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142)
    by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 06 Mar 2019 03:37:31 +0000
Received: from localhost (localhost [127.0.0.1])
	by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 52BDE18217C
	for <solr-user@lucene.apache.org>; Wed,  6 Mar 2019 03:37:31 +0000 (UTC)
Received: from mx1-lw-us.apache.org ([10.40.0.8])
	by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024)
	with ESMTP id sidSizpuIApM for <solr-user@lucene.apache.org>;
	Wed,  6 Mar 2019 03:37:27 +0000 (UTC)
Received: from mail-lj1-f173.google.com (mail-lj1-f173.google.com [209.85.208.173])
	by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id 1A1645FE0A
	for <solr-user@lucene.apache.org>; Wed,  6 Mar 2019 03:37:27 +0000 (UTC)
Received: by mail-lj1-f173.google.com with SMTP id d14so9545195ljl.9
        for <solr-user@lucene.apache.org>; Tue, 05 Mar 2019 19:37:27 -0800 (PST)
MIME-Version: 1.0
References: <CAF2DzVVTvm9Pn9+4AvjvR53vZhQJfTejSsOn2=86HB8=fORmpQ@mail.gmail.com>
 <bbc56e84961a4eb58f980165649f6cb1@ub.unibe.ch> <CAF2DzVXD3zcvLst8ztSqCDkxwk7onkXLz5+7ayLjJKyX6BCUUA@mail.gmail.com>
 <d9d3168d531e469c82fb9cbdd5fbc3c1@ub.unibe.ch> <CAF2DzVWfgaVKD3CV0X+TewSgkVr-Vcu=ry=ACr-vUDmX6-m4wA@mail.gmail.com>
 <229624995fcb45259c17334800b38daf@ub.unibe.ch> <CAF2DzVWXyJMR1VQy946erTRzWPnV6uGH6TVaM9TFPufF5a8hkw@mail.gmail.com>
 <CAF2DzVUbf64TqkjjC3Kmp9_K6wcT4CHtcR6OaD83qbp3XjdLnA@mail.gmail.com>
 <CAF2DzVX-7JCQHLnYteNFNr+zf_28Qetvf7qqeFeXEed0e5xzQA@mail.gmail.com>
 <CAF2DzVX-XyR6-yjvaaBsBvpyunwEsKBGFd1wosGLRiWwRsuL8Q@mail.gmail.com>
 <CAF2DzVVGe_qWFQ1_DYrw9ZWVRgo2djZi3OWgW3JD6+BBUKsmhA@mail.gmail.com>
 <36750F82-16A9-4208-9859-6BB16C9EAB2B@gmail.com> <CAF2DzVVu30qu4+PQxHQy7EgepZp0nD=AziDva3Pg7jPQBZbFVw@mail.gmail.com>
 <DB4A0EE6-7A48-4B0B-B1F8-B7BF5EF26E08@gmail.com> <CAF2DzVVucW8TXaQc-RXN6VV6a6ZPSqjzcRt6bKONHNkffGbL-Q@mail.gmail.com>
 <8dd7bdfab0bd4f3082ecf35b4cfb201a@ub.unibe.ch> <CAF2DzVXPekQKA+YjuhR2Yfrf7JqNYLq8BSWW4oFvV5S3RWvcXA@mail.gmail.com>
 <CAF2DzVUZ90XjLE4NLwKwsDHvsa0r4Moy5Ubqytuk00NnL6-jAA@mail.gmail.com>
 <CAF2DzVWH3Lg+Bw70coFkb61bCVvJY5xWUmK+zXhT-LVTq0YGqA@mail.gmail.com>
 <0d5fde6ec1f94119936dacaa0892c104@ub.unibe.ch> <CAF2DzVWXRvNnuvLgWncN252jeDpp0qrCnX-r5=ref+ZciH98BQ@mail.gmail.com>
In-Reply-To: <CAF2DzVWXRvNnuvLgWncN252jeDpp0qrCnX-r5=ref+ZciH98BQ@mail.gmail.com>
From: Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
Date: Wed, 6 Mar 2019 11:37:07 +0800
Message-ID: <CAF2DzVXKj5QRgKoEjg=nx0E+-=12NoHLRTrXwmmykf9r-Np+sg@mail.gmail.com>
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary="000000000000f299c7058364b495"

--000000000000f299c7058364b495
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Paul,

Further to my previous email, in which there was an extra "}" in the
configuration, I have now switched to the configuration below, based on your
suggestion.

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">[ \t]*\r?\n</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>

However, the result still contains more than 2 <br>. In fact, the
result has become worse, as you can see from the comparison below.

Example 1: A sentence that the previous regex pattern handled correctly.
With the latest pattern, the output has changed from 2 <br> to 5 <br>,
which is wrong.
*Original content in EML file:*
Dear Sir,


I am terminating
*Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
*Previous Index content: *    Dear Sir,  <br><br>I am terminating
*Current Index content*:   Dear Sir, <br><br><br><br><br> I am terminating
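
To make Example 1 reproducible outside Solr, here is a minimal Java sketch of
the two-step chain. It assumes that RegexReplaceProcessorFactory with
literalReplacement=true behaves like java.util.regex replaceAll with a quoted
replacement. The class name BrChainSketch and the alternative second pattern
(<br>){3,} are illustrative assumptions and have not been tried in Solr.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class: mimics the two-processor chain with plain java.util.regex.
class BrChainSketch {

    // One processor step: replace every match of `regex` with the literal `replacement`
    // (assumed to correspond to literalReplacement=true).
    static String step(String input, String regex, String replacement) {
        return Pattern.compile(regex)
                .matcher(input)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    // Step 1 turns each line ending (plus preceding spaces/tabs) into <br>;
    // step 2 replaces every run matched by `secondPattern` with <br><br>.
    static String chain(String input, String secondPattern) {
        String afterFirst = step(input, "[ \\t]*\\r?\\n", "<br>");
        return step(afterFirst, secondPattern, "<br><br>");
    }

    public static void main(String[] args) {
        String example1 = "Dear Sir,  \n\n \n \n\n I am terminating";

        // Second pattern as configured: the group is a PAIR of <br>,
        // so {3,} only matches runs of six or more <br>.
        System.out.println(chain(example1, "(<br><br>){3,}"));

        // Assumed alternative: quantify a single <br>, so three or more collapse to two.
        System.out.println(chain(example1, "(<br>){3,}"));
    }
}
```

Under these assumptions the first call leaves all five <br> in place, while
the second call reduces them to two.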

Example 2: A sentence that the previous regex pattern handled only partially
(as you can see, instead of 2 <br>, there were 4 <br>)
*Original content in EML file:*

*exalted*

*Psalm 89:17*


3 Choa Chu Kang Avenue 4
*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
Chu Kang Avenue 4, Singapore
*Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
<br><br>3 Choa Chu Kang Avenue 4, Singapore
*Current Index content*: <br><br><br>   Psalm 89:17<br><br>  <br><br>  3
Choa Chu Kang Avenue 4, Singapore
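
In the current output for Example 2, some whitespace survives between the two
<br><br> pairs, even though [ \t]* should consume spaces and tabs before a
line ending. One possible explanation (an assumption, not verified anywhere in
this thread) is that the content contains whitespace other than U+0020 and
U+0009. A minimal Java sketch for checking this is below; it shows every
character outside printable ASCII as its code point. The class and method
names are hypothetical, and the sample string is invented.

```java
// Hypothetical debugging helper, not part of Solr.
class CodePointDump {

    // Returns `text` with every character outside printable ASCII (U+0020..U+007E)
    // replaced by a visible [U+XXXX] marker.
    static String reveal(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp >= 0x20 && cp <= 0x7E) {
                out.appendCodePoint(cp);
            } else {
                out.append(String.format("[U+%04X]", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Invented sample with an assumed non-breaking space (U+00A0) between the line feeds.
        System.out.println(reveal("Psalm 89:17 \n\n\u00A0\n\n 3 Choa"));
    }
}
```

Running the raw field value through such a helper before indexing would show
whether anything other than plain spaces sits between the line feeds.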

Example 3: A sentence that the previous regex pattern handled only partially
(as you can see, instead of 2 <br>, there were 4 <br>). With the latest
configuration, there are now 5 <br>
*Original content in EML file:*

http://www.concorded.com/


On Tue, Dec 18, 2018 at 10:07 AM
*Original content:* http://www.concorded.com/   \n\n   \n\n \n \n\n \n\n
\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at
10:07 AM
*Previous Index content: *http://www.concorded.com/   <br><br>  <br><br>On
Tue, Dec 18, 2018 at 10:07 AM
*Current Index content:* http://www.concorded.com/<br><br>  <br><br><br>
On Tue, Dec 18, 2018 at 10:07 AM
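
If the whitespace left between the <br> groups is not a plain space or tab
(for example a non-breaking space, which is only a guess), a single pattern
built on \h, Java's horizontal-whitespace class (Java 8 and later), might
match the whole run in one step. The sketch below is an illustration under
that assumption only: the sample string containing U+00A0 is invented, the
pattern (?:\h*\r?\n){2,} does not come from this thread, and it has not been
tried inside RegexReplaceProcessorFactory.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison of two patterns on an invented sample string.
class HorizontalWhitespaceSketch {

    // Replaces every match of `regex` with the literal text <br><br>.
    static String collapse(String input, String regex) {
        return Pattern.compile(regex)
                .matcher(input)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Invented sample: two non-breaking spaces (U+00A0) sit between the blank lines.
        String sample = "Psalm 89:17   \n\n\u00A0\u00A0\n\n  3 Choa";

        // Pattern from the earlier attempt: [ \t] does not match U+00A0,
        // so the run is split into two separate matches.
        System.out.println(collapse(sample, "([ \\t]*\\r?\\n){2,}"));

        // Assumed alternative: \h also matches U+00A0, so the run is one match.
        System.out.println(collapse(sample, "(?:\\h*\\r?\\n){2,}"));
    }
}
```

On this invented sample the first pattern produces two <br><br> pairs with
the non-breaking spaces left between them, and the second produces a single
<br><br>.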


Regards,
Edwin

On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
wrote:

> Hi Paul,
>
> Thank you for the reply.
>
> I have tried to add the following configuration according to your
> suggestion:
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">[ \t]*\r?\n}</str>
>    <str name="replacement">&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
>
> However, none of the \n is being removed this time round.
> Is the order and/or the pattern correct?
>
> Regards,
> Edwin
>
> On Tue, 5 Mar 2019 at 19:54, <paul.dodd@ub.unibe.ch> wrote:
>
>> Hi Edwin
>>
>>
>>
>> Try for the first pattern/replacement
>>
>>
>>
>> <str name="pattern">[ \t]*\r?\n</str>
>>
>> <str name="replacement">&lt;br&gt;</str>
>>
>>
>>
>> Now all line endings and preceding whitespace characters should be
>> changed to ‘<br>’.
>>
>>
>>
>> The second pattern replacement should replace 3 or more ‘<br>’ sequences
>> to 2 ‘<br>’ sequences:
>>
>>
>>
>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>
>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>
>>
>>
>> Hope this approach works. Sorry for not replying earlier and best regards,
>>
>> Paul
>>
>>
>>
>>
>>
>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> Windows 10
>>
>>
>>
>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> Sent: Tuesday, 5 March 2019 03:35
>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>>
>>
>> Hi,
>>
>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
>>
>> Regards,
>> Edwin
>>
>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > Does anyone else have other suggestions, or has anyone faced the same problem?
>> >
>> > Regards,
>> > Edwin
>> >
>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> > wrote:
>> >
>> >> Hi Paul,
>> >>
>> >> If I execute the second step first, then I only get a single <br>
>> >> for those that previously had 2 <br>.
>> >> For those where we originally got 4 <br>, there will be 2 <br> with a
>> >> space in between.
>> >>
>> >> This just changes the 2 <br> to a single <br>, since the second
>> >> step replaces with a single <br>.
>> >> But it has not solved the underlying problem yet.
>> >>
>> >> Regards,
>> >> Edwin
>> >>
>> >>
>> >> On Wed, 20 Feb 2019 at 16:41, <paul.dodd@ub.unibe.ch> wrote:
>> >>
>> >>> If the second step is executed first, then you will get the unwanted 4
>> >>> <br>
>> >>>
>> >>>
>> >>>
>> >>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> >>> Windows 10
>> >>>
>> >>>
>> >>>
>> >>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> >>> Sent: Wednesday, 20 February 2019 09:29
>> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>
>> >>>
>> >>>
>> >>> Hi Jörn,
>> >>>
>> >>> Do you mean the regex is not correct?
>> >>>
>> >>> We are already using two RegexReplaceProcessorFactory steps, like the
>> >>> ones shown below. The output that we get is still the same.
>> >>>
>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>      <str name="fieldName">content</str>
>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>      <bool name="literalReplacement">true</bool>
>> >>> </processor>
>> >>>
>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>      <str name="fieldName">content</str>
>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>> >>>      <str name="replacement">&lt;br&gt;</str>
>> >>>      <bool name="literalReplacement">true</bool>
>> >>> </processor>
>> >>>
>> >>> Regards,
>> >>> Edwin
>> >>>
>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfranke@gmail.com> wrote:
>> >>>
>> >>> > Then you need two regexprocessfactory steps
>> >>> >
>> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> >>> > >
>> >>> > > Hi,
>> >>> > >
>> >>> > > Thanks for the reply.
>> >>> > >
>> >>> > > Do you know of any online regex tool that works correctly for Java
>> >>> > > regex?
>> >>> > > I tried to find some, but they are not working properly.
>> >>> > >
>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and a
>> >>> > > single \n with a single <br>.
>> >>> > >
>> >>> > > Regards,
>> >>> > > Edwin
>> >>> > >
>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com> wrote:
>> >>> > >>
>> >>> > >> Solr uses Java regex matching, so I doubt there is a bug - it would
>> >>> > >> then be in the JDK. Try out your solution in an online regex tool
>> >>> > >> that supports Java regex.
>> >>> > >>
>> >>> > >> I believe you want to have 2 regex processor factories:
>> >>> > >> one that deals with a single \n and one that deals with more than
>> >>> > >> one \n
>> >>> > >>
>> >>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> >>> > >>>
>> >>> > >>> Hi,
>> >>> > >>>
>> >>> > >>> We have tried the following pattern, ([ \t]*\r?\n){2,}, with this
>> >>> > >>> configuration:
>> >>> > >>>
>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>> > >>>  <str name="fieldName">content</str>
>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>> > >>>  <bool name="literalReplacement">true</bool>
>> >>> > >>> </processor>
>> >>> > >>>
>> >>> > >>> However, the issue is still occurring.
>> >>> > >>>
>> >>> > >>> Is anyone else able to help?
>> >>> > >>>
>> >>> > >>> Regards,
>> >>> > >>> Edwin
>> >>> > >>>
>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> >>> > >>> wrote:
>> >>> > >>>
>> >>> > >>>> Hi,
>> >>> > >>>>
>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>> >>> > >>>>
>> >>> > >>>> Regards,
>> >>> > >>>> Edwin
>> >>> > >>>>
>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> >>> > >>>> wrote:
>> >>> > >>>>
>> >>> > >>>>> Hi,
>> >>> > >>>>>
>> >>> > >>>>> Should we report this as a bug in Solr?
>> >>> > >>>>>
>> >>> > >>>>> Regards,
>> >>> > >>>>> Edwin
>> >>> > >>>>>
>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> >>> > >>>>> wrote:
>> >>> > >>>>>
>> >>> > >>>>>> Hi Paul,
>> >>> > >>>>>>
>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
>> >>> > >>>>>> https://regex101.com/, it gives us the correct result for all
>> >>> > >>>>>> the examples (i.e. all of them have only <br><br>, and not more
>> >>> > >>>>>> than that, unlike what we are getting in Solr in our earlier examples).
>> >>> > >>>>>>
>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
>> >>> > >>>>>>
>> >>> > >>>>>> Regards,
>> >>> > >>>>>> Edwin
>> >>> > >>>>>>
>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> >>> > >>>>>> wrote:
>> >>> > >>>>>>
>> >>> > >>>>>>> Hi Paul,
>> >>> > >>>>>>>
>> >>> > >>>>>>> We have tried it with the space preceding the \n, i.e.
>> >>> > >>>>>>> <str name="pattern">(\s*\n){2,}</str>, with the following
>> >>> > >>>>>>> configuration:
>> >>> > >>>>>>>
>> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>> > >>>>>>>  <str name="fieldName">content</str>
>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>> >>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>> > >>>>>>> </processor>
>> >>> > >>>>>>>
>> >>> > >>>>>>> However, we are getting exactly the same results as in the earlier
>> >>> > >>>>>>> Examples 1, 2 and 3.
>> >>> > >>>>>>>
>> >>> > >>>>>>> As for your point 2, that the data may contain other (non-printing)
>> >>> > >>>>>>> characters than \n: we have found no non-printing
>> >>> > >>>>>>> characters. It is just a new line with a space. You can refer to the
>> >>> > >>>>>>> original content in the same examples below.
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>> Example 1: A sentence for which the above regex pattern is working
>> >>> > >>>>>>> correctly
>> >>> > >>>>>>> *Original content in EML file:*
>> >>> > >>>>>>> Dear Sir,
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>> I am terminating
>> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>> >>> > >>>>>>>
>> >>> > >>>>>>> Example 2: A sentence for which the above regex pattern is partially
>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>> >>> > >>>>>>> *Original content in EML file:*
>> >>> > >>>>>>>
>> >>> > >>>>>>> *exalted*
>> >>> > >>>>>>>
>> >>> > >>>>>>> *Psalm 89:17*
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>> >>> > >>>>>>> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>>
>> >>> > >>>>>>> Example 3: A sentence for which the above regex pattern is partially
>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>> >>> > >>>>>>> *Original content in EML file:*
>> >>> > >>>>>>>
>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>> >>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>> >>> > >>>>>>> 2018 at 10:07 AM
>> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>> Appreciate any other ideas or suggestions that you may have.
>> >>> > >>>>>>>
>> >>> > >>>>>>> Thank you.
>> >>> > >>>>>>>
>> >>> > >>>>>>> Regards,
>> >>> > >>>>>>> Edwin
>> >>> > >>>>>>>
>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch> wrote:
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Hi Edwin
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong, the space should precede the \n
>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>> >>> > >>>>>>>> 2.  Perhaps in the data you have other (non-printing) characters
>> >>> > >>>>>>>> than \n?
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> >>> > >>>>>>>> Windows 10
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> >>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>> >>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Hi Paul,
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> We have tried this suggested regex pattern as follows:
>> >>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>> > >>>>>>>>  <str name="fieldName">content</str>
>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>> >>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>> > >>>>>>>> </processor>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> But we still have exactly the same problem as in Examples 1, 2 and 3
>> >>> > >>>>>>>> below.
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Example 1: A sentence for which the above regex pattern is working
>> >>> > >>>>>>>> correctly
>> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Example 2: A sentence for which the above regex pattern is partially
>> >>> > >>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>> >>> > >>>>>>>> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Example 3: A sentence for which the above regex pattern is partially
>> >>> > >>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>> >>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
>> >>> > >>>>>>>> at 10:07 AM
>> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>> >>> > >>>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Any further suggestions?
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Thank you.
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Regards,
>> >>> > >>>>>>>> Edwin
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch> wrote:
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
>> >>> > >>>>>>>>> {2,} part you could try
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> If you also want to match CRLF then
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> >>> > >>>>>>>>> Windows 10
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> >>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>> >>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Hi Paul,
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Thanks for your reply.
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> When I use this pattern:
>> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>> >>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>> > >>>>>>>>> </processor>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> It works for some sentences within the same content but not for
>> >>> > >>>>>>>>> others. Please see below for one that works and others that only
>> >>> > >>>>>>>>> partially work:
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Example 1: A sentence for which the above regex pattern is working
>> >>> > >>>>>>>>> correctly
>> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Example 2: A sentence for which the above regex pattern is partially
>> >>> > >>>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>> >>> > >>>>>>>>> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Example 3: A sentence for which the above regex pattern is partially
>> >>> > >>>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>> >>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
>> >>> > >>>>>>>>> at 10:07 AM
>> >>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>> >>> > >>>>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> We would appreciate your help in finding out what is wrong.
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Thank you.
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Regards,
>> >>> > >>>>>>>>> Edwin
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not working. I assume
>> >>> > >>>>>>>>>> nothing is replaced? Perhaps the pattern should be
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> ??
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> >>> > >>>>>>>>>> Windows 10
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> >>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>> >>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> Hi,
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove more than
>> >>> > >>>>>>>>>> two \n with any number of spaces between them (Eg: \n\n, \n \n, \n \n \n
>> >>> > >>>>>>>>>> \n), and replace them with two <br>.
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> I use the following regex pattern and it is working when I test it in
>> >>> > >>>>>>>>>> regex101.com. But it is not working when I put it inside the
>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>> >>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>> >>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>> > >>>>>>>>>> </processor>
>> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> To explain further about my regex pattern, \s* is instructing the regex
>> >>> > >>>>>>>>>> to match any \n that has spaces after it, and {2,} is instructing the
>> >>> > >>>>>>>>>> regex to match 2 or more occurrences of such a pattern (\n).
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how I should do it.
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> Regards,
>> >>> > >>>>>>>>>> Edwin
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>
>> >>> > >>
>> >>> >
>> >>>
>> >>
>>
>

--000000000000f299c7058364b495--