From: Zheng Lin Edwin Yeo
Date: Wed, 6 Mar 2019 11:37:07 +0800
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
To: solr-user@lucene.apache.org

Hi Paul,

Further to my previous email, in which there was an extra "}" in the configuration, I have changed to the configuration below, based on your suggestion.

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">[ \t]*\r?\n</str>
  <str name="replacement">&lt;br&gt;</str>
  <bool name="literalReplacement">true</bool>
</processor>

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
  <bool name="literalReplacement">true</bool>
</processor>

However, the result that I get still has more than 2 <br>. In fact, the result has become worse, as you can see from the comparison below.

Example 1: The sentence for which the regex pattern used to work correctly. With the latest pattern, it has changed from 2 <br> to 5 <br>, which is wrong.
*Original content in EML file:*
Dear Sir,


I am terminating
*Original content:* Dear Sir, \n\n \n \n\n I am terminating
*Previous Index content:* Dear Sir, <br><br> I am terminating
*Current Index content:* Dear Sir, <br><br><br><br><br> I am terminating

Example 2: The sentence for which the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>).
*Original content in EML file:*

*exalted*

*Psalm 89:17*


3 Choa Chu Kang Avenue 4
*Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
*Previous Index content:* exalted <br><br> Psalm 89:17 <br><br><br><br> 3 Choa Chu Kang Avenue 4, Singapore
*Current Index content:* exalted <br><br><br> Psalm 89:17 <br><br><br><br> 3 Choa Chu Kang Avenue 4, Singapore

Example 3: The sentence for which the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>). With the latest configuration, there are now 5 <br>.
*Original content in EML file:*

http://www.concorded.com/








On Tue, Dec 18, 2018 at 10:07 AM
*Original content:* http://www.concorded.com/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
*Previous Index content:* http://www.concorded.com/ <br><br><br><br> On Tue, Dec 18, 2018 at 10:07 AM
*Current Index content:* http://www.concorded.com/ <br><br><br><br><br> On Tue, Dec 18, 2018 at 10:07 AM

Regards,
Edwin

On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo wrote:

> Hi Paul,
>
> Thank you for the reply.
>
> I have tried to add the following configuration according to your suggestion:
>
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">[ \t]*\r?\n}</str>
>   <str name="replacement">&lt;br&gt;</str>
>   <bool name="literalReplacement">true</bool>
> </processor>
>
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>   <bool name="literalReplacement">true</bool>
> </processor>
>
> However, none of the \n is being removed this time round.
> Is the order and/or the pattern correct?
>
> Regards,
> Edwin
>
> On Tue, 5 Mar 2019 at 19:54, wrote:
>
>> Hi Edwin
>>
>> Try for the first pattern/replacement
>>
>> [ \t]*\r?\n
>>
>> <br>
>>
>> Now all line endings and preceding whitespace characters should be changed to ‘<br>’.
>>
>> The second pattern/replacement should replace 3 or more ‘<br>’ sequences with 2 ‘<br>’ sequences:
>>
>> (<br><br>){3,}
>>
>> <br><br>
>>
>> Hope this approach works. Sorry for not replying earlier and best regards,
>>
>> Paul
>>
>> From: Zheng Lin Edwin Yeo
>> Sent: Tuesday, 5 March 2019 03:35
>> To: solr-user@lucene.apache.org
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>> Hi,
>>
>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
>>
>> Regards,
>> Edwin
>>
>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo wrote:
>>
>>> Hi,
>>>
>>> Does anyone else have other suggestions, or has anyone faced the same problem?
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> If I execute the second step first, then I only get a single <br> for those with 2 <br>.
>>>> For those where we originally got 4 <br>, there will be 2 <br> with a space in between.
>>>>
>>>> This is just changing the 2 <br> to a single <br>, since the second step replaces with a single <br>.
>>>> But it has not solved the underlying problem yet.
>>>>
>>>> Regards,
>>>> Edwin
>>>>
>>>> On Wed, 20 Feb 2019 at 16:41, wrote:
>>>>
>>>>> If the second step is executed first, then you will get the unwanted 4 <br>
>>>>>
>>>>> From: Zheng Lin Edwin Yeo
>>>>> Sent: Wednesday, 20 February 2019 09:29
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>
>>>>> Hi Jörn,
>>>>>
>>>>> Do you mean the regex is not correct?
>>>>>
>>>>> We are already using two RegexReplaceProcessorFactory steps, like the ones shown below. The output that we get is still the same.
>>>>>
>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>   <str name="fieldName">content</str>
>>>>>   <str name="pattern">([ \t]*\r?\n){2,}</str>
>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>   <bool name="literalReplacement">true</bool>
>>>>> </processor>
>>>>>
>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>   <str name="fieldName">content</str>
>>>>>   <str name="pattern">([ \t]*\r?\n){1,}</str>
>>>>>   <str name="replacement">&lt;br&gt;</str>
>>>>>   <bool name="literalReplacement">true</bool>
>>>>> </processor>
>>>>>
>>>>> Regards,
>>>>> Edwin
>>>>>
>>>>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke wrote:
>>>>>
>>>>>> Then you need two regexprocessfactory steps
>>>>>>
>>>>>>> On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thanks for the reply.
>>>>>>>
>>>>>>> Do you know of any online regex tool that works correctly for Java regex?
>>>>>>> I tried to find some, but they are not working properly.
>>>>>>>
>>>>>>> Yes, our plan is to replace more than one \n with <br><br>, and a single \n with a single <br>.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Edwin
>>>>>>>
>>>>>>>> On Wed, 20 Feb 2019 at 14:59, Jörn Franke wrote:
>>>>>>>>
>>>>>>>> Solr uses Java regex matching, so I doubt there is a bug - it would then be in the JDK. Try out your solution in an online regex tool that supports Java regex.
>>>>>>>>
>>>>>>>> I believe you want to have 2 regex process factories:
>>>>>>>> One that deals with single \n and one that deals with more than one \n
>>>>>>>>
>>>>>>>>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We have tried the following pattern ([ \t]*\r?\n){2,} and configuration:
>>>>>>>>>
>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>>   <str name="fieldName">content</str>
>>>>>>>>>   <str name="pattern">([ \t]*\r?\n){2,}</str>
>>>>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>>>>   <bool name="literalReplacement">true</bool>
>>>>>>>>> </processor>
>>>>>>>>>
>>>>>>>>> However, the issue is still occurring.
>>>>>>>>>
>>>>>>>>> Is anyone else able to help?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Edwin
>>>>>>>>>
>>>>>>>>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Edwin
>>>>>>>>>>
>>>>>>>>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Should we report this as a bug in Solr?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Edwin
>>>>>>>>>>>
>>>>>>>>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on https://regex101.com/, it gives us the correct result for all the examples (i.e. all of them only have <br><br>, and not more than that like what we are getting in Solr in our earlier examples).
>>>>>>>>>>>>
>>>>>>>>>>>> Could there be a possibility of a bug in Solr?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Edwin
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We have tried it with the space preceding the \n, i.e. <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
>>>>>>>>>>>>>
>>>>>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>>>>>>   <str name="fieldName">content</str>
>>>>>>>>>>>>>   <str name="pattern">(\s*\n){2,}</str>
>>>>>>>>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>>>>>>>> </processor>
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, we are getting exactly the same results as in the earlier Examples 1, 2 and 3.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As for your point 2, that perhaps the data has other (non-printing) characters than \n, we have found that there are no non-printing characters. It is just a new line with a space. You can refer to the original content in the same examples below.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Example 1: The sentence for which the above regex pattern is working correctly
>>>>>>>>>>>>> *Original content in EML file:*
>>>>>>>>>>>>> Dear Sir,
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am terminating
>>>>>>>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>>>>>>>>>>>> *Index content:* Dear Sir, <br><br> I am terminating
>>>>>>>>>>>>>
>>>>>>>>>>>>> Example 2: The sentence for which the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>>>>>>> *Original content in EML file:*
>>>>>>>>>>>>>
>>>>>>>>>>>>> *exalted*
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Psalm 89:17*
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3 Choa Chu Kang Avenue 4
>>>>>>>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>>>>>>>>>>>>> *Index content:* exalted <br><br> Psalm 89:17 <br><br><br><br> 3 Choa Chu Kang Avenue 4, Singapore
>>>>>>>>>>>>>
>>>>>>>>>>>>> Example 3: The sentence for which the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>>>>>>> *Original content in EML file:*
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://www.concordpri.moe.edu.sg/
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>>>>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br><br><br> On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>>>>>>>>
>>>>>>>>>>>>> We would appreciate any other ideas or suggestions that you may have.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Edwin
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 7 Feb 2019 at 22:49, wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Edwin
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Sorry, the pattern was wrong, the space should precede the \n, i.e. (\s*\n){2,}
>>>>>>>>>>>>>> 2. Perhaps in the data you have other (non-printing) characters than \n?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: Zheng Lin Edwin Yeo
>>>>>>>>>>>>>> Sent: Thursday, 7 February 2019 15:23
>>>>>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We have tried the suggested regex pattern as follows:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>>>>>>>   <str name="fieldName">content</str>
>>>>>>>>>>>>>>   <str name="pattern">(\n\s*){2,}</str>
>>>>>>>>>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>>>>>>>>> </processor>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But we still have exactly the same problem with Examples 1, 2 and 3 below.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Example 1: The sentence for which the above regex pattern is working correctly
>>>>>>>>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>>>>>>>>>>>>> *Index content:* Dear Sir, <br><br> I am terminating
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Example 2: The sentence for which the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>>>>>>>>>>>>>> *Index content:* exalted <br><br> Psalm 89:17 <br><br><br><br> 3 Choa Chu Kang Avenue 4, Singapore
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Example 3: The sentence for which the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>>>>>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br><br><br> On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any further suggestions?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Edwin
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 7 Feb 2019 at 22:20, wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the {2,} part you could try
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (\n\s*){2,}
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If you also want to match CRLF then
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (\r?\n\s*){2,}
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: Zheng Lin Edwin Yeo
>>>>>>>>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>>>>>>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for your reply.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When I use this pattern:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>>>>>>>>   <str name="fieldName">content</str>
>>>>>>>>>>>>>>>   <str name="pattern">(\n+\s*){2,}</str>
>>>>>>>>>>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>>>>>>>>>> </processor>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It is working for some sentences within the same content and not working for some sentences.
>>>>>>>>>>>>>>> Please see below for one that is working and others that are not working (partially working):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Example 1: The sentence for which the above regex pattern is working correctly
>>>>>>>>>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>>>>>>>>>>>>>> *Index content:* Dear Sir, <br><br> I am terminating
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Example 2: The sentence for which the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>>>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>>>>>>>>>>>>>>> *Index content:* exalted <br><br> Psalm 89:17 <br><br><br><br> 3 Choa Chu Kang Avenue 4, Singapore
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Example 3: The sentence for which the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>>>>>>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br><br><br> On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We would appreciate your help in seeing what is wrong.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Edwin
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You don’t say what happens, just that it is not working. I assume nothing is replaced? Perhaps the pattern should be
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> "(\n\s*){2,}"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ??
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: Zheng Lin Edwin Yeo
>>>>>>>>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>>>>>>>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove two or more \n with any number of spaces between them (e.g. \n\n, \n \n, \n \n \n \n), and replace them with two <br>.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I use the following regex pattern and it works when I test it on regex101.com. But it does not work when I put it inside the RegexReplaceProcessorFactory as below:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>>>>>>>>>   <str name="fieldName">content</str>
>>>>>>>>>>>>>>>>   <str name="pattern">"(\\n\s*){2,}"</str>
>>>>>>>>>>>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>>>>>>>>>>> </processor>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To explain further about my regex pattern, \s* instructs the regex to match any \n that has spaces after it, and {2,} instructs the regex to match 2 or more occurrences of that pattern (\n).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please kindly let me know what is wrong and how I should do it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am using Solr 7.6.0.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Edwin
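
For anyone reading this thread later, here is a minimal Java sketch of the two-step replacement from the latest configuration, run on the "Original content" string of Example 1. It assumes Solr passes the configured patterns to java.util.regex unchanged (the thread notes that Solr uses Java regex matching) and that only spaces sit between the line endings. The class and method names are hypothetical, and the (<br>){3,} variant is an assumption of this sketch, not something proposed in the thread. It suggests one possible reason for the 5 <br>: the group in (<br><br>){3,} repeats a pair, so it needs six or more consecutive <br> before it matches. It does not account for the Example 3 result.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Solr.
final class BrCollapseSketch {
    // Step 1 from the thread: a line ending plus any preceding spaces/tabs becomes one <br>.
    static final Pattern LINE_END = Pattern.compile("[ \\t]*\\r?\\n");
    // Step 2 as posted: the group is a PAIR of <br>, so {3,} needs at least six <br>.
    static final Pattern PAIRS_3_PLUS = Pattern.compile("(<br><br>){3,}");
    // Assumed alternative: repeat a single <br>, so three or more collapse to two.
    static final Pattern SINGLES_3_PLUS = Pattern.compile("(<br>){3,}");

    public static String asPosted(String s) {
        String step1 = LINE_END.matcher(s).replaceAll("<br>");
        return PAIRS_3_PLUS.matcher(step1).replaceAll("<br><br>");
    }

    public static String alternative(String s) {
        String step1 = LINE_END.matcher(s).replaceAll("<br>");
        return SINGLES_3_PLUS.matcher(step1).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Example 1 "Original content": five \n separated only by spaces.
        String example1 = "Dear Sir,  \n\n \n \n\n I am terminating";
        System.out.println(asPosted(example1));    // Dear Sir,<br><br><br><br><br> I am terminating
        System.out.println(alternative(example1)); // Dear Sir,<br><br> I am terminating
    }
}
```

With five line endings, step 1 yields five <br>, which is one short of what (<br><br>){3,} requires, so the posted second step leaves them untouched. Seven line endings would collapse six of them and leave three <br> in total.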