lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: RegexReplaceProcessorFactory pattern to detect multiple \n
Date Wed, 06 Mar 2019 15:27:39 GMT
Hi Paul,

I have tried with the first match pattern to be <str name="pattern">[
\t\x0b\f]*\r?\n</str>, like the configuration below:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">[ \t\x0b\f]*\r?\n</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
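
For anyone who wants to reproduce the two steps outside Solr, here is a minimal sketch in plain Java. It assumes that Solr hands the two pattern strings to java.util.regex unchanged and that the content contains only ordinary ASCII spaces; the class and method names are hypothetical, not part of Solr.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: applies the same two regex steps in order.
final class TwoStepBrSketch {
    // Step 1: optional space/tab/vertical-tab/form-feed, then one line ending.
    static final Pattern LINE_END = Pattern.compile("[ \\t\\x0b\\f]*\\r?\\n");
    // Step 2: a run of three or more consecutive <br>.
    static final Pattern BR_RUN = Pattern.compile("(<br>){3,}");

    static String apply(String content) {
        // Every line ending (with its preceding ASCII whitespace) becomes one <br>.
        String step1 = LINE_END.matcher(content).replaceAll("<br>");
        // Runs of three or more <br> are cut down to exactly two.
        return BR_RUN.matcher(step1).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Shortened version of Example 2, typed with ordinary spaces only.
        String plain = "Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang";
        System.out.println(apply(plain));
    }
}
```

With ordinary spaces the four line endings collapse into a single pair of <br>, so the patterns themselves behave as intended on such input.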

However, the result is still the same as before (previous index results),
with the 4 <br>.
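
One possible explanation, which is only an assumption and has not been confirmed on this data, is that the characters between the two pairs of \n are not ordinary spaces but something like a no-break space (U+00A0). By default Java's \s and a class such as [ \t\x0b\f] do not match U+00A0. The sketch below uses hypothetical helper names; it dumps the code points of the non-alphanumeric characters and compares the ASCII-only class with one that also lists U+00A0.

```java
// Hypothetical diagnostic helpers, not part of Solr.
final class WhitespaceProbeSketch {
    // Lists the code point of every character that is neither a letter nor a digit,
    // so invisible characters between line endings become visible.
    static String describe(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints()
         .filter(cp -> !Character.isLetterOrDigit(cp))
         .forEach(cp -> out.append(String.format("U+%04X ", cp)));
        return out.toString().trim();
    }

    // The two steps from the configuration, with the ASCII-only class in step 1.
    static String collapseAscii(String content) {
        return content.replaceAll("[ \\t\\x0b\\f]*\\r?\\n", "<br>")
                      .replaceAll("(<br>){3,}", "<br><br>");
    }

    // Same steps, but step 1 also treats U+00A0 (no-break space) as whitespace.
    static String collapseWithNbsp(String content) {
        return content.replaceAll("[ \\t\\x0b\\f\\u00A0]*\\r?\\n", "<br>")
                      .replaceAll("(<br>){3,}", "<br><br>");
    }

    public static void main(String[] args) {
        // Assumed input: no-break spaces sit between the two pairs of line endings.
        String s = "Psalm 89:17   \n\n\u00A0\u00A0\u00A0\n\n  3 Choa";
        System.out.println(describe(s));
        System.out.println(collapseAscii(s));     // four <br> remain, split by U+00A0
        System.out.println(collapseWithNbsp(s));  // collapses to one pair of <br>
    }
}
```

If the dump of the real field value shows U+00A0 or another non-ASCII space, adding that character to the class in the first pattern (or using \h, which Java 8 and later define as horizontal whitespace including U+00A0) may be worth trying.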

Regards,
Edwin


On Wed, 6 Mar 2019 at 18:23, <paul.dodd@ub.unibe.ch> wrote:

> Hi Edwin
>
>
>
> You are correct  re the 2nd pattern – my bad. Looking at the 4 <br>, it’s
> actually the sequence «<br><br>  <br><br>»? So perhaps the first match
> pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>
>
>
> i.e. [space tab vertical-tab formfeed]
>
>
>
> Regards,
>
> Paul
>
>
>
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> Windows 10
>
>
>
> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> Sent: Wednesday, 6 March 2019 07:44
> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
>
>
> Hi Paul,
>
> I have modified the second pattern to be (&lt;br&gt;){3,}, instead of
> (&lt;br&gt;&lt;br&gt;){3,}. This pattern of  (&lt;br&gt;&lt;br&gt;){3,}
> will actually look for 6 or more <br> instead of 3 <br>,  as we have put
> the <br> two times in the pattern, which is the reason that there are more
> <br> in the result, as cases where there are less than 6 <br> are not being
> replaced, so we ended up having up to 5 <br> in the index.
>
> Modified configuration:
>  <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(&lt;br&gt;){3,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
>  </processor>
>
> This will bring us back to the result of the previous index content,
> meaning the issue of having the 4 <br> is still there.
>
> Regards,
> Edwin
>
>
> On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> wrote:
>
> > Hi Paul,
> >
> > Further to my previous email, in which there was an extra "}" in the
> > configuration, I have changed to the configuration below, based on your
> > suggestion.
> >
> > <processor class="solr.RegexReplaceProcessorFactory">
> >    <str name="fieldName">content</str>
> >    <str name="pattern">[ \t]*\r?\n</str>
> >    <str name="replacement">&lt;br&gt;</str>
> >    <bool name="literalReplacement">true</bool>
> > </processor>
> > <processor class="solr.RegexReplaceProcessorFactory">
> >    <str name="fieldName">content</str>
> >    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >    <bool name="literalReplacement">true</bool>
> > </processor>
> >
> > However, the result that I get still has more than 2 <br>. In fact, the
> > result has become worse, as you can see from the comparison below.
> >
> > Example 1: The sentence for which the regex pattern used to work correctly.
> > With the latest pattern, it has changed from 2 <br> to 5 <br>, which is
> > wrong.
> > *Original content in EML file:*
> > Dear Sir,
> >
> >
> > I am terminating
> > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > *Previous Index content: *    Dear Sir,  <br><br>I am terminating
> > *Current Index content*:   Dear Sir, <br><br><br><br><br> I am terminating
> >
> > Example 2: The sentence that the above regex pattern is partially working
> > (as you can see, instead of 2 <br>, there are 4 <br>)
> > *Original content in EML file:*
> >
> > *exalted*
> >
> > *Psalm 89:17*
> >
> >
> > 3 Choa Chu Kang Avenue 4
> > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
> > Chu Kang Avenue 4, Singapore
> > *Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> > <br><br>3 Choa Chu Kang Avenue 4, Singapore
> > *Current Index content*: <br><br><br>   Psalm 89:17<br><br>  <br><br>  3
> > Choa Chu Kang Avenue 4, Singapore
> >
> > Example 3: The sentence that the above regex pattern is partially working
> > (as you can see, instead of 2 <br>, there are 4 <br>). For the latest
> > code, there are now 5 <br>
> > *Original content in EML file:*
> >
> > http://www.concorded.com/
> >
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Dec 18, 2018 at 10:07 AM
> > *Original content:* http://www.concorded.com/   \n\n   \n\n \n \n\n \n\n
> > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at
> > 10:07 AM
> > *Previous Index content: *http://www.concorded.com/   <br><br>
> > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > *Current Index content:* http://www.concorded.com/<br><br>  <br><br><br>
> > On Tue, Dec 18, 2018 at 10:07 AM
> >
> >
> > Regards,
> > Edwin
> >
> > On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > wrote:
> >
> >> Hi Paul,
> >>
> >> Thank you for the reply.
> >>
> >> I have tried to add the following configuration according to your
> >> suggestion:
> >>
> >> <processor class="solr.RegexReplaceProcessorFactory">
> >>    <str name="fieldName">content</str>
> >>    <str name="pattern">[ \t]*\r?\n}</str>
> >>    <str name="replacement">&lt;br&gt;</str>
> >>    <bool name="literalReplacement">true</bool>
> >> </processor>
> >>
> >> <processor class="solr.RegexReplaceProcessorFactory">
> >>    <str name="fieldName">content</str>
> >>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> >>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>    <bool name="literalReplacement">true</bool>
> >> </processor>
> >>
> >> However, none of the \n is being removed this time round.
> >> Is the order and/or the pattern correct?
> >>
> >> Regards,
> >> Edwin
> >>
> >> On Tue, 5 Mar 2019 at 19:54, <paul.dodd@ub.unibe.ch> wrote:
> >>
> >>> Hi Edwin
> >>>
> >>>
> >>>
> >>> Try for the first pattern/replacement
> >>>
> >>>
> >>>
> >>> <str name="pattern">[ \t]*\r?\n</str>
> >>>
> >>> <str name="replacement">&lt;br&gt;</str>
> >>>
> >>>
> >>>
> >>> Now all line endings and preceding whitespace characters should be
> >>> changed to ‘<br>’.
> >>>
> >>>
> >>>
> >>> The second pattern replacement should replace 3 or more ‘<br>’
> >>> sequences with 2 ‘<br>’ sequences:
> >>>
> >>>
> >>>
> >>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> >>>
> >>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>
> >>>
> >>>
> >>> Hope this approach works. Sorry for not replying earlier and best
> >>> regards,
> >>>
> >>> Paul
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> >>> Sent: Tuesday, 5 March 2019 03:35
> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>
> >>>
> >>>
> >>> Hi,
> >>>
> >>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > Does anyone else have other suggestions, or has anyone faced the same problem?
> >>> >
> >>> > Regards,
> >>> > Edwin
> >>> >
> >>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> > wrote:
> >>> >
> >>> >> Hi Paul,
> >>> >>
> >>> >> If I execute the second step first, I only get a single <br> in the
> >>> >> places that previously had 2 <br>.
> >>> >> In the places where we originally got 4 <br>, there are now 2 <br> with
> >>> >> a space in between.
> >>> >>
> >>> >> This just changes the 2 <br> into a single <br>, since the second
> >>> >> step replaces with a single <br>.
> >>> >> It has not solved the underlying problem yet.
> >>> >>
> >>> >> Regards,
> >>> >> Edwin
> >>> >>
> >>> >>
> >>> >> On Wed, 20 Feb 2019 at 16:41, <paul.dodd@ub.unibe.ch> wrote:
> >>> >>
> >>> >>> If the second step is executed first, then you will get the unwanted 4 <br>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> >>> >>> Sent: Wednesday, 20 February 2019 09:29
> >>> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>> Hi Jörn ,
> >>> >>>
> >>> >>> Do you mean the regex is not correct?
> >>> >>>
> >>> >>> We are already using two RegexReplaceProcessorFactory steps, like
> >>> >>> the ones shown below. The output that we get is still the same.
> >>> >>>
> >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>>      <str name="fieldName">content</str>
> >>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
> >>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> >>>      <bool name="literalReplacement">true</bool>
> >>> >>> </processor>
> >>> >>>
> >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>>      <str name="fieldName">content</str>
> >>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
> >>> >>>      <str name="replacement">&lt;br&gt;</str>
> >>> >>>      <bool name="literalReplacement">true</bool>
> >>> >>> </processor>
> >>> >>>
> >>> >>> Regards,
> >>> >>> Edwin
> >>> >>>
> >>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfranke@gmail.com> wrote:
> >>> >>>
> >>> >>> > Then you need two RegexReplaceProcessorFactory steps
> >>> >>> >
> >>> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> >>> >>> > >
> >>> >>> > > Hi,
> >>> >>> > >
> >>> >>> > > Thanks for the reply.
> >>> >>> > >
> >>> >>> > > Do you know of any online regex tool that works correctly for
> >>> >>> > > Java regex? I tried to find some, but they are not working properly.
> >>> >>> > >
> >>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and a
> >>> >>> > > single \n with a single <br>.
> >>> >>> > >
> >>> >>> > > Regards,
> >>> >>> > > Edwin
> >>> >>> > >
> >>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com> wrote:
> >>> >>> > >>
> >>> >>> > >> Solr uses Java regex matching, so I doubt there is a bug; it would
> >>> >>> > >> then be in the JDK. Try out your solution in an online regex tool
> >>> >>> > >> that supports Java regex.
> >>> >>> > >>
> >>> >>> > >> I believe you want to have 2 regex processor factories:
> >>> >>> > >> one that deals with a single \n and one that deals with more than
> >>> >>> > >> one \n
> >>> >>> > >>
> >>> >>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> >>> >>> > >>>
> >>> >>> > >>> Hi,
> >>> >>> > >>>
> >>> >>> > >>> We have tried the following pattern, ([ \t]*\r?\n){2,}, and
> >>> >>> > >>> configuration:
> >>> >>> > >>>
> >>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>> > >>>  <str name="fieldName">content</str>
> >>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
> >>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> >>> > >>>  <bool name="literalReplacement">true</bool>
> >>> >>> > >>> </processor>
> >>> >>> > >>>
> >>> >>> > >>> However, the issue is still occurring.
> >>> >>> > >>>
> >>> >>> > >>> Is anyone else able to help?
> >>> >>> > >>>
> >>> >>> > >>> Regards,
> >>> >>> > >>> Edwin
> >>> >>> > >>>
> >>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> >>> > >>> wrote:
> >>> >>> > >>>
> >>> >>> > >>>> Hi,
> >>> >>> > >>>>
> >>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
> >>> >>> > >>>>
> >>> >>> > >>>> Regards,
> >>> >>> > >>>> Edwin
> >>> >>> > >>>>
> >>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> >>> > >>>> wrote:
> >>> >>> > >>>>
> >>> >>> > >>>>> Hi,
> >>> >>> > >>>>>
> >>> >>> > >>>>> Should we report this as a bug in Solr?
> >>> >>> > >>>>>
> >>> >>> > >>>>> Regards,
> >>> >>> > >>>>> Edwin
> >>> >>> > >>>>>
> >>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> >>> > >>>>> wrote:
> >>> >>> > >>>>>
> >>> >>> > >>>>>> Hi Paul,
> >>> >>> > >>>>>>
> >>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it
> >>> >>> > >>>>>> on https://regex101.com/, it gives us the correct result for all
> >>> >>> > >>>>>> the examples (i.e. all of them only have <br><br>, and not more
> >>> >>> > >>>>>> than that, unlike what we are getting in Solr in our earlier
> >>> >>> > >>>>>> examples).
> >>> >>> > >>>>>>
> >>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
> >>> >>> > >>>>>>
> >>> >>> > >>>>>> Regards,
> >>> >>> > >>>>>> Edwin
> >>> >>> > >>>>>>
> >>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> >>> > >>>>>> wrote:
> >>> >>> > >>>>>>
> >>> >>> > >>>>>>> Hi Paul,
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> We have tried it with the space preceding the \n, i.e.
> >>> >>> > >>>>>>> <str name="pattern">(\s*\n){2,}</str>, with the following
> >>> >>> > >>>>>>> configuration:
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>> > >>>>>>>  <str name="fieldName">content</str>
> >>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
> >>> >>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> >>> > >>>>>>> </processor>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> However, we are also getting the exact same results as in the
> >>> >>> > >>>>>>> earlier Examples 1, 2 and 3.
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> As for your point 2, that perhaps the data has other (non-printing)
> >>> >>> > >>>>>>> characters than \n, we have found that there are no non-printing
> >>> >>> > >>>>>>> characters. It is just a new line with a space. You can refer to
> >>> >>> > >>>>>>> the original content in the same examples below.
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> Example 1: The sentence that the above regex pattern is working
> >>> >>> > >>>>>>> correctly
> >>> >>> > >>>>>>> *Original content in EML file:*
> >>> >>> > >>>>>>> Dear Sir,
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> I am terminating
> >>> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> Example 2: The sentence that the above regex pattern is partially
> >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> >>> > >>>>>>> *Original content in EML file:*
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> *exalted*
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> *Psalm 89:17*
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
> >>> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br> <br><br>3
> >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> Example 3: The sentence that the above regex pattern is partially
> >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> >>> > >>>>>>> *Original content in EML file:*
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> >>> >>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
> >>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> >>> >>> > >>>>>>> Dec 18, 2018 at 10:07 AM
> >>> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> Appreciate any other ideas or suggestions that you may have.
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> Thank you.
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> Regards,
> >>> >>> > >>>>>>> Edwin
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch> wrote:
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Hi Edwin
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong, the space should precede the \n
> >>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
> >>> >>> > >>>>>>>> 2.  Perhaps in the data you have other (non-printing) characters
> >>> >>> > >>>>>>>> than \n?
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> >>> >>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
> >>> >>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>> >>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Hi Paul,
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> We have tried this suggested regex pattern as follows:
> >>> >>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>> > >>>>>>>>  <str name="fieldName">content</str>
> >>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
> >>> >>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> >>> > >>>>>>>> </processor>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> But we still have exactly the same problem with Examples 1, 2 and 3
> >>> >>> > >>>>>>>> below.
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Example 1: The sentence that the above regex pattern is working
> >>> >>> > >>>>>>>> correctly
> >>> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Example 2: The sentence that the above regex pattern is partially
> >>> >>> > >>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>> >>> > >>>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br> <br><br>3
> >>> >>> > >>>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Example 3: The sentence that the above regex pattern is partially
> >>> >>> > >>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> >>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
> >>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> >>> >>> > >>>>>>>> Dec 18, 2018 at 10:07 AM
> >>> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>> >>> > >>>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Any further suggestion?
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Thank you.
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Regards,
> >>> >>> > >>>>>>>> Edwin
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch> wrote:
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
> >>> >>> > >>>>>>>>> {2,} part you could try
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> If you also want to match CRLF then
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> >>> >>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
> >>> >>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>> >>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> Hi Paul,
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> Thanks for your reply.
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> When I use this pattern:
> >>> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
> >>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
> >>> >>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> >>> > >>>>>>>>> </processor>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> It is working for some sentences within the same content and not
> >>> >>> > >>>>>>>>> working for others. Please see below for the one that is working
> >>> >>> > >>>>>>>>> and the others that are not working (partially working):
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> Example 1: The sentence that the above regex pattern is working
> >>> >>> > >>>>>>>>> correctly
> >>> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> Example 2: The sentence that the above regex pattern is partially
> >>> >>> > >>>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>> >>> > >>>>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br> <br><br>3
> >>> >>> > >>>>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> Example 3: The sentence that the above regex pattern is partially
> >>> >>> > >>>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> >>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
> >>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> >>> >>> > >>>>>>>>> Dec 18, 2018 at 10:07 AM
> >>> >>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>> >>> > >>>>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> We would appreciate your help to see what is wrong?
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> Thank you.
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> Regards,
> >>> >>> > >>>>>>>>> Edwin
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not working. I assume
> >>> >>> > >>>>>>>>>> nothing is replaced? Perhaps the pattern should be
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> ??
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> >>> >>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
> >>> >>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>> >>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> Hi,
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to replace two
> >>> >>> > >>>>>>>>>> or more \n with any number of spaces between them (e.g. \n\n,
> >>> >>> > >>>>>>>>>> \n \n, \n \n \n \n) with two <br>.
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> I use the following regex pattern, and it works when I test it on
> >>> >>> > >>>>>>>>>> regex101.com. But it does not work when I put it inside the
> >>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
> >>> >>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
> >>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
> >>> >>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> >>> > >>>>>>>>>> </processor>
> >>> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> To explain further about my regex pattern: \s* instructs the regex
> >>> >>> > >>>>>>>>>> to match any \n that has whitespace after it, and {2,} instructs
> >>> >>> > >>>>>>>>>> the regex to match 2 or more occurrences of that pattern (\n).
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how I should do it.
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> Regards,
> >>> >>> > >>>>>>>>>> Edwin
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>
> >>> >>> >
> >>> >>>
> >>> >>
> >>>
> >>
>
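
On the very first pattern in the thread, "(\\n\s*){2,}": assuming the text of <str name="pattern"> reaches java.util.regex unchanged, the surrounding quotes and the doubled backslash are taken literally, so the regex looks for a quote character, a backslash and the letter n rather than a line feed. The sketch below uses a hypothetical helper class to compare that raw text with the unquoted form suggested later in the thread.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: checks whether a regex finds anything in a text.
final class RawPatternSketch {
    static boolean finds(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Dear Sir,\n \nI am terminating";

        // Regex text exactly as written in the first message, quotes included.
        String quoted = "\"(\\\\n\\s*){2,}\"";
        // Regex text without the quotes and with a single backslash before n.
        String plain = "(\\n\\s*){2,}";

        System.out.println(finds(quoted, text)); // false: no quote or backslash in the text
        System.out.println(finds(plain, text));  // true: two line feeds with a space between
    }
}
```

This only explains why nothing was replaced in the first attempt; it does not account for the later cases where 4 <br> remain.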
