lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content
Date Mon, 31 Dec 2018 03:45:55 GMT
These texts are likely from the original EML file data, but they are not
visible in the content when the EML file is opened in Microsoft Outlook.

I have already applied the HTMLStripFieldUpdateProcessorFactory in
solrconfig.xml, but these texts are still showing up in the index. Below is
my configuration.

<updateRequestProcessorChain name="html-strip-content">
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">content_tcs</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
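
One thing worth ruling out, offered as a possibility rather than a diagnosis: a chain that is not marked default="true" only runs when a request selects it through the update.chain parameter. A minimal sketch of wiring the chain above into a request handler follows. The /update/extract handler is an assumption about how the EML files are being posted; it does not come from this thread.

```xml
<!-- Sketch only: the handler name and class are assumptions, not taken from
     the original configuration. Adjust to whichever handler receives the
     EML documents. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Route every request on this handler through the chain defined above -->
    <str name="update.chain">html-strip-content</str>
  </lst>
</requestHandler>
```

Alternatively, the same effect can be had per request by adding update.chain=html-strip-content to the update URL, or by marking the chain itself with default="true".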


Regards,
Edwin

On Mon, 31 Dec 2018 at 11:29, Alexandre Rafalovitch <arafalov@gmail.com>
wrote:

> Specifically, a custom Update Request Processor chain can be used before
> indexing, probably with HTMLStripFieldUpdateProcessorFactory.
>
> Regards,
>      Alex
>
> On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore <v.damore@gmail.com> wrote:
>
> > Hi,
> >
> > I think this kind of text manipulation should be done before indexing. If
> > you have font-size and font-family in your text, you are very likely
> > indexing HTML with CSS.
> > If I’m right, you’re entering a hell of words that would have to be
> > removed from your text.
> >
> > On the other hand, if you have to do this at index time, a quick and dirty
> > solution is to use the pattern-replace filter:
> >
> > https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#pattern-replace-filter
> >
> > Ciao,
> > Vincenzo
> >
> > --
> > mobile: 3498513251
> > skype: free.dev
> >
> > > On 31 Dec 2018, at 02:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I noticed that during the indexing of EML files, words like
> > > "FONT-SIZE: 9pt; FONT-FAMILY: arial" are being indexed into the content
> > > as well.
> > >
> > > How can we remove those words during indexing?
> > >
> > > I am using Solr 7.5.0
> > >
> > > Regards,
> > > Edwin
> >
>
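
The pattern-replace approach suggested in the quoted thread could be sketched roughly as below. Note that this sketch swaps in the char-filter variant (PatternReplaceCharFilterFactory) rather than the token filter from the link. A token filter sees one token at a time, while a char filter sees the raw text before tokenization and so can match a phrase such as "FONT-SIZE: 9pt;". The field type name text_no_css and the regular expression are illustrative assumptions. The pattern only removes single-word values, so a font name containing spaces would be only partly removed.

```xml
<!-- Sketch only: "text_no_css" is a hypothetical field type name and the
     pattern is an illustrative guess at the CSS residue seen in the index. -->
<fieldType name="text_no_css" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Remove "font-size: <value>;" and "font-family: <value>;" fragments
         (case-insensitive) before the text reaches the tokenizer -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?i)\bfont-(?:size|family)\s*:\s*[\w.%-]+\s*;?"
                replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

An analysis-time filter like this changes only the indexed terms. The stored field value returned in search results would still contain the original text, which is one reason cleaning the content before indexing is the more thorough fix.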
