lucene-solr-user mailing list archives

From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: Removing characters like '\n \n' from indexing
Date Tue, 26 May 2015 15:33:42 GMT
Neither - it removes the characters before indexing. The distinction is
that if you remove them during indexing they will still appear in the
stored field values even if they are removed from the indexed values, but
by removing them before indexing, they will not appear in the stored field
values. Again, the distinction is between indexed field values and stored
field values.
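
To make the indexed-versus-stored distinction concrete, here is a minimal Python sketch. It is only a toy model, not Solr code: the function names, the dictionary layout, and the `^\s+` pattern are assumptions made up for illustration, and Solr itself applies Java regular expressions through its own classes.

```python
import re

# Hypothetical pattern: any run of leading whitespace, newlines included.
LEADING_WS = re.compile(r"^\s+")


def strip_before_indexing(raw):
    """Toy model of an update processor: the field value itself is
    rewritten, so the stored copy and the indexed terms are both clean."""
    cleaned = LEADING_WS.sub("", raw)
    return {"stored": cleaned, "indexed": cleaned.lower().split()}


def strip_during_analysis(raw):
    """Toy model of an analysis-chain filter: only the indexed terms are
    derived from the cleaned text; the stored copy keeps the raw value."""
    cleaned = LEADING_WS.sub("", raw)
    return {"stored": raw, "indexed": cleaned.lower().split()}


raw = "\n \n \n Quarterly report"
before = strip_before_indexing(raw)
during = strip_during_analysis(raw)

print(repr(before["stored"]))  # stored value with the leading run removed
print(repr(during["stored"]))  # stored value still carries the leading run
```

In both cases the indexed terms come out the same; only the stored value, which is what search results return, differs.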

-- Jack Krupansky

On Tue, May 26, 2015 at 10:25 AM, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
wrote:

> It is showing up in the search results. Just to confirm, does this
> UpdateProcessor method remove the characters during indexing or only after
> indexing has been done?
>
> Regards,
> Edwin
>
> On 26 May 2015 at 21:30, Upayavira <uv@odoko.co.uk> wrote:
>
> >
> >
> > On Tue, May 26, 2015, at 02:20 PM, Zheng Lin Edwin Yeo wrote:
> > > Hi,
> > >
> > > Is there a way to remove special characters like \n during indexing
> > > of rich text documents?
> > >
> > > I have quite a lot of leading \n \n in front of the indexed content of
> > > my rich text documents, due to the spaces and empty lines in the
> > > original documents. The content is flooded with '\n \n' at the start,
> > > before the actual content begins. This makes the content look ugly and
> > > also takes up unnecessary bandwidth in the system.
> >
> > Where is this showing up?
> >
> > If it is in search results, you must use an UpdateProcessor, as these
> > run before fields are stored (e.g. RegexReplaceProcessorFactory).
> >
> > If you are concerned about facet results, then you can do it in an
> > analysis chain, for example with a PatternReplaceFilterFactory.
> >
> > Upayavira
> >
>
