lucene-solr-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: How to remove Scripts and Styles in content of SOLR Indexes[content field] while indexed through URL?
Date Thu, 10 Aug 2017 14:49:52 GMT
Hi Daniel,

HTMLStripCharFilterFactory in your index analyzer should do the trick: <https://lucene.apache.org/solr/guide/6_6/charfilterfactories.html#CharFilterFactories-solr.HTMLStripCharFilterFactory>
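
As a minimal sketch of what that could look like in the schema (the field type name "text_html_stripped" and the choice of tokenizer and lowercase filter are assumptions for illustration, not taken from your setup), the char filter goes first in the index analyzer:

```xml
<!-- Hypothetical field type; the name and the tokenizer/filter chain are illustrative. -->
<fieldType name="text_html_stripped" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Runs before tokenization: strips HTML markup, including the
         contents of <script> and <style> elements. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The field that receives the extracted content would then need to use this type. Note that a char filter only changes the indexed terms, not the stored value, and it can only help if the markup is still present in the text that reaches the analyzer.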

--
Steve
www.lucidworks.com

> On Aug 10, 2017, at 4:13 AM, Daniel von der Helm <D.vonderHelm@neumueller.com> wrote:
> 
> Hi,
> if a fetched HTML page (using SimplePostTool: -Ddata=web) contains <script> and
> <style> tags inside the <body> tag (not in the <head> tag), the inner text
> (i.e. ECMAScript/JS scripts and CSS styles) of these tags remains part of the
> document text in the "content"/"_text_" field of the indexed documents.
> 
> So when I search in _text_ for "push(arguments)", for example, I get a result :(
> Any idea how to remove this unwanted content?
> Using: Solr 6.6.0.
> solrconfig.xml:
> 
> <requestHandler name="/update/extract"
>                 startup="lazy"
>                 class="solr.extraction.ExtractingRequestHandler">
>   <lst name="defaults">
>     <str name="lowernames">true</str>
>     <str name="uprefix">ignored_</str>
>     <str name="captureAttr">true</str>
>     <str name="fmap.meta">ignored_</str>
>     <str name="fmap.content">plaintext</str>
>   </lst>
> </requestHandler>
> Thanks in advance
> Daniel
> 

