lucene-solr-user mailing list archives

From Merlin Morgenstern <merlin.morgenst...@googlemail.com>
Subject Re: strip html from data
Date Mon, 25 Jul 2011 13:03:18 GMT
Sounds logical. I just changed it to the following, restarted and reindexed
with commit:

         <fieldType name="text" class="solr.TextField"
                    positionIncrementGap="100" autoGeneratePhraseQueries="true">
                <analyzer type="index">
                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                    <filter class="solr.WordDelimiterFilterFactory"
                            generateWordParts="1" generateNumberParts="1" catenateWords="1"
                            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                    <filter class="solr.LowerCaseFilterFactory"/>
                    <filter class="solr.KeywordMarkerFilterFactory"/>
                    <filter class="solr.PorterStemFilterFactory"/>
                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
                </analyzer>
                <analyzer type="query">
                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                    <filter class="solr.WordDelimiterFilterFactory"
                            generateWordParts="1" generateNumberParts="1" catenateWords="0"
                            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                    <filter class="solr.LowerCaseFilterFactory"/>
                    <filter class="solr.KeywordMarkerFilterFactory"/>
                    <filter class="solr.PorterStemFilterFactory"/>
                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
                </analyzer>
         </fieldType>

Unfortunately that did not fix the error. There are still <h3> tags inside
the data. I believe there are fewer than before, although I cannot
prove that. The fact is, there are still HTML tags inside the data.
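
A possible explanation, offered as an assumption rather than something
confirmed in this thread: analyzers, char filters included, only change the
terms written to the index. With stored="true", the stored value returned in
search results is the original input, so the tags would still show up there
even when the index-time char filter is working. If the data is loaded through
the DataImportHandler, stripping the markup at import time is one option. A
minimal sketch of an entity inside <document> in data-config.xml, where the
entity name, the query and the column name are hypothetical placeholders:

         <!-- data-config.xml fragment; "item", the query and "text" are placeholders -->
         <entity name="item" query="SELECT id, text FROM item"
                 transformer="HTMLStripTransformer">
                <!-- stripHTML="true" marks the column whose markup is removed before indexing and storing -->
                <field column="text" stripHTML="true"/>
         </entity>

The transformer only touches columns that carry stripHTML="true", so both the
transformer attribute on the entity and the flag on the field are needed.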

Any other ideas what the problem could be?





2011/7/25 Markus Jelsma <markus.jelsma@openindex.io>

> You have three analyzer elements; I wonder what that would do. You need to
> add the char filter to the index-time analyzer.
>
> On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:
> > Hi there,
> >
> > I am trying to strip HTML tags from the data before adding the documents
> > to the index. To do that I altered schema.xml like this:
> >
> >          <fieldType name="text" class="solr.TextField"
> >                     positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >                 <analyzer type="index">
> >                     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >                     <filter class="solr.WordDelimiterFilterFactory"
> >                             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >                             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >                     <filter class="solr.LowerCaseFilterFactory"/>
> >                     <filter class="solr.KeywordMarkerFilterFactory"/>
> >                     <filter class="solr.PorterStemFilterFactory"/>
> >                 </analyzer>
> >                 <analyzer type="query">
> >                     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >                     <filter class="solr.WordDelimiterFilterFactory"
> >                             generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >                             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >                     <filter class="solr.LowerCaseFilterFactory"/>
> >                     <filter class="solr.KeywordMarkerFilterFactory"/>
> >                     <filter class="solr.PorterStemFilterFactory"/>
> >                 </analyzer>
> >                 <analyzer>
> >                     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >                     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >                 </analyzer>
> >          </fieldType>
> >
> >     <fields>
> >         <field name="text" type="text" indexed="true" stored="true"
> >                required="false"/>
> >     </fields>
> >
> > Unfortunately this does not work; HTML tags like <h3> are still
> > present after restarting and reindexing. I also tried
> > HTMLStripTransformer, but this did not work either.
> >
> > Does anybody have an idea how to get this done? Thank you in advance for
> > any hint.
> >
> > Merlin
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
