lucene-solr-user mailing list archives

From "Kundig, Andreas" <andreas.kun...@wipo.int>
Subject HTMLStripCharFilterFactory does not replace &#233;
Date Wed, 18 Nov 2009 11:18:09 GMT
Hello

I indexed an HTML document that uses decimal HTML entity encoding: the character é (e with
an acute accent) is encoded as &#233;. The exact content of the document is:

<html><body>&#231;a va m&#233;m&#233; ?</body></html>

A search for 'mémé' returns no documents. If I paste the line above into the Solr admin's
analysis.jsp, it does not match mémé either. There is a match only if I replace &#233; with é.
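
To make the expected result concrete, here is a minimal Python sketch of the tokens I would expect after tag stripping, entity decoding and whitespace tokenizing. It is not Solr code and says nothing about how HTMLStripCharFilterFactory is implemented; the helper name expected_tokens is made up for illustration, and the tag removal is a crude regex that is only adequate for this sample.

```python
import html
import re


def expected_tokens(document):
    """Strip tags, decode character references, then split on whitespace."""
    # Crude tag removal (assumption: no '>' inside attribute values).
    text = re.sub(r"<[^>]+>", " ", document)
    # Decode references such as &#231; (ç) and &#233; (é).
    text = html.unescape(text)
    # Mimic a whitespace tokenizer.
    return text.split()


doc = "<html><body>&#231;a va m&#233;m&#233; ?</body></html>"
print(expected_tokens(doc))  # ['ça', 'va', 'mémé', '?']
```

With tokens like these in the index, a query for mémé should find the document.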

This is how I configured the fieldType:

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I tried avoiding the problem by using the MappingCharFilterFactory:

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I put the file mapping.txt in the conf directory. It contains just this:

"&#233;" => "é"

This doesn't work either. How can I get this to work?
(I am using Solr 1.4.0.)

Thank you
Andréas Kündig

