lucene-solr-user mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: How to index correctly a text save with tinyMCE
Date Thu, 23 Jun 2011 16:51:38 GMT
Hi Ariel,

On 6/23/2011 at 12:34 PM, Ariel wrote:
> But it still doesn't convert the code to the correct character, for
> instance: Espa&amp;ntilde;a must be converted to España but it still
> remains as Espa&amp;ntilde;a.

So it looks like your text processing tool(s) escape markup meta-characters
(e.g. "&" -> "&amp;") after escaping above-ASCII characters to their named
entity equivalents (e.g. "n" with a tilde to "&ntilde;").  This two-level
escaping appears to be the problem.

According to the analysis.jsp output you sent, your original text "Espa&amp;ntilde;a"
was converted to "Espa&ntilde;a" - only the first level of escaping was reversed.

I suspect you could fix the problem by including HTMLStripCharFilter twice, e.g.:

   <charFilter class="solr.HTMLStripCharFilterFactory"/>
   <charFilter class="solr.HTMLStripCharFilterFactory"/>
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   ...
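
To see why two passes should be needed, here is a minimal sketch outside Solr. It uses Python's standard html.unescape as a stand-in for the entity decoding that HTMLStripCharFilter performs; it is not the Solr implementation, and the helper name unescape_passes is made up for illustration.

```python
import html


def unescape_passes(text: str, passes: int) -> str:
    """Decode HTML entities `passes` times in a row (hypothetical helper)."""
    for _ in range(passes):
        # Each call reverses exactly one level of escaping.
        text = html.unescape(text)
    return text


doubly_escaped = "Espa&amp;ntilde;a"

# One pass only turns "&amp;" back into "&", leaving a named entity behind.
once = unescape_passes(doubly_escaped, 1)

# A second pass decodes the remaining "&ntilde;" into the real character.
twice = unescape_passes(doubly_escaped, 2)

print(once)
print(twice)
```

The first pass leaves the text as a singly escaped entity, which matches what analysis.jsp showed; the second pass is what the extra charFilter is meant to provide.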

Good luck,
Steve
