lucene-java-user mailing list archives

From "Steven A Rowe"
Subject RE: Storing special characters in Lucene
Date Thu, 21 Aug 2008 17:47:00 GMT
Hola Juan,

On 08/21/2008 at 1:16 PM, Juan Pablo Morales wrote:
> I have an index in Spanish and I use Snowball to stem and
> analyze and it works perfectly. However, I am running into
> trouble storing (not indexing, only storing) words that
> have special characters.
> That is, I store the special character but then it comes
> garbled when I read it back.
> To provide an example:
> String content = "niños";
> doc.add(new Field("name", content, Store.YES, Index.TOKENIZED));
> writer.addDocument(doc, new SnowballAnalyzer("Spanish"));

If your source code is encoded as Latin-1, then it will likely appear to you to be the correct
character (depending on the editor/viewer you're using and its configuration), but Java may
not properly convert it to Unicode, depending on the encoding it expects your source code
to be in (see the -encoding option to javac - if you don't specify it, then the platform default
encoding is used).  You could test whether this is the problem by instead trying:

   String content = "ni\u00F1os";
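
To confirm that the compiled string really holds the character you intended, one option is to dump its code units before handing it to Lucene. This is only a rough sketch; the class and method names (CodePointCheck, describe) are made up for illustration and are not part of Lucene:

```java
public class CodePointCheck {
    // Formats each UTF-16 code unit of s in U+ hex notation, space-separated.
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The escape is pure ASCII, so it does not depend on the source file encoding.
        String content = "ni\u00F1os";
        System.out.println(describe(content));
        // prints: U+006E U+0069 U+00F1 U+006F U+0073
    }
}
```

If the third value is anything other than U+00F1 when you use the literal character instead of the escape, the source file was probably decoded with the wrong encoding at compile time.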

> Looking at the index with Luke it shows me "ni&#65533;os" but
> when I want to see the full text (by right clicking) it shows
> me ni os.

&#65533; is the Unicode replacement character (U+FFFD), and it's routinely used, including
within Lucene itself, as the substitute character for byte sequences that are not valid in
the designated source encoding.
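
As an illustration of how U+FFFD can show up, here is a small sketch that encodes the word as Latin-1 and then decodes those bytes as UTF-8. It assumes a JDK with java.nio.charset.StandardCharsets (Java 7 or later), and the class and method names are hypothetical. It is not a claim about where the mismatch happens in your setup, only a demonstration of the mechanism:

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // Encodes s as ISO-8859-1, then (wrongly) decodes the bytes as UTF-8.
    public static String decodeLatin1BytesAsUtf8(String s) {
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        // The String constructor substitutes U+FFFD for malformed input
        // instead of throwing an exception.
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = decodeLatin1BytesAsUtf8("ni\u00F1os");
        // The lone 0xF1 byte is not valid UTF-8, so it is replaced.
        System.out.println(garbled.indexOf('\uFFFD') >= 0); // prints: true
    }
}
```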

> I know Lucene is supposed to store fields in UTF8, but then,
> how can I make sure I store something and get it back just as
> it was, including special characters?

Make sure that the data you give to Lucene is encoded properly, and then what you get back
should also be.
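
If the text comes from a file or stream rather than a string literal, "encoded properly" usually means naming the charset when you decode the bytes instead of relying on the platform default. A minimal sketch, assuming the input is Latin-1 (substitute whatever your data actually uses); ExplicitCharsetRead and readAll are illustrative names, not Lucene API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRead {
    // Decodes the whole stream using the given charset, never the platform default.
    public static String readAll(InputStream in, Charset cs) {
        try (Reader reader = new InputStreamReader(in, cs)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The Latin-1 bytes for the example word; 0xF1 is n-with-tilde.
        byte[] latin1 = {0x6E, 0x69, (byte) 0xF1, 0x6F, 0x73};
        String s = readAll(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        System.out.println(s.equals("ni\u00F1os")); // prints: true
    }
}
```

The resulting String can then be passed to the Field constructor as usual.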

Please try the suggestion I gave you above ("ni\u00F1os").  If you still have the same problem,
you may have found a bug - please report back what you find.
