lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Juan Pablo Morales" <>
Subject Re: Storing special characters in Lucene
Date Thu, 21 Aug 2008 17:58:13 GMT
On Thu, Aug 21, 2008 at 12:47 PM, Steven A Rowe <> wrote:

> Hola Juan,

Hi Steve

> On 08/21/2008 at 1:16 PM, Juan Pablo Morales wrote:
> > I have an index in Spanish and I use Snowball to stem and
> > analyze and it works perfectly. However, I am running into
> > trouble storing (not indexing, only storing) words that
> > have special characters.
> >
> > That is, I store the special character but the it comes
> > garbled when I read it back.
> > To provide an example:
> >
> > String content = "niños";
> > document.add(new Field("name",content,Store.YES, Index.Tokenized));
> > writer.addDocument(doc, new SnowballAnalyzer("Spanish"));
> If your source code is encoded as Latin-1, then it will likely appear to
> you to be the correct character (depending on the editor/viewer you're using
> and its configuration), but Java may not properly convert it to Unicode,
> depending on the encoding it expects your source code to be in (see the
> -encoding option to javac - if you don't specify it, then the platform
> default encoding is used).  You could test whether this is the problem by
> instead trying:
>   String content = "ni\u00F1os";
>   ...
> > Looking at the index with Luke it shows me "ni&#65533;os" but
> > when I want to see the full text (by right clicking) it shows
> > me ni os.
> &#65533; is the Unicode replacement character (U+FFFD), and it's routinely
> used, including within Lucene itself, as the substitute character for byte
> sequences that are not valid in the designated source encoding.
> > I know Lucene is supposed to store fields in UTF8, but then,
> > how can I make sure I sotre something and get it back just as
> > it was, including special characters?
> Make sure that the data you give to Lucene is encoded properly, and then
> what you get back should also be.
> Please try the suggestion I gave you above ("ni\u00F1os").  If you still
> have the same problem, you may have found a bug - please report back what
> you find.

I just gave my example that string, ant it correctly wrote it on the screen
as an ñ, but storing and retrieving it yielded the same results, that is,
the character gets lost in translation.

Juan Pablo Morales
Ingenian Software ltda

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message