lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <>
Subject Re: lucene and UTF-8
Date Thu, 29 Sep 2005 21:18:05 GMT

: I'm having some problems indexing my UTF-8 html pages. I am running
: lucene on Linux and I cannot understand why does the index generated
: depends on the locale of my operating system.
: If I do set | grep LANG I get: LANG=el_GR which is Greek. If I set this
: to en_US the index generated will be different. Why is this the case? My
: HTMLs are all UTF-8.

To elaborate a little bit more on the comments other people have made, the
differences you are seeing are most likely related to your JVM using the
LANG variable to determine what the default charset will be when you open
readers.  You should look carefully at how you are opening the HTML files
and reading them in. If you raen't specifying the Charset explicitly in
your code, then you're getting the system default.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message