lucene-java-user mailing list archives

From John Haxby <...@scalix.com>
Subject Re: lucene and UTF-8
Date Thu, 29 Sep 2005 11:06:50 GMT
John Cherouvim wrote:

> I'm having some problems indexing my UTF-8 html pages. I am running 
> lucene on Linux and I cannot understand why does the index generated 
> depends on the locale of my operating system.
> If I do set | grep LANG I get: LANG=el_GR which is Greek. If I set 
> this to en_US the index generated will be different. Why is this the 
> case? My HTMLs are all UTF-8.

What version of Linux are you using?

On Fedora Core 4 (and probably other Fedora releases and RHEL), LANG=el_GR 
sets the character set to ISO 8859-7, e.g. (on my various machines):

    $ LANG=el_GR date | iconv -f iso88597
    Πεμ Σεπ 29 11:59:19 BST 2005
    $ LANG=el_GR.utf8 date
    Πεμ Σεπ 29 12:01:40 BST 2005

(Everything in FC4 is UTF-8 so it displays right and it seems that the 
Greek for "Sep" is "Sep" -- no surprises there I guess.)

In your case, replacing "date" with whatever command you use to generate 
the indexes (that is, running it with LANG=el_GR.utf8) should do the 
right thing.
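
If changing LANG is not an option, the other approach is to stop relying on the platform default charset at all. The following is only a minimal sketch, and it assumes (the thread does not confirm this) that the indexing code reads the HTML files through something like FileReader, which on JVMs of this era picks its encoding from the locale. The class name ExplicitCharsetDemo and the helper decode are made up for illustration and are not part of Lucene.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene.
public class ExplicitCharsetDemo {

    // Decode the whole stream with the charset named by the caller,
    // never with the platform default that LANG influences.
    static String decode(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Greek "Sep", written with Unicode escapes so the encoding of
        // this source file does not matter.
        byte[] utf8Bytes = "\u03A3\u03B5\u03C0".getBytes(StandardCharsets.UTF_8);

        // Same bytes, two decodings: the one the file was written in, and
        // the one an el_GR (ISO 8859-7) default would apply.
        String right = decode(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String wrong = decode(new ByteArrayInputStream(utf8Bytes), Charset.forName("ISO-8859-7"));

        System.out.println("platform default charset: " + Charset.defaultCharset());
        System.out.println("decoded as UTF-8:      " + right.length() + " chars");
        System.out.println("decoded as ISO-8859-7: " + wrong.length() + " chars");
    }
}
```

The same bytes turn into different strings depending on the charset used to decode them, so the tokens that end up in the index differ too. Opening each HTML file with an InputStreamReader and an explicit UTF-8 charset, then passing that Reader (or the decoded String) to whatever builds the documents, should make the index independent of LANG.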

jch

