lucene-java-user mailing list archives

From Andrzej Bialecki
Subject Re: lucene and UTF-8
Date Thu, 29 Sep 2005 11:29:14 GMT
John Cherouvim wrote:
> Hello
> I'm having some problems indexing my UTF-8 HTML pages. I am running 
> Lucene on Linux and I cannot understand why the generated index 
> depends on the locale of my operating system.
> If I do set | grep LANG I get: LANG=el_GR which is Greek. If I set this 
> to en_US the index generated will be different. Why is this the case? My 
> HTMLs are all UTF-8.

I think the difference comes from the default character encoding. If the 
page is NOT clearly marked as UTF-8, the system has to guess, and it 
guesses differently depending on the current locale.
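
To illustrate the guessing described above, here is a minimal sketch, assuming the indexing code reads the HTML files through a plain Java reader (I have not seen the actual code, so the class name ExplicitCharsetDemo and the helper decode are hypothetical). A reader created without a charset uses the platform default, which on many JVMs is derived from the locale, so the same bytes can decode to different text under el_GR and en_US. Naming the charset removes that dependency. ISO-8859-1 stands in here for "some other default encoding".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    // Hypothetical helper: decode the whole stream with an explicitly
    // named charset instead of relying on the platform default.
    static String decode(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Greek "alpha beta gamma", written with escapes so the source
        // file's own encoding does not matter.
        String original = "\u03b1\u03b2\u03b3";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Correct: the bytes are UTF-8 and we say so.
        String asUtf8 = decode(new ByteArrayInputStream(utf8Bytes),
                StandardCharsets.UTF_8);

        // Wrong guess: the same bytes read as ISO-8859-1 turn each
        // two-byte Greek letter into two unrelated characters.
        String asLatin1 = decode(new ByteArrayInputStream(utf8Bytes),
                StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.equals(original)); // true
        System.out.println(asLatin1.length());       // 6
    }
}
```

If the text is decoded like this before it is handed to the analyzer, the terms that end up in the index should no longer depend on LANG.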

> Also, is there a lucene index browser? I am currently using Luke, which 
> is good but it doesn't show the Greek UTF-8 from within the index 
> correctly. Is this a matter of a setting in Luke?

It's a matter of setting the appropriate font in Settings.

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
