lucene-java-user mailing list archives

From Owen Densmore
Subject Re: Lucene Unicode Usage
Date Fri, 11 Feb 2005 22:19:16 GMT
Bingo!  I used the InputStreamReader with an explicit encoding and that 
fixed the index.  Boy, it's tough to catch all the holes through which 
Unicode leaks!


From: aurora
Date: February 9, 2005 11:04:35 PM MST
Subject: Re: Lucene Unicode Usage

So you have a UTF-8 encoded text file. But how do you read the file into 
Java? Java's default encoding is likely to be something other than 
UTF-8. Make sure you specify the encoding, like:

   new InputStreamReader(new FileInputStream(filename), "UTF-8");
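
For anyone who wants the one-liner above in context, here is a minimal sketch of reading a whole file with an explicit UTF-8 decoder. The class name Utf8FileReader and the helper readUtf8 are made up for illustration, and the sketch assumes a newer Java than was current when this thread was written (it uses StandardCharsets and try-with-resources); on older JVMs the "UTF-8" string form shown above does the same job.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of Lucene.
public class Utf8FileReader {

    // Reads the entire file as text, always decoding the bytes as UTF-8
    // instead of relying on the platform default encoding.
    static String readUtf8(String filename) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(filename), StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 sample (tab-separated, with accented letters),
        // then read it back through the explicit decoder.
        String sample = "caf\u00e9\tna\u00efve\n";
        Path tmp = Files.createTempFile("utf8-sample", ".txt");
        Files.write(tmp, sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(readUtf8(tmp.toString()).equals(sample)); // true
        Files.delete(tmp);
    }
}
```

The resulting String (or the Reader itself) is what you would hand to the indexing code, so the text is already correct Unicode before the analyzer sees it.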


From: Andrzej Bialecki
Date: February 10, 2005 2:54:56 AM MST
To: Lucene Users List
Subject: Re: Lucene Unicode Usage

Owen Densmore wrote:
> I'm building an index from a FileMaker database by dumping the data to 
> a tab-separated file.  Because the FileMaker output is encoded in 
> MacRoman, and uses Mac line separators, I run a script across the tab 
> file to clean it up:
>     tr '\r\v' '\n ' | iconv -f MAC -t UTF-8
> This basically converts the Mac \r's to \n's, replaces FileMaker's 
> vtabs (for inter-field CRs) with blanks, and runs a character 
> converter to build utf-8 data for Java to use.  Looks fine in jEdit 
> and BBEdit, both of which understand UTF.

However, it matters how you read the files into your Java 
application. Did you use InputStreamReader with the default platform 
encoding (probably ISO-8859-1), or did you specify UTF-8 explicitly?
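
To make the difference concrete, here is a small sketch of what happens when UTF-8 bytes are decoded with the wrong charset. WrongCharsetDemo and decode are illustrative names only, and ISO-8859-1 stands in for "whatever the platform default happens to be"; the actual default on a given machine may differ, which is why the sketch also prints it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene or Luke.
public class WrongCharsetDemo {

    // Decodes raw bytes into a String using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is the two bytes 0xC3 0xA9 in UTF-8.
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 1: one character, as intended
        System.out.println(wrong.length()); // 2: each byte became its own character
        System.out.println(wrong.equals("\u00c3\u00a9")); // true

        // The charset a plain InputStreamReader(InputStream) would pick up.
        System.out.println(Charset.defaultCharset());
    }
}
```

If the wrong decoding is what went into the index, the terms themselves contain the extra characters, so no choice of display font will repair them.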

> BUT -- when I look at the indexes created in Lucene using Luke, I get 
> unprintable letters!  Writing programs to dump the terms (using Writer

By default Luke uses the standard platform-specific font "dialog". On 
Windows this font doesn't support Unicode glyphs, so you will see just 
blanks (or rectangles). In the upcoming release you will be able to 
select the display font.

Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
