lucene-java-user mailing list archives

From Owen Densmore <>
Subject Lucene Unicode Usage
Date Thu, 10 Feb 2005 05:32:38 GMT
I'm building an index from a FileMaker database by dumping the data to 
a tab-separated file.  Because the FileMaker output is encoded in 
MacRoman, and uses Mac line separators, I run a script across the tab 
file to clean it up:
	tr '\r\v' '\n ' | iconv -f MAC -t UTF-8
This converts the Mac \r's to \n's, replaces FileMaker's vtabs (which 
it uses for carriage returns inside a field) with blanks, and runs a 
character converter to produce UTF-8 data for Java to use.  The result 
looks fine in jEdit and BBEdit, both of which understand UTF-8.
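
For anyone who would rather do this cleanup step inside Java, a rough sketch of the same transformation might look like the following. The class and method names (TabFileCleaner, normalize, convert) are made up for illustration, and it assumes a JDK recent enough to have java.nio.charset.StandardCharsets (Java 7 or later); it is not the script actually used here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper mirroring: tr '\r\v' '\n ' | iconv -f MAC -t UTF-8
final class TabFileCleaner {

    // Same mapping as tr '\r\v' '\n ': CR becomes LF, vertical tab becomes a space.
    static String normalize(String text) {
        return text.replace('\r', '\n').replace('\u000B', ' ');
    }

    // Decode the raw bytes in the source charset, normalize the
    // separators, and re-encode the result as UTF-8.
    static byte[] convert(byte[] raw, Charset source) {
        String decoded = new String(raw, source);
        return normalize(decoded).getBytes(StandardCharsets.UTF_8);
    }
}
```

The source charset is passed in rather than hard-coded because the name Java uses for MacRoman (probably "x-MacRoman", via Charset.forName) depends on the extended charsets being present in the JDK, which is worth checking on the machine in question.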

BUT -- when I look at the indexes created in Lucene using Luke, I get 
unprintable characters!  Writing programs to dump the terms (using 
Writer subclasses which handle Unicode correctly) shows that the dumped 
terms do indeed contain odd characters when viewed with jEdit and BBEdit.
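
One way to see exactly which characters ended up in a term, independent of any editor's font or encoding guess, is to print the code units numerically. The sketch below is only illustrative; TermDump and codeUnits are hypothetical names, not part of Lucene or Luke.

```java
// Hypothetical diagnostic helper: shows each UTF-16 code unit of a term
// as U+XXXX so unexpected characters are easy to spot.
final class TermDump {

    static String codeUnits(String term) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) term.charAt(i)));
        }
        return sb.toString();
    }
}
```

If a term that should contain a single accented letter such as é (U+00E9) instead shows up as two units, for example U+00C3 U+00A9, that would suggest the UTF-8 bytes were decoded with a single-byte charset somewhere before the text reached the index.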

The analyzer used to build the index looks like:
     import java.io.Reader;
     import org.apache.lucene.analysis.*;
     import org.apache.lucene.analysis.standard.*;

     public class RedfishAnalyser extends Analyzer {
       String[] stopwords;
       public RedfishAnalyser(String[] stopwords) {
         this.stopwords = stopwords;
       }
       public RedfishAnalyser() {
         this.stopwords = StopAnalyzer.ENGLISH_STOP_WORDS;
       }
       // Tokenize, lowercase, drop stop words, then stem.
       public TokenStream tokenStream(String fieldName, Reader reader) {
         return new PorterStemFilter(
             new StopFilter(
                 new LowerCaseFilter(
                     new StandardFilter(
                         new StandardTokenizer(reader))),
                 stopwords));
       }
     }

Yikes, what am I doing wrong?!  Is the analyzer at fault?  It's about 
the only place where I can see a problem happening.

Thanks for any pointers,

