lucene-java-user mailing list archives

From aurora <auror...@gmail.com>
Subject Re: Lucene Unicode Usage
Date Thu, 10 Feb 2005 06:04:35 GMT
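So you have a UTF-8 encoded text file. But how do you read the file into
Java? Java's default encoding depends on the platform and is likely to be
something other than UTF-8, in which case the bytes are decoded into the
wrong characters. Make sure you specify the encoding explicitly, like: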

   Reader reader = new InputStreamReader(new FileInputStream(filename), "UTF-8");
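
A slightly fuller sketch of the same idea, in case it helps to see the effect. This is only an illustration, not taken from your code: the class name Utf8ReadSketch and the helper readAll are made up, and it uses try-with-resources, so it assumes Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical helper class, for illustration only.
class Utf8ReadSketch {

    // Reads the whole file as text, decoding its bytes with the named charset
    // instead of relying on the platform default encoding.
    static String readAll(String filename, String charsetName) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(filename), charsetName))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java Utf8ReadSketch <file>");
            return;
        }
        // Decode explicitly as UTF-8, whatever the platform default is.
        System.out.println(readAll(args[0], "UTF-8"));
    }
}
```

If the file holds the UTF-8 bytes for "café", reading it with "UTF-8" gives back those four characters, while reading it with "ISO-8859-1" turns the two bytes of the é into two unrelated characters. That kind of mismatch is presumably what shows up as odd letters in the index.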


On Wed, 9 Feb 2005 22:32:38 -0700, Owen Densmore <owen@backspaces.net>  
wrote:

> I'm building an index from a FileMaker database by dumping the data to a  
> tab-separated file.  Because the FileMaker output is encoded in  
> MacRoman, and uses Mac line separators, I run a script across the tab  
> file to clean it up:
> 	tr '\r\v' '\n ' | iconv -f MAC -t UTF-8
> This basically converts the Mac \r's to \n's, replaces FileMaker's vtabs  
> (for inter-field CRs) with blanks, and runs a character converter to  
> build utf-8 data for Java to use.  Looks fine in jEdit and BBEdit, both  
> of which understand UTF.
>
> BUT -- when I look at the indexes created in Lucene using Luke, I get  
> unprintable letters!  Writing programs to dump the terms (using Writer  
> subclasses which handle unicode correctly) shows that indeed the files  
> now have odd characters when viewed w/ jEdit and BBEdit.
>
> The analyzer used to build the index looks like:
>      public class RedfishAnalyser extends Analyzer {
>        String[] stopwords;
>        public RedfishAnalyser(String[] stopwords) {
>          this.stopwords = stopwords;
>        }
>        public RedfishAnalyser() {
>          this.stopwords = StopAnalyzer.ENGLISH_STOP_WORDS;
>        }
>        public TokenStream tokenStream(String fieldName, Reader reader) {
>          return new PorterStemFilter(
>              new StopFilter(
>                  new LowerCaseFilter(
>                      new StandardFilter(
>                          new StandardTokenizer(reader))),
>                 stopwords));
>        }
>      }
>
> Yikes, what am I doing wrong?!  Is the analyzer at fault?  It's about the  
> only place where I can see a problem happening.
>
> Thanks for any pointers,
>
> Owen


