lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <>
Subject Re: Indexing multiple languages
Date Fri, 03 Jun 2005 10:03:31 GMT

On Jun 2, 2005, at 9:06 PM, Bob Cheung wrote:
> Btw, I did try running the lucene demo (web template) to index the  
> files after I added one including English and Chinese characters.   
> I was
> not able to search for any Chinese in that HTML file (returned no  
> hits).
> I wonder whether I need to change some of the java programs to index
> Chinese and/or accept Chinese as search term.  I was able to search  
> for
> the HTML file if I used English word that appeared in the added HTML
> file.

Bob - Andy provided thorough information on the StandardAnalyzer  
issue (in short, it deals with Unicode directly not encodings).  As  
for the Lucene demo - you will have to adjust it to read the files in  
the proper encoding.  The IndexFiles program indexes files using the  
default encoding which won't be sufficient for your purpose.  The two  
files to check are HtmlDocument and FileDocument.  These files read  
the HTML and text files that the demo indexes.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message