lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Grant Ingersoll" <>
Subject Re: Indexing multiple languages
Date Fri, 03 Jun 2005 11:58:37 GMT

>>> 6/3/2005 6:03:31 AM >>>

On Jun 2, 2005, at 9:06 PM, Bob Cheung wrote:
> Btw, I did try running the lucene demo (web template) to index the  
> files after I added one including English and Chinese characters.   
> I was
> not able to search for any Chinese in that HTML file (returned no  
> hits).
> I wonder whether I need to change some of the java programs to index
> Chinese and/or accept Chinese as search term.  I was able to search 

> for
> the HTML file if I used English word that appeared in the added HTML
> file.

Bob - Andy provided thorough information on the StandardAnalyzer  
issue (in short, it deals with Unicode directly not encodings).  As  
for the Lucene demo - you will have to adjust it to read the files in 

the proper encoding.  The IndexFiles program indexes files using the  
default encoding which won't be sufficient for your purpose.  The two 

files to check are HtmlDocument and FileDocument.  These files read  
the HTML and text files that the demo indexes.


To unsubscribe, e-mail: 
For additional commands, e-mail: 

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message