lucene-java-user mailing list archives

From mchaput <>
Subject Japanese search again
Date Wed, 16 Apr 2003 00:51:14 GMT
Sorry, I thought I had this but I've run into another wall...

I'm trying to index Shift_JIS-encoded Japanese HTML using the demo 

I get an InputStream (FileInputStream or ZipInputStream) from the 
file/zip entry, wrap an InputStreamReader around it with the "SJIS" 
encoding, then wrap a BufferedReader around that.
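
For concreteness, the reader chain looks roughly like the sketch below. This is a minimal reconstruction and not my actual code: the class and method names (SjisReaderSketch, openSjis, readAll) are made up for illustration, and a ByteArrayInputStream stands in for the FileInputStream/ZipInputStream. It uses the canonical charset name "Shift_JIS", for which "SJIS" is an alias in the JDK as far as I know.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper class, named for illustration only.
final class SjisReaderSketch {

    // Assumption: "Shift_JIS" is available in the running JDK.
    static final Charset SJIS = Charset.forName("Shift_JIS");

    // Wrap a raw byte stream so that it is decoded as Shift_JIS, then buffer it.
    static Reader openSjis(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, SJIS));
    }

    // Drain a reader into a String so the decoded characters can be inspected.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the file/zip entry: the character from the error
        // message below (U+8A2D), encoded as Shift_JIS bytes.
        byte[] bytes = "\u8a2d".getBytes(SJIS);
        String text = readAll(openSjis(new ByteArrayInputStream(bytes)));
        // Print the decoded code point to check that decoding worked.
        System.out.println((int) text.charAt(0));
    }
}
```

If the printed code point matches the number shown in the parser error, the bytes are being decoded correctly before they reach the parser, which would suggest the problem is not in the reader setup.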

Then I create an org.apache.lucene.demo.html.HTMLParser with the 

On every file, I get something like this:

  Parse Aborted: Lexical error at line 9, column 47.
  Encountered: "\u8a2d" (35373), after : ""

The error messages for different files report different characters at 
different positions, but the parser always seems to choke on Japanese 
text in HTML attribute values.

I've tried this on our own documents and some Japanese help from WinDVD, 
and get the same problem. Any ideas?

Thanks in advance,


Matt Chaput           |   A l i a s | W a v e f r o n t
Information Designer  |   210 King St. E. Toronto, ON, Canada M5A 1J7    |   (416) 874-8268
"A goddamned ray of sunshine all the goddamned time" --Sparkle Hayter
