lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jennifer May <je...@sino.uni-heidelberg.de>
Subject HTMLParser and Chinese
Date Fri, 14 Sep 2007 10:24:11 GMT
Hello!

I want to index an HTML document with the lucene demo, but have problems 
parsing some Chinese files.

I changed code in the HTMLDocument class as to be able to define the 
encoding of the document to be parsed:
InputStreamReader fis = new InputStreamReader(new FileInputStream(f), 
IndexHTML.encoding);
HTMLParser parser = new HTMLParser(fis);

It works fine for most of my files in GB, Big5 or UTF-8. However, I get 
the following exception for some of my files:
Parse Aborted: Lexical error at line 6, column 24. Encountered: "\u4f53" 
(20307), after : ""

The HTML document looks like this:

<HTML><HEAD><meta http-equiv="Content-Type" content="text/html; charset=GB2312"><TITLE>刘先生(阿成)</TITLE>
<META NAME="keywords" CONTENT="阿成 魂游天国 刘先生">...

Obviously, the Chinese in the meta-tag is the problem. But why? And how 
to solve it?

JTidy parses the same file without errors, but than I have problems with 
the indexing as the JTidyparser takes only InputStreams without 
specified encoding, not InputStreamReaders (at least as far as I found 
out). Even if I convert my file from the original GB to UTF-8 I get only 
gibberish in the Lucene index when using JTidy for parsing.

Thanks in advance for any suggestions either to get around the 
HTMLParser problem or get JTidy to handle different encodings,
Jenny

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message