lucene-java-user mailing list archives

From "Eric Isakson" <Eric.Isak...@sas.com>
Subject RE: HTMLParser choking on Unicode
Date Tue, 08 Apr 2003 17:23:47 GMT
I'm using HTMLParser to parse Japanese content with no trouble. Be sure to set the encoding
when you read the HTML files. I use this method to get the Reader object:

    // Wrap the raw byte stream in a Reader that decodes it with the content's encoding.
    public BufferedReader getReader() throws IOException {
        InputStream in = getInputStream();
        return new BufferedReader(new InputStreamReader(in, getCharset()));
    }

getInputStream() returns the input stream, which comes from either a FileInputStream(File) or
JarFile.getInputStream(JarEntry).

getCharset() returns the encoding for the content's language. My object keeps track of the
language of the content, and in my application all the content for a given language is required
to use a specific encoding, so I keep a Hashtable mapping language to encoding. For Japanese we
use Shift_JIS as the encoding, and things are working fine.
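
As a rough illustration of the lookup described above, here is a minimal sketch. The class name,
the language codes, and the English entry are assumptions made up for this example; only the
Japanese-to-Shift_JIS mapping comes from the description above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Hashtable;
import java.util.Map;

class LanguageEncodings {
    // Hypothetical language-to-encoding table. Only the Japanese entry is taken
    // from the message; the English entry is an assumption for illustration.
    private static final Map<String, String> ENCODINGS = new Hashtable<>();
    static {
        ENCODINGS.put("ja", "Shift_JIS");
        ENCODINGS.put("en", "ISO-8859-1");
    }

    // Look up the encoding registered for a language code, failing loudly if none is known.
    static String charsetFor(String language) {
        String charset = ENCODINGS.get(language);
        if (charset == null) {
            throw new IllegalArgumentException("No encoding registered for language: " + language);
        }
        return charset;
    }

    // Decode the byte stream with the encoding registered for the given language.
    static BufferedReader openReader(InputStream in, String language) throws IOException {
        return new BufferedReader(new InputStreamReader(in, charsetFor(language)));
    }

    public static void main(String[] args) throws IOException {
        String text = "\u65E5\u672C\u8A9E"; // the word "Japanese" written in Japanese
        byte[] bytes = text.getBytes("Shift_JIS");
        BufferedReader reader = openReader(new ByteArrayInputStream(bytes), "ja");
        // Prints whether the decoded line matches the original text.
        System.out.println(text.equals(reader.readLine()));
    }
}
```

The Reader returned by openReader() is what would be handed to HTMLParser in place of one
created with the platform default encoding.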

If you don't know the encoding of your HTML file up front, you have to do some more work to
determine the encoding before you hand the Reader to HTMLParser.
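
One possible way to do that extra work, sketched under assumptions: look at the first few
kilobytes of the file for a byte order mark or a charset declared in a <meta> tag, and fall back
to a default otherwise. The class and method names here are hypothetical, and this is only a
heuristic, not a complete detector (it ignores HTTP headers and does no statistical guessing).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffer {
    // Matches a charset declared inside a <meta> tag, for example
    // <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_.:-]*)",
            Pattern.CASE_INSENSITIVE);

    // Guess the encoding from the leading bytes of an HTML file.
    // Returns the fallback when nothing usable is found.
    static String sniff(byte[] head, String fallback) {
        // UTF-8 byte order mark: EF BB BF
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        // UTF-16 byte order mark: FE FF (big-endian) or FF FE (little-endian).
        // Java's "UTF-16" decoder reads the mark itself to pick the byte order.
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                return "UTF-16";
            }
        }
        // Decode as ISO-8859-1 so every byte maps to exactly one char; the markup
        // itself is plain ASCII in the common ASCII-compatible encodings.
        String markup = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(markup);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return m.group(1);
        }
        return fallback;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=Shift_JIS\"></head></html>";
        byte[] head = html.getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(head, "ISO-8859-1"));
    }
}
```

The caller would read the start of the file into a byte array, pass it to sniff(), and then
reopen the stream with new InputStreamReader(in, charset). Note that Java's UTF-8 decoder does
not strip a byte order mark, so the caller may want to skip those three bytes first.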

Eric
--
Eric D. Isakson        SAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies       Cary, NC 27513
(919) 531-3639         http://www.sas.com

-----Original Message-----
From: mchaput [mailto:mchaput@aw.sgi.com] 
Sent: Tuesday, April 08, 2003 12:55 PM
To: lucene-user@jakarta.apache.org
Subject: HTMLParser choking on Unicode


When I try to index Japanese HTML files using HTMLParser, I just get "lexical errors" in every
file:

   Parse Aborted: Lexical error at line 12, column 28.
   Encountered: "\u2030" (8240), after : ""

Is there something I have to do to make HTMLParser work with Unicode?

(I haven't done anything special with readers or encodings (don't really know much about it)...
is that the problem?)

Thanks,

Matt


-- 
                       |
Matt Chaput           |   A l i a s | W a v e f r o n t
Information Designer  |   210 King St. E. Toronto, ON, Canada M5A 1J7
mchaput@aw.sgi.com    |   (416) 874-8268
                       |
"A goddamned ray of sunshine all the goddamned time" --Sparkle Hayter



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org



