lucene-java-user mailing list archives

From mchaput <mcha...@aw.sgi.com>
Subject Re: HTMLParser choking on Unicode
Date Tue, 08 Apr 2003 18:04:45 GMT
Excellent! Thanks very much, Eric. Sorry to the list if this was too 
basic... I'm very new to the world of non-Latin LP.

Cheers!

Eric Isakson wrote:
> I'm using HTML Parser to parse Japanese content with no troubles. Be sure to set the
> encoding when you read the HTML files. I have a method I use to get the Reader object:
> 
>     public BufferedReader getReader() throws IOException {
>         InputStream in = getInputStream();
>         return new BufferedReader(new InputStreamReader(in, getCharset()));
>     }
> 
> getInputStream() gets my input stream from a FileInputStream(File) or a
> JarFile.getInputStream(JarEntry), and getCharset() returns the encoding name.
> My object keeps track of the language of the content, and in my application
> all the content for a given language is required to use a specific encoding, so I keep a
> Hashtable of language to encoding. For Japanese, we use shift_jis as the encoding and
> things are working fine.
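
For anyone who wants to try the quoted advice in isolation, here is a minimal sketch of it, not Eric's actual code. The class name, the method names charsetFor() and openReader(), and the "ja" language key are all hypothetical, and a HashMap stands in for the Hashtable he mentions. Only the Japanese-to-Shift_JIS mapping comes from the message. It assumes a JDK whose extended charsets include Shift_JIS.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

class EncodedReaderSketch {
    // Hypothetical language-to-encoding table. Only the Japanese entry
    // is taken from the quoted message; add others as your content requires.
    private static final Map<String, String> ENCODINGS = new HashMap<>();
    static {
        ENCODINGS.put("ja", "Shift_JIS");
    }

    // Look up the charset registered for a language code.
    public static Charset charsetFor(String language) {
        String name = ENCODINGS.get(language);
        if (name == null) {
            throw new IllegalArgumentException(
                "No encoding registered for language: " + language);
        }
        return Charset.forName(name);
    }

    // Wrap a byte stream in a Reader that decodes with an explicit charset,
    // rather than relying on the platform default encoding.
    public static BufferedReader openReader(InputStream in, String language) {
        return new BufferedReader(
            new InputStreamReader(in, charsetFor(language)));
    }

    public static void main(String[] args) throws IOException {
        // Three kanji written as Unicode escapes so the source file's own
        // encoding does not matter.
        String text = "\u65e5\u672c\u8a9e";
        byte[] bytes = text.getBytes(charsetFor("ja"));

        try (BufferedReader reader =
                 openReader(new ByteArrayInputStream(bytes), "ja")) {
            // Report whether the decoded line matches the original string.
            System.out.println(text.equals(reader.readLine()));
        }
    }
}
```

The Reader returned by openReader() is what would be handed to the HTML parser in place of one built with the default encoding. The point of the sketch is only that the charset is chosen explicitly at the moment the bytes are turned into characters.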


-- 
                       |
Matt Chaput           |   A l i a s | W a v e f r o n t
Information Designer  |   210 King St. E. Toronto, ON, Canada M5A 1J7
mchaput@aw.sgi.com    |   (416) 874-8268
                       |
"A goddamned ray of sunshine all the goddamned time" --Sparkle Hayter



