hc-httpclient-users mailing list archives

From "Julius Davies" <juliusdav...@cucbc.com>
Subject RE: how to detect charset encoding from "meta http-equiv" ?
Date Sun, 19 Nov 2006 03:46:26 GMT
Hi, Hakka,

As far as I remember the HTML spec, the first part of the HTML content (<html><head><meta>...)
should all be basic ASCII (bytes 0 - 127).  So you can try reading the first KB or so until
you encounter the <meta> tag.

Then you'll have to re-read with the encoding you've extracted!

I think almost every commonly used encoding agrees with ASCII on bytes 0 - 127 (UTF-16 and
UTF-32 are notable exceptions).  It's only when the high bit of a byte is set (128 - 255)
that things get exciting.
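
Here is a rough sketch of that two-pass idea, not a definitive implementation. It assumes the whole response body has already been read into a byte array (for example from HttpClient's response stream). The class name MetaCharsetSniffer, the 1024-byte limit and the fallback parameter are made up for illustration. The first pass decodes the leading bytes as ISO-8859-1, which maps every byte to exactly one character and so cannot fail on bytes above 127.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {
    // Hypothetical limit: only the first KB or so is inspected.
    private static final int SNIFF_LIMIT = 1024;

    // Matches charset=NAME in already-lowercased text, with an optional quote.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([a-z0-9_.:+-]+)");

    /** Returns the charset named near the start of the body, or the fallback. */
    public static Charset sniff(byte[] body, Charset fallback) {
        int len = Math.min(body.length, SNIFF_LIMIT);
        // First pass: ISO-8859-1 decodes any byte sequence without error,
        // and the ASCII range (0 - 127) comes out unchanged.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        Matcher m = CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Unknown or malformed name: use the fallback below.
            }
        }
        return fallback;
    }

    /** Second pass: re-read the whole body with the extracted encoding. */
    public static String decode(byte[] body, Charset fallback) {
        return new String(body, sniff(body, fallback));
    }
}
```

Something like MetaCharsetSniffer.decode(bytes, StandardCharsets.ISO_8859_1) would then give you the page text. The pattern accepts any "charset=" in the first KB, so it is deliberately loose.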

Good luck!

You'll probably need to support all combinations of lower-case and upper-case (since all are
possible in HTML 4):

<meta>
<metA>
<meTa>
<meTA>
<mEta>
<mEtA>
<mETa>
<mETA>
<Meta>
<MetA>
<MeTa>
<MeTA>
<MEta>
<MEtA>
<METa>
<META>

Maybe it's best just to convert whatever you find to lowercase before trying to extract
the "http-equiv".
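
An alternative to lowercasing is a case-insensitive regular expression, which covers all sixteen spellings above in one pattern and leaves the attribute value in its original case. The following is only a sketch: the class name MetaTagFinder and its method are invented for illustration, and a regex is not a real HTML parser, so unusual markup may slip past it.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaTagFinder {
    // CASE_INSENSITIVE lets <meta>, <META>, <MeTa>, ... all match.
    private static final Pattern META_CONTENT_TYPE = Pattern.compile(
            "<meta\\s[^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Picks the quoted value of the content attribute out of a matched tag.
    private static final Pattern CONTENT_ATTR = Pattern.compile(
            "\\scontent\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the content value of the first http-equiv Content-Type meta tag. */
    public static Optional<String> contentAttribute(String html) {
        Matcher tag = META_CONTENT_TYPE.matcher(html);
        if (!tag.find()) {
            return Optional.empty();
        }
        Matcher attr = CONTENT_ATTR.matcher(tag.group());
        return attr.find() ? Optional.of(attr.group(1)) : Optional.empty();
    }
}
```

The returned string still contains the whole attribute value (media type plus "charset=..."), so the charset name has to be split off afterwards.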


yours,

Julius

http://juliusdavies.ca/

-----Original Message-----
From:	Hakka Ville [mailto:vhakka@gmail.com]
Sent:	Sat 11/18/2006 4:44 AM
To:	httpclient-user@jakarta.apache.org
Cc:	
Subject:	how to detect charset encoding from "meta http-equiv" ?

Dear Sirs,

I am trying to use HttpClient. The server doesn't set the encoding in the HTTP response
header, but does set it in the page itself with "meta http-equiv". How can I tell
HttpClient to detect the (Cyrillic) encoding from that?

Cheers,
Hakka





