nutch-user mailing list archives

From MilleBii <mille...@gmail.com>
Subject Re: HTML parsing and charset for Polish
Date Wed, 16 Sep 2009 14:47:41 GMT
The same thing happens when the page declares
charset=ISO-8859-2
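
For reference, here is a minimal Python sketch (not Nutch code, only an illustration of the character-set facts involved). It assumes the pages contain either the numeric reference &#321; or a raw byte for the same letter. The helper name can_encode is made up for this example.

```python
import html

# "&#321;" is a numeric character reference for U+0141
# (LATIN CAPITAL LETTER L WITH STROKE). It names a Unicode code point,
# so its meaning does not depend on the page's declared charset.
decoded = html.unescape("&#321;")


def can_encode(text, charset):
    """Return True if every character of text exists in charset (hypothetical helper)."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# windows-1252 has no slot for this letter, while ISO-8859-2 does (byte 0xA3).
in_cp1252 = can_encode(decoded, "windows-1252")
in_latin2 = can_encode(decoded, "iso-8859-2")

# The same raw byte read under the two charsets gives different characters,
# which is one way a wrongly declared charset can corrupt indexed text.
as_latin2 = b"\xa3".decode("iso-8859-2")
as_cp1252 = b"\xa3".decode("windows-1252")

if __name__ == "__main__":
    print(repr(decoded), in_cp1252, in_latin2, repr(as_latin2), repr(as_cp1252))
```

If the pages use the numeric reference, a parser that resolves references should produce the Unicode character whatever the declared charset is. If they use raw bytes, the declared charset has to match the real encoding.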

2009/9/16 MilleBii <millebii@gmail.com>

> Not sure where to look for explanations:
>
> I have a problem with some Polish pages: the Polish-specific characters are
> not indexed properly, for example:
> &#321;
>
> The pages declare the following: charset=windows-1252
>
> Does the HTML parser convert them into their Unicode equivalents?
>
> --
> -MilleBii-
>



-- 
-MilleBii-
