nutch-user mailing list archives

From Dawid Weiss <dawid.we...@gmail.com>
Subject Re: HTML parsing and charset for Polish
Date Wed, 23 Sep 2009 12:24:11 GMT
Polish web sites typically use Cp1250 (windows-1250) or ISO-8859-2 (or
UTF-8, of course). Check whether diacritics like these:

ęółąśćżń

look right when the page is decoded with each of the above encodings,
and use whichever one renders them correctly.
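
A minimal sketch of that check, assuming you have the raw bytes of a page and know a string that should appear in it. The helper names (can_encode, plausible_encodings) are hypothetical and only for illustration; this is not how Nutch itself detects encodings. It also shows why a declared charset=windows-1252 cannot be accurate for these letters: that encoding has no code points for most of them.

```python
# Sample text from the message above.
POLISH_DIACRITICS = "ęółąśćżń"

# Python codec names for the encodings named in the message.
CANDIDATES = ("cp1250", "iso8859_2", "utf-8")


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def plausible_encodings(raw, expected=POLISH_DIACRITICS, candidates=CANDIDATES):
    """Return the candidates under which raw decodes to text containing expected."""
    matches = []
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        if expected in text:
            matches.append(name)
    return matches


if __name__ == "__main__":
    for name in CANDIDATES:
        raw = POLISH_DIACRITICS.encode(name)
        print(name, raw.hex(), plausible_encodings(raw))
    print("cp1252 can represent the sample:",
          can_encode(POLISH_DIACRITICS, "cp1252"))
```

With real pages you would pass the fetched bytes as raw and a word you expect to find on the page as expected. Cp1250 and ISO-8859-2 share the byte values for most Polish letters but differ for ą and ś (and ź), so a sample containing those letters is what tells the two apart.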

Dawid

On Wed, Sep 16, 2009 at 4:47 PM, MilleBii <millebii@gmail.com> wrote:
> Same thing when the page declares
> charset=ISO-8859-2
>
> 2009/9/16 MilleBii <millebii@gmail.com>
>
>> Not sure where to look for explanations:
>>
>> I have a problem with some Polish pages: I cannot index them properly
>> because of Polish-specific characters such as:
>> &#321;
>>
>> They have the following charset=windows-1252
>>
>> Does the HTML parser convert them into their Unicode equivalents?
>>
>> --
>> -MilleBii-
>>
>
>
>
> --
> -MilleBii-
>
