nutch-user mailing list archives

From Dawid Weiss <dawid.we...@gmail.com>
Subject Re: HTML parsing and charset for Polish
Date Wed, 23 Sep 2009 21:05:01 GMT
Can you provide the HTTP headers and the HEAD section of the HTML of a Web
page for which Nutch fails? Perhaps there is an inconsistency between the
HTTP and META headers, or a misspelled codepage? Just a wild guess, but
believe me, Java converts fine between Cp1250, ISO-8859-2 and its
internal UTF-16, so there must be something wrong elsewhere.
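
A minimal sketch of how the conversion claim could be checked outside Nutch.
It does not use any Nutch code; the class name PolishCharsetCheck and the
helper names decode and isKnownCharset are hypothetical, chosen only for
illustration. It encodes the Polish diacritics in each candidate charset,
decodes them back, and also checks whether a charset label is recognized by
the JVM at all (which is where a misspelled codepage would show up).

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class PolishCharsetCheck {
    // Decode raw page bytes using the charset label the page declares.
    static String decode(byte[] raw, String charsetLabel) {
        return new String(raw, Charset.forName(charsetLabel));
    }

    // True if the JVM recognizes the label; a misspelled codepage returns false.
    static boolean isKnownCharset(String label) {
        try {
            return Charset.isSupported(label);
        } catch (IllegalCharsetNameException e) {
            return false; // label contains characters that are illegal in a charset name
        }
    }

    public static void main(String[] args) {
        // The diacritics from this thread, written as escapes so the result
        // does not depend on the encoding of this source file.
        String sample = "\u0119\u00f3\u0142\u0105\u015b\u0107\u017c\u0144";

        // Round trip through each encoding a Polish page is likely to use.
        for (String label : new String[] {"windows-1250", "ISO-8859-2", "UTF-8"}) {
            byte[] raw = sample.getBytes(Charset.forName(label));
            System.out.println(label + " round trip ok: "
                    + decode(raw, label).equals(sample));
        }

        // Bytes written as windows-1250 but decoded with a wrong label.
        byte[] cp1250 = sample.getBytes(Charset.forName("windows-1250"));
        System.out.println("windows-1250 bytes read as windows-1252 match: "
                + decode(cp1250, "windows-1252").equals(sample));

        // A correctly spelled label versus a deliberately misspelled one.
        System.out.println("windows-1250 known: " + isKnownCharset("windows-1250"));
        System.out.println("windows-12500 known: " + isKnownCharset("windows-12500"));
    }
}
```

If the round trips succeed but the mismatched decode does not, the likely
culprit is the charset label the page or server declares rather than the
conversion itself.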

Dawid

On Wed, Sep 23, 2009 at 3:09 PM, MilleBii <millebii@gmail.com> wrote:
> At last someone answers.
> Correct CP1250.
> My pages look fine in the browsers of course, but it does not mean Nutch
> handles them properly.
>
> What I'm wondering is whether the Nutch HTML parser reads them properly,
> because when I search for such characters it fails on ISO-8859-2 or
> CP1250 pages, but not on UTF-8 encoded pages, from what I could see.
> Nutch uses Java Strings (i.e. Unicode) internally, but I wonder if there
> could be a problem in the conversion from the page encoding into Unicode.
>
> I did not have time to dig into the details of the matter; I wonder if
> anyone has come across the issue and/or solved it.
>
> 2009/9/23 Dawid Weiss <dawid.weiss@gmail.com>
>
>> Polish Web sites use Cp1250 (windows-1250) or iso8859-2 (or UTF-8 of
>> course). Check if diacritics like these:
>>
>> ęółąśćżń
>>
>> look all right in each of the above encodings and use the appropriate one.
>>
>> Dawid
>>
>> On Wed, Sep 16, 2009 at 4:47 PM, MilleBii <millebii@gmail.com> wrote:
>> > same thing when there is
>> > charset=ISO-8859-2
>> >
>> > 2009/9/16 MilleBii <millebii@gmail.com>
>> >
>> >> Not sure where to look for explanations:
>> >>
>> >> I have a problem with some Polish pages which I cannot index properly
>> >> on the specific Polish characters such as:
>> >> &#321;
>> >>
>> >> They have the following: charset=windows-1252
>> >>
>> >> Does the HTML parser convert them into their Unicode equivalents?
>> >>
>> >> --
>> >> -MilleBii-
>> >>
>> >
>> >
>> >
>> > --
>> > -MilleBii-
>> >
>>
>
>
>
> --
> -MilleBii-
>
