nutch-user mailing list archives

From MilleBii <mille...@gmail.com>
Subject Re: HTML parsing and charset for Polish
Date Wed, 23 Sep 2009 13:09:04 GMT
At last someone answers.
Correct, it is CP1250.
My pages look fine in browsers of course, but that does not mean Nutch
handles them properly.

What I'm wondering is whether the Nutch HTML parser reads them properly,
because a search on such characters fails on pages encoded in ISO-8859-2 or
CP1250, but not on UTF-8 encoded pages, from what I could see.
Nutch uses Java Strings (i.e. Unicode) internally, but I wonder if there is
a problem in the conversion from the page encoding into Unicode.
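
To make the question concrete, here is a minimal sketch in plain Java of what
that conversion step amounts to. It is not Nutch code: the class CharsetDemo
and the helper decode are hypothetical names used only for illustration. It
decodes the same raw byte under three charset labels to show how the result
depends on which label is applied.

```java
import java.nio.charset.Charset;

class CharsetDemo {
    // Hypothetical helper: decode raw page bytes with the charset the page declares.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0xA3 is the byte for 'Ł' (U+0141) in both windows-1250 and ISO-8859-2.
        byte[] raw = { (byte) 0xA3 };

        // Print the code point produced under each charset label.
        for (String cs : new String[] { "windows-1250", "ISO-8859-2", "windows-1252" }) {
            String s = decode(raw, cs);
            System.out.printf("%-12s -> U+%04X%n", cs, (int) s.charAt(0));
        }
    }
}
```

Decoded as windows-1252 the same byte becomes the pound sign (U+00A3) instead
of Ł, so if a page were decoded with the wrong charset, a query containing Ł
would presumably not match what was indexed.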

I have not had time to dig into the details of the matter. I wonder if
anyone has come across the issue and/or solved it.

2009/9/23 Dawid Weiss <dawid.weiss@gmail.com>

> Polish Web sites use Cp1250 (windows-1250) or iso8859-2 (or UTF-8 of
> course). Check if diacritics like these:
>
> ęółąśćżń
>
> look all right in the above encodings and use appropriately.
>
> Dawid
>
> On Wed, Sep 16, 2009 at 4:47 PM, MilleBii <millebii@gmail.com> wrote:
> > same thing when there is
> > charset=ISO-8859-2
> >
> > 2009/9/16 MilleBii <millebii@gmail.com>
> >
> >> Not sure where to look for explanations:
> >>
> >> I have a problem with some Polish pages which I can not index properly on
> >> the specific polish characters such as :
> >> Ł
> >>
> >> They are having the following charset=windows-1252
> >>
> >> Does the HTML parser convert them into their Unicode equivalent ....
> >>
> >> --
> >> -MilleBii-
> >>
> >
> >
> >
> > --
> > -MilleBii-
> >
>



-- 
-MilleBii-
