nutch-user mailing list archives

From Andy Liu <>
Subject Re: Charset encoding
Date Wed, 18 May 2005 13:08:33 GMT
Sometimes web pages do not declare the encoding they are in. In
these cases, the client has to "guess" the encoding. Nutch currently
does not have a guessing algorithm, so if it encounters one of these
pages, it just decodes the page using the encoding named by the
parser.character.encoding.default parameter.
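
To make the fallback concrete, here is a rough sketch in plain Java. It
is not Nutch's actual code: the class name, the method names, and the
ISO-8859-1 value standing in for parser.character.encoding.default are
all assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class DefaultEncodingFallback {

    // Assumption: stands in for the configured parser.character.encoding.default.
    static final Charset DEFAULT_ENCODING = StandardCharsets.ISO_8859_1;

    /** Returns the charset named in a Content-Type value, or null if none is declared. */
    static String declaredCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                return name.isEmpty() ? null : name;
            }
        }
        return null;
    }

    /** Decodes with the declared charset if usable, otherwise with the default. */
    static String decode(byte[] raw, String contentType) {
        Charset cs = DEFAULT_ENCODING;
        String name = declaredCharset(contentType);
        if (name != null) {
            try {
                cs = Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the default.
            }
        }
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(raw, "text/html; charset=UTF-8"));
        System.out.println(decode(raw, "text/html"));
    }
}
```

With no declared charset, the UTF-8 bytes for "é" are decoded as two
ISO-8859-1 characters ("Ã©"), which is the kind of strange output you
get when the default does not match the page.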

Probably the best thing to do is to port over Mozilla's algorithm.  I
know there's a port called jcharset, but I've tested it a few times
and it does not seem very accurate, for reasons unknown.  I haven't had
the chance to dig into the issue too deeply.
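
For a sense of what even a crude guess looks like, here is a minimal
sketch in plain Java. To be clear, this is NOT Mozilla's algorithm and
not a port of it: it only checks for a byte-order mark, then tests
whether the bytes are valid UTF-8, and otherwise returns a
caller-supplied fallback. The class and method names are made up for
illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class NaiveCharsetGuess {

    /** Guesses a charset from raw bytes; returns fallback when nothing matches. */
    static Charset guess(byte[] raw, Charset fallback) {
        // 1. Byte-order marks are the strongest signal.
        if (raw.length >= 3
                && (raw[0] & 0xFF) == 0xEF
                && (raw[1] & 0xFF) == 0xBB
                && (raw[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (raw.length >= 2) {
            int b0 = raw[0] & 0xFF;
            int b1 = raw[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return StandardCharsets.UTF_16BE;
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return StandardCharsets.UTF_16LE;
            }
        }
        // 2. Bytes that decode strictly as UTF-8 are treated as UTF-8.
        //    Pure ASCII also passes this check.
        if (isValidUtf8(raw)) {
            return StandardCharsets.UTF_8;
        }
        // 3. Otherwise give up and use the configured default.
        return fallback;
    }

    /** True if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(utf8, StandardCharsets.ISO_8859_1));
        System.out.println(guess(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

A check like this cannot tell single-byte encodings apart (ISO-8859-1
vs. windows-1251, say), which is where a statistical detector along the
lines of Mozilla's would be needed.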

On 5/18/05, k-team <> wrote:
> hi guys,
> we have indexed some pages and noticed that the results of
> the search are not displayed correctly by our browser. the encoding
> in search.jsp is utf-8 and the browser is set to utf-8 encoding, but
> we get strange chars.
> we have also set parser.character.encoding.default in
> nutch-default.xml to utf-8.
> does anyone know what we are missing?
> ciao,
> KTeam
