nutch-user mailing list archives

From MilleBii <mille...@gmail.com>
Subject HTML parsing and charset for Polish
Date Wed, 16 Sep 2009 14:24:28 GMT
Not sure where to look for explanations:

I have a problem with some Polish pages: I cannot index them properly
because of specific Polish characters such as:
&#321;

The pages declare the following charset: charset=windows-1252

Does the HTML parser convert them into their Unicode equivalents?
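
To illustrate the conversion I am asking about, here is a minimal sketch in
plain Python (standard library only). It is not Nutch's parser, so whether
Nutch behaves the same way is an assumption; the helper name fits_in_cp1252
is made up for this example. It decodes the numeric character reference and
checks whether the result can be represented in windows-1252 at all.

```python
import html

# The numeric character reference as it appears in the page source.
entity = "&#321;"

# Decode the reference into the Unicode character it names (code point 321 = U+0141).
decoded = html.unescape(entity)


def fits_in_cp1252(text):
    """Return True if every character of text can be encoded as windows-1252."""
    try:
        text.encode("windows-1252")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print(decoded, hex(ord(decoded)), fits_in_cp1252(decoded))
```

If the character cannot be encoded in windows-1252, that would explain why
the pages carry it as a numeric reference rather than as a raw byte, and the
indexed text would only be correct if the parser decodes such references.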

-- 
-MilleBii-
