nutch-user mailing list archives

From "Chee Wu" <chee...@gmail.com>
Subject Re: How can I know the Cached Web Charset
Date Thu, 08 Nov 2007 08:59:05 GMT
Many Chinese HTML pages use UTF-8, so your method might turn the summaries
of those pages into garbage in your search results, which is very
ugly...

The encoding of an HTML page is detected by HtmlParser. First, HtmlParser
tries to find charset meta information in the page head. If that
information doesn't exist, HtmlParser uses the default encoding, which can
be set in nutch-site.xml. I suggest you don't rely on the default
encoding; just discard the pages whose encoding can't be determined.

You can also use "jchardet" to detect the encoding of HTML pages. If the
encoding can't be determined from either the charset meta data or jchardet,
just discard the page.
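
The "meta first, then a detector, else discard" order described above can
be sketched roughly as below. This is only an illustration, not Nutch's or
jchardet's code: the class CharsetSniffer, its methods, and the
2000-byte sniff window are hypothetical, and a strict UTF-8 validity check
stands in for a real statistical detector such as jchardet.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or jchardet.
class CharsetSniffer {
    // Finds charset=... inside a <meta> tag, e.g.
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([a-zA-Z0-9_.:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Assumption: only the first bytes of the page are scanned for <meta>.
    private static final int SNIFF_BYTES = 2000;

    /** Returns the page charset, or empty if it cannot be determined. */
    static Optional<Charset> detect(byte[] raw) {
        // Step 1: look for charset meta information in the page head.
        // ISO-8859-1 maps every byte to a char, so ASCII markup survives.
        int n = Math.min(raw.length, SNIFF_BYTES);
        String head = new String(raw, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Optional.of(Charset.forName(m.group(1)));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: fall through to step 2.
            }
        }
        // Step 2: stand-in for a detector such as jchardet.
        // Accept UTF-8 only if the whole page decodes without errors.
        if (isStrictUtf8(raw)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        // Step 3: no default encoding; the caller should discard the page.
        return Optional.empty();
    }

    static boolean isStrictUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Decodes the page, or returns empty so the caller can discard it. */
    static Optional<String> decode(byte[] raw) {
        return detect(raw).map(cs -> new String(raw, cs));
    }
}
```

A caller would keep a page only when CharsetSniffer.decode(bytes) is
present. Note that pure-ASCII pages pass the UTF-8 check, which is harmless
because ASCII decodes identically under UTF-8.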


On Nov 8, 2007 4:09 PM, crossafire <crossany@gmail.com> wrote:

>
> I just crawled some Chinese websites that use GB2312 as the meta charset.
> Crawling and searching work fine, but when I view the cached page, its
> encoding is wrong.
> So I looked at cached.jsp in my Tomcat and tried editing it like this:
>
> if (encoding != null) {
>      try {
>        content = new String(bean.getContent(details), encoding);
>      }
>      catch (UnsupportedEncodingException e) {
>        // fallback to windows-1252
>        content = new String(bean.getContent(details), "windows-1252");
>      }
>    }
>    else
>      content = new String(bean.getContent(details), "gb2312");
>  }
>
> With that change the cached page displays correctly, but it only works
> for pages that use GB2312,
> so it's not a good idea for me.
> I want to get the encoding of the cached page,
> so I tried to debug cached.jsp like this:
> String encoding = (String) metaData.get("CharEncodingForConversion");
> System.out.print(encoding);
> The debug output shows that encoding is null.
>
> Metadata metaData = bean.getParseData(details).getContentMeta();
> String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);
> System.out.print(contentType);
>
> The debug output only shows that contentType is text/html.
>
> I hope somebody knows how to get the encoding of the cached page.
>
> Thanks
>
>
>
> --
> View this message in context:
> http://www.nabble.com/How-can-I-know-the-Cached-Web-Charset-tf4769632.html#a13642889
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
