nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: Fetched pages has no content
Date Fri, 15 Jul 2011 13:41:49 GMT
What parser are you using? What does
bin/nutch org.apache.nutch.parse.ParserChecker
say for that URL? Here it outputs the content fine with parse-tika enabled.

On Friday 15 July 2011 15:04:55 Anders Rask wrote:
> Hi!
> 
> We are using Nutch to crawl a bunch of websites and index them to Solr. At
> the moment we are in the process of upgrading from Nutch 1.1 to Nutch 1.3
> and at the same time going from one server to two servers.
> 
> Unfortunately we are stuck with a problem which we haven't seen in the old
> environment. Several of the pages that we are fetching contain no content
> when they are stored in the segment. The following is an excerpt from
> "readseg" on a segment containing such a page:
> 
> ----
> 
> Recno:: 5
> URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> 
> Content::
> Version: -1
> url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> contentType: text/html
> metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
> nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
> Connection=close Content-Type=text/html Server=Apache
> Content:
> 
> ----
> 
> The fetch logs say nothing unusual about retrieving this page:
> 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher: fetching
> http://www.uu.se/news/news_item.php?typ=pm&id=1381
> 
> There seems to be nothing strange about the page itself and a very similar
> page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is crawled and
> indexed without any problems.
> 
> Anyone have any ideas about what might be wrong here?
> 
> 
> Best regards,
> --Anders Rask
> www.findwise.com
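
The readseg excerpt quoted above reports Content-Length=7195 in the metadata but shows nothing after "Content:". To see how many records in a segment share that symptom, something along these lines might help. It is only a rough sketch: it assumes a plain-text readseg dump laid out like the excerpt (records starting with "Recno::", section headers ending in "::", the raw content following a line that reads "Content:"), and the function name find_empty_records is made up for illustration.

```python
import re


def find_empty_records(dump_text):
    """Return URLs of records whose headers report a non-zero
    Content-Length but whose "Content:" field is blank.

    Assumes the dump layout shown in the quoted excerpt; this is a
    hypothetical helper, not part of Nutch.
    """
    suspicious = []
    # Split on "Recno::" marker lines; the first chunk precedes any record.
    records = re.split(r"^Recno::.*$", dump_text, flags=re.MULTILINE)[1:]
    for record in records:
        url_match = re.search(r"^URL::\s*(\S+)", record, flags=re.MULTILINE)
        length_match = re.search(r"Content-Length=(\d+)", record)
        # "Content:" (single colon) introduces the raw content; capture
        # everything up to the next "Something::" header or end of record.
        body_match = re.search(
            r"^Content:[ \t]*\n(.*?)(?=^\w+::|\Z)",
            record,
            flags=re.MULTILINE | re.DOTALL,
        )
        if not (url_match and length_match and body_match):
            continue
        if int(length_match.group(1)) > 0 and body_match.group(1).strip() == "":
            suspicious.append(url_match.group(1))
    return suspicious
```

Reading the dump file into a string and passing it to find_empty_records gives a list of URLs that can then be checked one by one with ParserChecker. Note that a page whose own text contains a line beginning with a word followed by "::" would be cut short by this pattern, so treat the result as a hint rather than proof.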

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
