nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: Fetched pages has no content
Date Mon, 18 Jul 2011 09:46:00 GMT
Judging from the segment, those URLs are fetched and parsed. I think maybe
some HTML parse APIs have changed between your 1.1 and 1.3 versions. If
ParserChecker shows the same issue then it's most likely a parse plugin
problem for the new version. Can you check?

> Hi,
> 
> If you have a look at your regex-urlfilter.txt it will by default be
> rejecting ? in the URL. Please test with that line edited (or commented
> out) and see if the problem goes away.
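
To illustrate the filter suggestion above: a minimal Python sketch of how a first-match-wins rule list would treat these seeds. It assumes the stock regex-urlfilter.txt contains the lines `-[?*!@=]` and `+.` (check your own copy). The names DEFAULT_RULES, EDITED_RULES and accepts are made up for this example and are not Nutch APIs.

```python
import re

# Hypothetical rule list modelled on an assumed default regex-urlfilter.txt:
# each entry is (sign, pattern); '-' rejects, '+' accepts.
DEFAULT_RULES = [
    ("-", r"[?*!@=]"),  # skip URLs containing characters typical of queries
    ("+", r"."),        # accept anything else
]

# The same rules with the query-character line removed (i.e. commented out).
EDITED_RULES = [rule for rule in DEFAULT_RULES if rule[1] != r"[?*!@=]"]


def accepts(url, rules=DEFAULT_RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected


if __name__ == "__main__":
    url = "http://www.uu.se/news/news_item.php?typ=pm&id=1381"
    print(accepts(url))                # default rules
    print(accepts(url, EDITED_RULES))  # rule commented out
    print(accepts("http://www.uu.se/"))
```

With the assumed default rules every news_item.php seed would be dropped before fetching, not stored with empty content. The readseg excerpt further down shows response headers for the ?typ=pm URL, so that URL apparently did pass the filters in this setup.
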
> 
> On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask <anrask@gmail.com> wrote:
> > Hi Markus!
> > 
> > We are using a custom parser, but I don't think that the problem is in
> > the parsing. I got the same problem when trying the ParserChecker. I
> > also tried the following:
> > 
> > I injected the following seeds:
> > 
> > http://www.uu.se/news/news_item.php?id=1423&typ=pm
> > http://www.uu.se/news/news_item.php?id=1421&typ=pm
> > http://www.uu.se/news/news_item.php?id=1489&typ=artikel
> > http://www.uu.se/news/news_item.php?id=1407&typ=pm
> > http://www.uu.se/news/news_item.php?id=1234&typ=artikel
> > http://www.uu.se/news/news_item.php?id=1233&typ=artikel
> > http://www.uu.se/news/news_item.php?id=1180&typ=artikel
> > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > http://www.uu.se/
> > 
> > I then generated a segment, fetched it, and ran readseg with
> > -noparse, -noparsedata and -noparsetext.
> > 
> > I have attached the readseg dump and it shows no content for:
> > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > 
> > Can the problem somehow be in the configurations for the fetcher?
> > 
> > 
> > Best regards,
> > --Anders Rask
> > www.findwise.com
> > 
> > 
> > 2011/7/15 Markus Jelsma <markus.jelsma@openindex.io>
> > 
> >> What parser are you using? What does bin/nutch
> >> org.apache.nutch.parse.ParserChecker say? Here it outputs the content
> >> fine with parse-tika enabled.
> >> 
> >> On Friday 15 July 2011 15:04:55 Anders Rask wrote:
> >> > Hi!
> >> > 
> >> > We are using Nutch to crawl a bunch of websites and index them to
> >> > Solr. At the moment we are in the process of upgrading from Nutch 1.1
> >> > to Nutch 1.3 and at the same time going from one server to two
> >> > servers.
> >> > 
> >> > Unfortunately we are stuck with a problem which we haven't seen in the
> >> > old environment. Several of the pages that we are fetching contain no
> >> > content when they are stored in the segment. The following is an
> >> > excerpt from "readseg" on a segment containing such a page:
> >> > 
> >> > ----
> >> > 
> >> > Recno:: 5
> >> > URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > 
> >> > Content::
> >> > Version: -1
> >> > url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > contentType: text/html
> >> > metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
> >> > nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
> >> > Connection=close Content-Type=text/html Server=Apache
> >> > Content:
> >> > 
> >> > ----
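
One way to list every affected record in a dump like the excerpt above is sketched below in Python. It assumes the plain-text layout shown in that excerpt: a "Recno::" line starts a record, a lone "Content:" line precedes the body, and section headers end in "::". The helper empty_content_records is hypothetical and not part of Nutch.

```python
import re


def empty_content_records(dump_text):
    """Yield (url, declared_length) for records whose Content: body is blank
    although the metadata declares a positive Content-Length."""
    # Records are assumed to start with a "Recno:: N" line, as in the excerpt.
    for record in re.split(r"(?m)^Recno:: \d+\s*$", dump_text)[1:]:
        url = re.search(r"(?m)^URL:: (\S+)", record)
        length = re.search(r"Content-Length=(\d+)", record)
        # Body: text after the lone "Content:" line, up to the next
        # "Name::" section header or the end of the record.
        body = re.search(r"(?ms)^Content:[ \t]*$(.*?)(?=^\w+::|\Z)", record)
        if not (url and length and body):
            continue  # record does not follow the assumed layout
        if int(length.group(1)) > 0 and not body.group(1).strip():
            yield url.group(1), int(length.group(1))
```

Calling list(empty_content_records(text)) on the contents of a dump file would return the URLs to look at first. The layout heuristics are a guess from the single record quoted here and may need adjusting for other dumps.
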
> >> > 
> >> > The fetch logs say nothing unusual about retrieving this page:
> >> > 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher:
> >> > fetching http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > 
> >> > There seems to be nothing strange about the page itself and a very
> >> > similar page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is
> >> > crawled and indexed without any problems.
> >> > 
> >> > Anyone have any ideas about what might be wrong here?
> >> > 
> >> > 
> >> > Best regards,
> >> > --Anders Rask
> >> > www.findwise.com
> >> 
> >> --
> >> Markus Jelsma - CTO - Openindex
> >> http://www.linkedin.com/in/markus17
> >> 050-8536620 / 06-50258350
