nutch-user mailing list archives

From arijit <pari...@yahoo.com>
Subject Re: parsechecker fetches url but fetcher fails - happens only in nutch 1.5
Date Wed, 04 Jul 2012 10:12:55 GMT
Hi,
   Ken was right and my assumption was wrong - the fetcher failure is NOT because of
the robots.txt warning. It was happening because my seed.txt listed the seed URL as
http://en.wikipedia.org/wiki/Districts_of_India/ - with a trailing separator. Once I took
that separator out, the fetch and crawl of outlinks went fine!
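   In case it helps anyone else hitting this: a quick way to see what the trailing separator
does is to compare the HTTP status of the two forms of the URL. Below is a minimal standalone
check with plain java.net.HttpURLConnection - just an illustration I put together, nothing
Nutch-specific (the exact codes you see will depend on the server, e.g. redirects):

import java.net.HttpURLConnection;
import java.net.URL;

public class StatusCheck {
    static int status(String url) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestMethod("HEAD");
        con.setInstanceFollowRedirects(false);  // show the raw status, not the redirect target
        return con.getResponseCode();
    }

    public static void main(String[] args) throws Exception {
        // The seed as I originally had it, with the trailing separator
        // (the segment dump showed _pst_: notfound(14) for this form)
        System.out.println(status("http://en.wikipedia.org/wiki/Districts_of_India/"));
        // The corrected seed, without the trailing separator
        System.out.println(status("http://en.wikipedia.org/wiki/Districts_of_India"));
    }
}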
   But I was not destined to have all of the cake in one go. I upgraded to Nutch 1.5 and
tried running the same crawl, and it failed. Looking at hadoop.log shows that the robots.txt
fetch is now failing with:


======================== hadoop.log snippet ============================================================


   2012-07-04 15:12:40,833 INFO  api.RobotRulesParser - Couldn't get robots.txt for http://en.wikipedia.org/wiki/Districts_of_India:
java.io.IOException: unzipBestEffort returned null
2012-07-04 15:12:41,224 INFO  fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2012-07-04 15:12:41,678 ERROR http.Http - Failed to get protocol output
java.io.IOException: unzipBestEffort returned null
    at org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:319)
    at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:162)
    at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:142)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:692)
2012-07-04 15:12:41,680 INFO  fetcher.Fetcher - fetch of http://en.wikipedia.org/wiki/Districts_of_India
failed with: java.io.IOException: unzipBestEffort returned null


==================== hadoop.log snippet ends ==============================================================

And therefore, fetching of the Wikipedia URL bails out.
I did find that there was a patch for this type of issue in 1.4 - https://issues.apache.org/jira/browse/NUTCH-1089
(though the URL is not compressed). However, that change is already in 1.5, so it cannot
be the source of this problem.
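   For anyone who wants to poke at this outside of Nutch: the stack trace shows the message
coming from HttpBase.processGzipEncoded(), which reports "unzipBestEffort returned null" when
the gzipped response body cannot be inflated. A rough standalone sketch of that kind of
best-effort decode (my own approximation, not Nutch's actual helper) is below - feeding it the
raw bytes the fetcher received should show whether the body is really valid gzip:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

public class UnzipCheck {
    // Try to gunzip a response body; return the inflated bytes,
    // or null if the bytes cannot be gunzipped (truncated body, or not gzip at all).
    static byte[] unzipBestEffort(byte[] compressed) {
        try {
            GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            return null;
        }
    }
}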
 Any help is much appreciated.

-Arijit

 


________________________________
 From: arijit <parijip@yahoo.com>
To: "user@nutch.apache.org" <user@nutch.apache.org> 
Sent: Tuesday, July 3, 2012 5:28 PM
Subject: Re: parsechecker fetches url but fetcher fails
 

Hi,
   I did some more digging around - and noticed this in the output from readseg:

Recno:: 0
URL:: http://en.wikipedia.org/wiki/Districts_of_India/

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jul 03 16:52:09 IST 2012
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1341314531887

CrawlDatum::
Version: 7
Status: 37 (fetch_gone)
Fetch time: Tue Jul 03 16:52:17 IST 2012
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1341314531887 _pst_: notfound(14), lastModified=0:
 http://en.wikipedia.org/wiki/Districts_of_India/

Note the _pst_ : notfound(14)!!!

Does this mean that the URL returns a 404 status on fetch, and therefore the fetcher is unable to
carry on?
That would be strange, as parsechecker seems to have no trouble fetching this URL and parsing its
links into outlinks.
So it might be that the failure to parse the robots.txt is NOT the issue - the issue is that the
fetcher stops because it does not get anything back when trying to fetch the contents of the URL: http://en.wikipedia.org/wiki/Districts_of_India/

Appreciate all the help that has been coming my way.
-Arijit



________________________________
 From: Ken Krugler <kkrugler_lists@transpac.com>
To: user@nutch.apache.org 
Sent: Monday, July 2, 2012 10:56 PM
Subject: Re: parsechecker fetches url but fetcher fails
 



On Jul 2, 2012, at 5:00am, arijit wrote:

> Hi,
>    Since learning that nutch will be unable to crawl the javascript function calls in
> href, I started looking for other alternatives. I decided to crawl http://en.wikipedia.org/wiki/Districts_of_India.
>    I first tried injecting this URL and following the step-by-step approach up to the fetcher
> - when I realized nutch did not fetch anything from this website. I looked into logs/hadoop.log
> and found the following 3 lines - which I believe could be saying that nutch is unable to
> parse the robots.txt on the website and therefore the fetcher stopped?
>
>    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules-
> can't decode path: /wiki/Wikipedia%3Mediation_Committee/
>    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode
> path: /wiki/Wikipedia_talk%3Mediation_Committee/
>    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules-
> can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
The issue is that the Wikipedia robots.txt file contains malformed URLs - these three are
missing the 'A' from the %3A sequence.
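As a quick illustration (plain JDK URLDecoder here, not the actual RobotRulesParser code path),
a %3 escape that is missing its second hex digit is exactly the kind of thing that fails to decode:

import java.net.URLDecoder;

public class DecodeCheck {
    public static void main(String[] args) throws Exception {
        // A well-formed escape decodes fine: %3A -> ':'
        System.out.println(URLDecoder.decode("/wiki/Wikipedia%3AMediation_Committee/", "UTF-8"));

        // The path from Wikipedia's robots.txt is missing the 'A', so "%3M"
        // is not a valid hex escape and decoding throws.
        try {
            URLDecoder.decode("/wiki/Wikipedia%3Mediation_Committee/", "UTF-8");
        } catch (IllegalArgumentException e) {
            System.out.println("can't decode path: " + e.getMessage());
        }
    }
}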


>    I tried checking the URL using parsechecker and no issues there! I think it means that
> the robots.txt is malformed for this website, which is preventing the fetcher from fetching anything.
> Is there a way to get around this problem, as parsechecker seems to go on its merry way parsing?

This is an example of where having Nutch use the crawler-commons robots.txt parser would help :)

https://issues.apache.org/jira/browse/NUTCH-1031

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr