From: arijit
Reply-To: arijit
To: "user@nutch.apache.org"
Date: Wed, 4 Jul 2012 03:12:55 -0700 (PDT)
Subject: Re: parsechecker fetches url but fetcher fails - happens only in nutch 1.5

Hi,
   Ken was right and my assumption was wrong - the issue of the fetcher failing is NOT because of the robots.txt warning. It was happening because I had seed.txt mentioning the seed URL as http://en.wikipedia.org/wiki/Districts_of_India/ - with a trailing separator. Once I took that separator out, the fetch and crawl of the outlinks went fine!
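For anyone hitting the same thing: a quick way to confirm what the server actually returns for the two forms of the URL, without involving Nutch at all, is a bare HttpURLConnection check. This is only a rough sketch - the class name is made up, and the expectation that the trailing-separator form comes back as a 404 is just my reading of the readseg output quoted further down in this thread.

==================== quick URL check (sketch) ====================

import java.net.HttpURLConnection;
import java.net.URL;

// Prints the raw HTTP status for the seed URL with and without the
// trailing separator. Status expectations are an assumption, not a fact
// from Nutch itself.
public class CheckSeedUrl {
    public static void main(String[] args) throws Exception {
        String[] urls = {
            "http://en.wikipedia.org/wiki/Districts_of_India",
            "http://en.wikipedia.org/wiki/Districts_of_India/"   // trailing separator
        };
        for (String u : urls) {
            HttpURLConnection conn = (HttpURLConnection) new URL(u).openConnection();
            conn.setRequestMethod("GET");
            conn.setInstanceFollowRedirects(false);   // show the raw status, not the redirect target
            System.out.println(conn.getResponseCode() + "  " + u);
            conn.disconnect();
        }
    }
}

==================== sketch ends ====================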
   But I was not destined to have all of the cake in one go. I upgraded to Nutch 1.5, tried running the same crawl, and it failed. Looking at hadoop.log shows that the robots.txt fetch is now returning:

==================== hadoop.log snippet ====================

2012-07-04 15:12:40,833 INFO  api.RobotRulesParser - Couldn't get robots.txt for http://en.wikipedia.org/wiki/Districts_of_India: java.io.IOException: unzipBestEffort returned null
2012-07-04 15:12:41,224 INFO  fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2012-07-04 15:12:41,678 ERROR http.Http - Failed to get protocol output
java.io.IOException: unzipBestEffort returned null
    at org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:319)
    at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:162)
    at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:142)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:692)
2012-07-04 15:12:41,680 INFO  fetcher.Fetcher - fetch of http://en.wikipedia.org/wiki/Districts_of_India failed with: java.io.IOException: unzipBestEffort returned null

==================== hadoop.log snippet ends ====================

And therefore, fetching of the wikipedia URL bails out.
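To narrow down whether this is the server's compression or something in the 1.5 protocol-http plugin, one thing that might help is fetching the same page outside Nutch with gzip requested and running the body through a plain GZIPInputStream. Again only a sketch - the class name is arbitrary, and I'm assuming the stock java.util.zip classes behave comparably to Nutch's unzipBestEffort for a well-formed response:

==================== gzip check (sketch) ====================

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

// Requests the page with "Accept-Encoding: gzip" (as the Nutch http
// protocol does) and checks whether the body decodes cleanly.
public class GzipCheck {
    public static void main(String[] args) throws Exception {
        String url = "http://en.wikipedia.org/wiki/Districts_of_India";
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip");
        System.out.println("Content-Encoding: " + conn.getContentEncoding());

        // Read the raw (possibly compressed) response body.
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                raw.write(buf, 0, n);
            }
        }
        System.out.println("Raw bytes: " + raw.size());

        // If the server actually sent gzip, try to decompress it.
        if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
            try (GZIPInputStream gz = new GZIPInputStream(
                    new ByteArrayInputStream(raw.toByteArray()))) {
                byte[] buf = new byte[8192];
                int total = 0, n;
                while ((n = gz.read(buf)) != -1) {
                    total += n;
                }
                System.out.println("Decompressed bytes: " + total);
            }
        }
    }
}

==================== sketch ends ====================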
I did check that there was a patch for this type of issue in 1.4 - https://issues.apache.org/jira/browse/NUTCH-1089 (though the URL here is not compressed). However, that change is already in 1.5, so it cannot be the source of this problem.
Any help is much appreciated.

-Arijit


________________________________
From: arijit
To: "user@nutch.apache.org"
Sent: Tuesday, July 3, 2012 5:28 PM
Subject: Re: parsechecker fetches url but fetcher fails


Hi,
   I did some more digging around and noticed this in the output from readseg:

Recno:: 0
URL:: http://en.wikipedia.org/wiki/Districts_of_India/

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jul 03 16:52:09 IST 2012
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1341314531887

CrawlDatum::
Version: 7
Status: 37 (fetch_gone)
Fetch time: Tue Jul 03 16:52:17 IST 2012
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1341314531887  _pst_: notfound(14), lastModified=0:
  http://en.wikipedia.org/wiki/Districts_of_India/

Note the _pst_: notfound(14)!

Does this mean that on fetch the URL returns a 404 status, and therefore the fetch is unable to carry on? That would be strange, as parsechecker seems to be fine fetching this URL and parsing the links in it into outlinks.
So it might be that the failure to parse robots.txt is NOT the issue - the issue is that the fetcher stops because it does not get anything back when trying to fetch the contents of the URL http://en.wikipedia.org/wiki/Districts_of_India/

Appreciate all the help that has been coming my way.
-Arijit


________________________________
From: Ken Krugler
To: user@nutch.apache.org
Sent: Monday, July 2, 2012 10:56 PM
Subject: Re: parsechecker fetches url but fetcher fails


On Jul 2, 2012, at 5:00am, arijit wrote:

> Hi,
>    Since learning that nutch will be unable to crawl the javascript function calls in href, I started looking for other alternatives. I decided to crawl http://en.wikipedia.org/wiki/Districts_of_India.
>    I first tried injecting this URL and following the step-by-step approach up to the fetcher - when I realized nutch did not fetch anything from this website. I looked into logs/hadoop.log and found the following 3 lines - which I believe could be saying that nutch is unable to parse the robots.txt on the website and therefore the fetcher stopped?
>
>    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
>    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
>    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/

The issue is that the Wikipedia robots.txt file contains malformed URLs - these three are missing the 'A' from the %3A sequence.
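Ken's point is easy to reproduce with a tiny decoder test. I'm assuming the "can't decode path" warning comes from a java.net.URLDecoder call (or something equivalent) inside RobotRulesParser - the well-formed %3A path decodes fine, while the truncated %3M one throws:

==================== decoder test (sketch) ====================

import java.net.URLDecoder;

// "%3A" is a valid escape; "%3M" is not, because 'M' is not a hex digit,
// so URLDecoder throws IllegalArgumentException on the second path.
public class RobotsPathDecode {
    public static void main(String[] args) throws Exception {
        String[] paths = {
            "/wiki/Wikipedia%3AMediation_Committee/",   // well-formed
            "/wiki/Wikipedia%3Mediation_Committee/"     // as found in Wikipedia's robots.txt
        };
        for (String p : paths) {
            try {
                System.out.println(p + " -> " + URLDecoder.decode(p, "UTF-8"));
            } catch (IllegalArgumentException e) {
                System.out.println(p + " -> " + e);
            }
        }
    }
}

==================== sketch ends ====================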
>    I tried checking the URL using parsechecker and no issues there! I think it means that the robots.txt is malformed for this website, which is preventing the fetcher from fetching anything. Is there a way to get around this problem, as parsechecker seems to go on its merry way parsing?

This is an example of where having Nutch use the crawler-commons robots.txt parser would help :)

https://issues.apache.org/jira/browse/NUTCH-1031

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
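For anyone curious what that would look like, below is a rough sketch of feeding the same robots.txt to crawler-commons' SimpleRobotRulesParser, which skips entries it cannot make sense of instead of tripping over them. The method names reflect my understanding of the crawler-commons API, so double-check them against whatever version NUTCH-1031 ends up using:

==================== crawler-commons sketch ====================

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

// Fetches robots.txt and asks crawler-commons whether the article URL is
// allowed. Class name and the "Nutch-Test" agent string are placeholders.
public class CrawlerCommonsRobotsCheck {
    public static void main(String[] args) throws Exception {
        String robotsUrl = "http://en.wikipedia.org/robots.txt";

        // Read the raw robots.txt bytes.
        ByteArrayOutputStream content = new ByteArrayOutputStream();
        try (InputStream in = new URL(robotsUrl).openStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                content.write(buf, 0, n);
            }
        }

        // Parse leniently: malformed entries are skipped rather than
        // aborting the whole file.
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
            robotsUrl, content.toByteArray(), "text/plain", "Nutch-Test");

        System.out.println("Allowed: "
            + rules.isAllowed("http://en.wikipedia.org/wiki/Districts_of_India"));
    }
}

==================== sketch ends ====================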