From: arijit
Reply-To: arijit
To: "user@nutch.apache.org"
Date: Wed, 4 Jul 2012 03:12:55 -0700 (PDT)
Subject: Re: parsechecker fetches url but fetcher fails - happens only in nutch 1.5

Hi,
   Ken was right and my assumption was wrong - the issue of the fetcher failing is NOT because of the robots.txt warning. It was happening because I had seed.txt mentioning the seed URL as http://en.wikipedia.org/wiki/Districts_of_India/ - with a trailing separator. Once I took that separator out, the fetch and crawl of the outlinks went fine!
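For anyone hitting the same thing: a quick way to confirm what the server actually returns for the two forms of the URL, without involving Nutch at all, is a bare HttpURLConnection check. This is only a rough sketch - the class name is made up, and the expectation that the trailing-separator form comes back as a 404 is just my reading of the readseg output quoted further down in this thread.

==================== quick URL check (sketch) ====================

import java.net.HttpURLConnection;
import java.net.URL;

// Prints the raw HTTP status for the seed URL with and without the
// trailing separator. Status expectations are an assumption, not a fact
// from Nutch itself.
public class CheckSeedUrl {
    public static void main(String[] args) throws Exception {
        String[] urls = {
            "http://en.wikipedia.org/wiki/Districts_of_India",
            "http://en.wikipedia.org/wiki/Districts_of_India/"   // trailing separator
        };
        for (String u : urls) {
            HttpURLConnection conn = (HttpURLConnection) new URL(u).openConnection();
            conn.setRequestMethod("GET");
            conn.setInstanceFollowRedirects(false);   // show the raw status, not the redirect target
            System.out.println(conn.getResponseCode() + "  " + u);
            conn.disconnect();
        }
    }
}

==================== sketch ends ====================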
   But I was not destined to have all of the cake in one go. I upgraded to Nutch 1.5, tried running the same crawl, and it failed. Looking at hadoop.log shows that the robots.txt fetch is now returning:

==================== hadoop.log snippet ====================

2012-07-04 15:12:40,833 INFO  api.RobotRulesParser - Couldn't get robots.txt for http://en.wikipedia.org/wiki/Districts_of_India: java.io.IOException: unzipBestEffort returned null
2012-07-04 15:12:41,224 INFO  fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2012-07-04 15:12:41,678 ERROR http.Http - Failed to get protocol output
java.io.IOException: unzipBestEffort returned null
    at org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:319)
    at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:162)
    at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:142)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:692)
2012-07-04 15:12:41,680 INFO  fetcher.Fetcher - fetch of http://en.wikipedia.org/wiki/Districts_of_India failed with: java.io.IOException: unzipBestEffort returned null

==================== hadoop.log snippet ends ====================

And therefore, fetching of the wikipedia URL bails out.
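To narrow down whether this is the server's compression or something in the 1.5 protocol-http plugin, one thing that might help is fetching the same page outside Nutch with gzip requested and running the body through a plain GZIPInputStream. Again only a sketch - the class name is arbitrary, and I'm assuming the stock java.util.zip classes behave comparably to Nutch's unzipBestEffort for a well-formed response:

==================== gzip check (sketch) ====================

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

// Requests the page with "Accept-Encoding: gzip" (as the Nutch http
// protocol does) and checks whether the body decodes cleanly.
public class GzipCheck {
    public static void main(String[] args) throws Exception {
        String url = "http://en.wikipedia.org/wiki/Districts_of_India";
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip");
        System.out.println("Content-Encoding: " + conn.getContentEncoding());

        // Read the raw (possibly compressed) response body.
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                raw.write(buf, 0, n);
            }
        }
        System.out.println("Raw bytes: " + raw.size());

        // If the server actually sent gzip, try to decompress it.
        if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
            try (GZIPInputStream gz = new GZIPInputStream(
                    new ByteArrayInputStream(raw.toByteArray()))) {
                byte[] buf = new byte[8192];
                int total = 0, n;
                while ((n = gz.read(buf)) != -1) {
                    total += n;
                }
                System.out.println("Decompressed bytes: " + total);
            }
        }
    }
}

==================== sketch ends ====================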
I did check that there was a patch for this type of issue in 1.4 - https://issues.apache.org/jira/browse/NUTCH-1089 (though the URL here is not compressed). However, that change is already in 1.5, so it cannot be the source of this problem.
Any help is much appreciated.

-Arijit


________________________________
From: arijit
To: "user@nutch.apache.org"
Sent: Tuesday, July 3, 2012 5:28 PM
Subject: Re: parsechecker fetches url but fetcher fails


Hi,
   I did some more digging around and noticed this in the output from readseg:

Recno:: 0
URL:: http://en.wikipedia.org/wiki/Districts_of_India/

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jul 03 16:52:09 IST 2012
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1341314531887

CrawlDatum::
Version: 7
Status: 37 (fetch_gone)
Fetch time: Tue Jul 03 16:52:17 IST 2012
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1341314531887  _pst_: notfound(14), lastModified=0:
  http://en.wikipedia.org/wiki/Districts_of_India/

Note the _pst_: notfound(14)!

Does this mean that on fetch the URL returns a 404 status, and therefore the fetch is unable to carry on? That would be strange, as parsechecker seems to be fine fetching this URL and parsing the links in it into outlinks.
So it might be that the failure to parse robots.txt is NOT the issue - the issue is that the fetcher stops because it does not get anything back when trying to fetch the contents of the URL http://en.wikipedia.org/wiki/Districts_of_India/

Appreciate all the help that has been coming my way.
-Arijit


________________________________
From: Ken Krugler
To: user@nutch.apache.org
Sent: Monday, July 2, 2012 10:56 PM
Subject: Re: parsechecker fetches url but fetcher fails


On Jul 2, 2012, at 5:00am, arijit wrote:

> Hi,
>    Since learning that nutch will be unable to crawl the javascript function calls in href, I started looking for other alternatives. I decided to crawl http://en.wikipedia.org/wiki/Districts_of_India.
>    I first tried injecting this URL and following the step-by-step approach up to the fetcher - when I realized nutch did not fetch anything from this website. I looked into logs/hadoop.log and found the following 3 lines - which I believe could be saying that nutch is unable to parse the robots.txt on the website and therefore the fetcher stopped?
>
>    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
>    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
>    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/

The issue is that the Wikipedia robots.txt file contains malformed URLs - these three are missing the 'A' from the %3A sequence.
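Ken's point is easy to reproduce with a tiny decoder test. I'm assuming the "can't decode path" warning comes from a java.net.URLDecoder call (or something equivalent) inside RobotRulesParser - the well-formed %3A path decodes fine, while the truncated %3M one throws:

==================== decoder test (sketch) ====================

import java.net.URLDecoder;

// "%3A" is a valid escape; "%3M" is not, because 'M' is not a hex digit,
// so URLDecoder throws IllegalArgumentException on the second path.
public class RobotsPathDecode {
    public static void main(String[] args) throws Exception {
        String[] paths = {
            "/wiki/Wikipedia%3AMediation_Committee/",   // well-formed
            "/wiki/Wikipedia%3Mediation_Committee/"     // as found in Wikipedia's robots.txt
        };
        for (String p : paths) {
            try {
                System.out.println(p + " -> " + URLDecoder.decode(p, "UTF-8"));
            } catch (IllegalArgumentException e) {
                System.out.println(p + " -> " + e);
            }
        }
    }
}

==================== sketch ends ====================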
>    I tried checking the URL using parsechecker and no issues there! I think it means that the robots.txt is malformed for this website, which is preventing the fetcher from fetching anything. Is there a way to get around this problem, as parsechecker seems to go on its merry way parsing?

This is an example of where having Nutch use the crawler-commons robots.txt parser would help :)

https://issues.apache.org/jira/browse/NUTCH-1031

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
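For anyone curious what that would look like, below is a rough sketch of feeding the same robots.txt to crawler-commons' SimpleRobotRulesParser, which skips entries it cannot make sense of instead of tripping over them. The method names reflect my understanding of the crawler-commons API, so double-check them against whatever version NUTCH-1031 ends up using:

==================== crawler-commons sketch ====================

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

// Fetches robots.txt and asks crawler-commons whether the article URL is
// allowed. Class name and the "Nutch-Test" agent string are placeholders.
public class CrawlerCommonsRobotsCheck {
    public static void main(String[] args) throws Exception {
        String robotsUrl = "http://en.wikipedia.org/robots.txt";

        // Read the raw robots.txt bytes.
        ByteArrayOutputStream content = new ByteArrayOutputStream();
        try (InputStream in = new URL(robotsUrl).openStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                content.write(buf, 0, n);
            }
        }

        // Parse leniently: malformed entries are skipped rather than
        // aborting the whole file.
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
            robotsUrl, content.toByteArray(), "text/plain", "Nutch-Test");

        System.out.println("Allowed: "
            + rules.isAllowed("http://en.wikipedia.org/wiki/Districts_of_India"));
    }
}

==================== sketch ends ====================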