From: "David M. Cole"
To: nutch-user@lucene.apache.org
Date: Wed, 9 Sep 2009 08:25:47 -0700
Subject: Re: Crawling Password Protected Pages
In-Reply-To: <5920c8cb0909090334xd2bca97hc79207ff985f1bdf@mail.gmail.com>

kranthi:

i would try removing the authscope tag from the httpclient-auth.xml.
though in my case i'm not going to an alternate port and you are, my
working file does not have an authscope tag.

if that doesn't help, since you are crawling an intranet, do you have
access to the http server's log? seeing that might help.

\dmc

At 4:04 PM +0530 9/9/09, kranthi reddy wrote:
>Hi all,
>
> I am trying to crawl password protected web pages present in our intranet .
>I don't know the reason why "*401 Authentication Required*" error creeps up.
>I have gone through the previous mails sent by others, but it is not getting
>resolved.
>
>Below are the configuration files i have modified as told in "
>http://wiki.apache.org/nutch/HttpAuthenticationSchemes"
>
>My Url file contains single url *"http://10.2.44.34:8088/xwiki/" *(This
>url is actually being redirect to "*
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=CDsTIqqN*")
>
>*"httpclient-auth.xml* "
>
>
>
>
>
>
>*"nutch-default.xml"*
>
>
> plugin.includes
> *protocol-httpclient|*
>urlfilter-regex|parse-(text|html|js|zip)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|
>
>summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>
>
>*OutPut Printed to Terminal*
>
>Fetcher: Your 'http.agent.name' value should be listed first in
>'http.robots.agents' property.
>Fetcher: starting
>Fetcher: segment: crawl/segments/20090909151219
>Fetcher: threads: 10
>QueueFeeder finished: total 1 records.
>fetching http://10.2.44.34:8088/xwiki/
>http.proxy.host = null
>http.proxy.port = 8080
>http.timeout = 10000
>http.content.limit = -1
>http.agent = iiith/Nutch-1.0 (kranthili2020@gmail.com)
>protocol.plugin.check.blocking = false
>protocol.plugin.check.robots = false
>*Credentials - username: superadmin; set as default for realm: ; scheme:*
>-finishing thread FetcherThread, activeThreads=1
>-finishing thread FetcherThread, activeThreads=1
>*Credentials - username: superadmin; set for AuthScope - host: 10.2.44.34;
>port: 8088; realm: ; scheme:
>Pre-configured credentials with scope - host: 10.2.44.34; port: 8088; found
>for url: http://10.2.44.34:8088/robots.txt
>url: http://10.2.44.34:8088/robots.txt; status code: 401; bytes received:
>6739; Content-Length: 6739
>Pre-configured credentials with scope - host: 10.2.44.34; port: 8088; found
>for url: http://10.2.44.34:8088/xwiki/
>url: http://10.2.44.34:8088/xwiki/; status code: 302; bytes received: 0;
>Content-Length: 0; Location: http://10.2.44.34:8088/xwiki/bin/view/Main/*
>-activeThreads=1, spinWaiting=1, fetchQueues.totalSize=1
>* queue: http://10.2.44.34
> maxThreads = 1
> inProgress = 0
> crawlDelay = 1000
> minCrawlDelay = 0
> nextFetchTime = 1252489344874
> now = 1252489344577
> 0. http://10.2.44.34:8088/xwiki/bin/view/Main/
>*fetching http://10.2.44.34:8088/xwiki/bin/view/Main/
>Pre-configured credentials with scope - host: 10.2.44.34; port: 8088; found
>for url: http://10.2.44.34:8088/xwiki/bin/view/Main/
>url: http://10.2.44.34:8088/xwiki/bin/view/Main/; status code: 302; bytes
>received: 0; Content-Length: 0; Location:
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX*
>-activeThreads=1, spinWaiting=1, fetchQueues.totalSize=1
>* queue: http://10.2.44.34
> maxThreads = 1
> inProgress = 0
> crawlDelay = 1000
> minCrawlDelay = 0
> nextFetchTime = 1252489345884
> now = 1252489345578
> 0. http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
>*fetching
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
>Pre-configured credentials with scope - host: 10.2.44.34; port: 8088; found
>for url:
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
>url: http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX;
>status code: 401; bytes received: 6739; Content-Length: 6739
>401 Authentication Required*
>-finishing thread FetcherThread, activeThreads=0
>-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>-activeThreads=0
>Fetcher: done
>
>
>
>*LOG FILE IS*
>
>
>2009-09-09 15:46:55,602 INFO fetcher.Fetcher - fetching
>http://10.2.44.34:8088/xwiki/
>2009-09-09 15:46:55,657 INFO fetcher.Fetcher - -finishing thread
>FetcherThread, activeThreads=1
>2009-09-09 15:46:55,657 INFO fetcher.Fetcher - -finishing thread
>FetcherThread, activeThreads=1
>2009-09-09 15:46:55,691 INFO httpclient.Http - http.proxy.host = null
>2009-09-09 15:46:55,691 INFO httpclient.Http - http.proxy.port = 8080
>2009-09-09 15:46:55,691 INFO httpclient.Http - http.timeout = 10000
>2009-09-09 15:46:55,691 INFO httpclient.Http - http.content.limit = -1
>2009-09-09 15:46:55,691 INFO httpclient.Http - http.agent = iiith/Nutch-1.0
>(kranthili2020@gmail.com)
>2009-09-09 15:46:55,691 INFO httpclient.Http -
>protocol.plugin.check.blocking = false
>2009-09-09 15:46:55,691 INFO httpclient.Http - protocol.plugin.check.robots
>= false
>2009-09-09 15:46:55,695 DEBUG httpclient.Http - Credentials - username:
>superadmin; set as default for realm: ; scheme:
>2009-09-09 15:46:55,697 DEBUG httpclient.Http - Credentials - username:
>superadmin; set for AuthScope - host: 10.2.44.34; port: 8088; realm: ;
>scheme:
>*2009-09-09 15:46:55,697 DEBUG httpclient.Http - Pre-configured credentials
>with scope - host: 10.2.44.34; port: 8088; found for url:
>http://10.2.44.34:8088/robots.txt
>2009-09-09 15:46:55,942 DEBUG httpclient.Http - url:
>http://10.2.44.34:8088/robots.txt; status code: 401; bytes received: 6739;
>Content-Length: 6739
>2009-09-09 15:46:55,943 DEBUG httpclient.Http - Pre-configured credentials
>with scope - host: 10.2.44.34; port: 8088; found for url:
>http://10.2.44.34:8088/xwiki/
>2009-09-09 15:46:55,946 INFO httpclient.HttpMethodDirector - Redirect
>requested but followRedirects is disabled
>2009-09-09 15:46:55,946 DEBUG httpclient.Http - url:
>http://10.2.44.34:8088/xwiki/; status code: 302; bytes received: 0;
>Content-Length: 0; Location: http://10.2.44.34:8088/xwiki/bin/view/Main/*
>2009-09-09 15:46:56,657 INFO fetcher.Fetcher - -activeThreads=1,
>spinWaiting=1, fetchQueues.totalSize=1
>2009-09-09 15:46:56,658 INFO fetcher.Fetcher - * queue: http://10.2.44.34
>2009-09-09 15:46:56,658 INFO fetcher.Fetcher - maxThreads = 1
>2009-09-09 15:46:56,658 INFO fetcher.Fetcher - inProgress = 0
>2009-09-09 15:46:56,658 INFO fetcher.Fetcher - crawlDelay = 1000
>2009-09-09 15:46:56,658 INFO fetcher.Fetcher - minCrawlDelay = 0
>2009-09-09 15:46:56,658 INFO fetcher.Fetcher - nextFetchTime =
>1252491417050
>2009-09-09 15:46:56,658 INFO fetcher.Fetcher - now =
>1252491416658
>2009-09-09 15:46:56,658 INFO fetcher.Fetcher - 0.
>http://10.2.44.34:8088/xwiki/bin/view/Main/
>2009-09-09 15:46:57,051 INFO fetcher.Fetcher - fetching
>http://10.2.44.34:8088/xwiki/bin/view/Main/
>*2009-09-09 15:46:57,051 DEBUG httpclient.Http - Pre-configured credentials
>with scope - host: 10.2.44.34; port: 8088; found for url:
>http://10.2.44.34:8088/xwiki/bin/view/Main/
>2009-09-09 15:46:57,056 INFO httpclient.HttpMethodDirector - Redirect
>requested but followRedirects is disabled
>2009-09-09 15:46:57,057 DEBUG httpclient.Http - url:
>http://10.2.44.34:8088/xwiki/bin/view/Main/; status code: 302; bytes
>received: 0; Content-Length: 0; Location:
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1*
>2009-09-09 15:46:57,658 INFO fetcher.Fetcher - -activeThreads=1,
>spinWaiting=1, fetchQueues.totalSize=1
>2009-09-09 15:46:57,659 INFO fetcher.Fetcher - * queue: http://10.2.44.34
>2009-09-09 15:46:57,659 INFO fetcher.Fetcher - maxThreads = 1
>2009-09-09 15:46:57,659 INFO fetcher.Fetcher - inProgress = 0
>2009-09-09 15:46:57,659 INFO fetcher.Fetcher - crawlDelay = 1000
>2009-09-09 15:46:57,659 INFO fetcher.Fetcher - minCrawlDelay = 0
>2009-09-09 15:46:57,659 INFO fetcher.Fetcher - nextFetchTime =
>1252491418057
>2009-09-09 15:46:57,659 INFO fetcher.Fetcher - now =
>1252491417659
>*2009-09-09 15:46:57,659 INFO fetcher.Fetcher - 0.
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
>2009-09-09 15:46:58,058 INFO fetcher.Fetcher - fetching
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
>2009-09-09 15:46:58,058 DEBUG httpclient.Http - Pre-configured credentials
>with scope - host: 10.2.44.34; port: 8088; found for url:
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
>2009-09-09 15:46:58,170 DEBUG httpclient.Http - url:
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1;
>status code: 401; bytes received: 6739; Content-Length: 6739
>2009-09-09 15:46:58,180 DEBUG httpclient.Http - 401 Authentication Required*
>2009-09-09 15:46:58,180 INFO fetcher.Fetcher - -finishing thread
>FetcherThread, activeThreads=0
>2009-09-09 15:46:58,659 INFO fetcher.Fetcher - -activeThreads=0,
>spinWaiting=0, fetchQueues.totalSize=0
>2009-09-09 15:46:58,659 INFO fetcher.Fetcher - -activeThreads=0
>
>
>Thank you in advance,
>
>bye,
>Kranthi Reddy. B

--
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
David M. Cole                                            dmc@colegroup.com
Editor & Publisher, NewsInc.                             V: (650) 557-2993
Consultant: The Cole Group                               F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
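
For reference, a minimal sketch of the httpclient-auth.xml change suggested at the top of this reply: the credentials stay as the default and the authscope element is dropped. The element names are the ones used on the Nutch HttpAuthenticationSchemes wiki page. The username is taken from the quoted log. The password value is a placeholder and an assumption, since the real file contents did not survive in the quoted message.

```xml
<?xml version="1.0"?>
<!-- Sketch only: credentials applied as the default, with no authscope element. -->
<auth-configuration>
  <!-- username comes from the quoted log; password is a placeholder (assumption) -->
  <credentials username="superadmin" password="CHANGE_ME">
    <!-- empty realm and scheme, matching "set as default for realm: ; scheme:" in the log -->
    <default/>
    <!-- the authscope element for host 10.2.44.34, port 8088 would have been here; removed for this test -->
  </credentials>
</auth-configuration>
```

If the fetch still ends in a 401 with this file, comparing against the http server's log, as suggested above, is the next step.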