nutch-user mailing list archives

From "David M. Cole" <...@colegroup.com>
Subject Re: Crawling Password Protected Pages
Date Wed, 09 Sep 2009 15:25:47 GMT
kranthi:

i would try removing the authscope tag from httpclient-auth.xml. my
working file does not have an authscope tag, though in my case i'm not
going to an alternate port and you are.
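
for illustration, a credentials file with only a default entry might look like the sketch below. this is an assumption-laden example, not a copy of my working file: the auth-configuration root element follows the layout described on the HttpAuthenticationSchemes wiki page, and the xyz values are the placeholders from the quoted message.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the same credentials as in the quoted message, with the
     authscope element removed so that they apply as the default.
     "xyz" is a placeholder, not a real username or password. -->
<auth-configuration>
  <credentials username="xyz" password="xyz">
    <default/>
  </credentials>
</auth-configuration>
```

with this layout, the credentials are offered for any host and port that asks for authentication, so it is best kept to an intranet crawl.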

if that doesn't help, since you are crawling an intranet, do you have 
access to the http server's log? seeing that might help.

\dmc


At 4:04 PM +0530 9/9/09, kranthi reddy wrote:
>Hi all,
>
>I am trying to crawl password-protected web pages on our intranet, but a
>"*401 Authentication Required*" error keeps coming up. I have gone through
>the earlier mails sent by others, but the problem is still not resolved.
>
>Below are the configuration files I have modified, as described in
>"http://wiki.apache.org/nutch/HttpAuthenticationSchemes"
>
>My URL file contains a single URL, *"http://10.2.44.34:8088/xwiki/"* (this
>URL is actually being redirected to
>"*http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=CDsTIqqN*")
>
>*"httpclient-auth.xml"*
>
>                  <credentials username="xyz" password="xyz">
>                  <default/>
>                  <authscope host="10.2.44.34" port="8088"/>
>                  </credentials>
>
>*"nutch-default.xml"*
>
>                  <property>
>                  <name>plugin.includes</name>
>                  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|zip)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>                  </property>
>
>*Output printed to terminal*
>
>Fetcher: Your 'http.agent.name' value should be listed first in
>'http.robots.agents' property.
>Fetcher: starting
>Fetcher: segment: crawl/segments/20090909151219
>Fetcher: threads: 10
>QueueFeeder finished: total 1 records.
>fetching http://10.2.44.34:8088/xwiki/
>http.proxy.host = null
>http.proxy.port = 8080
>http.timeout = 10000
>http.content.limit = -1
>http.agent = iiith/Nutch-1.0 (kranthili2020@gmail.com)
>protocol.plugin.check.blocking = false
>protocol.plugin.check.robots = false
>*Credentials - username: superadmin; set as default for realm: ; scheme:*
>-finishing thread FetcherThread, activeThreads=1
>-finishing thread FetcherThread, activeThreads=1
>*Credentials - username: superadmin; set for AuthScope - host: 10.2.44.34;
>port: 8088; realm: ; scheme:
>Pre-configured credentials with scope - host: 10.2.44.34; port: 8088; found
>for url: http://10.2.44.34:8088/robots.txt
>url: http://10.2.44.34:8088/robots.txt; status code: 401; bytes received:
>6739; Content-Length: 6739
>Pre-configured credentials with scope - host: 10.2.44.34; port: 8088; found
>for url: http://10.2.44.34:8088/xwiki/
>url: http://10.2.44.34:8088/xwiki/; status code: 302; bytes received: 0;
>Content-Length: 0; Location: http://10.2.44.34:8088/xwiki/bin/view/Main/*
>-activeThreads=1, spinWaiting=1, fetchQueues.totalSize=1
>* queue: http://10.2.44.34
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 1000
>   minCrawlDelay = 0
>   nextFetchTime = 1252489344874
>   now           = 1252489344577
>   0. http://10.2.44.34:8088/xwiki/bin/view/Main/
>*fetching http://10.2.44.34:8088/xwiki/bin/view/Main/
>Pre-configured credentials with scope - host: 10.2.44.34; port: 8088; found
>for url: http://10.2.44.34:8088/xwiki/bin/view/Main/
>url: http://10.2.44.34:8088/xwiki/bin/view/Main/; status code: 302; bytes
>received: 0; Content-Length: 0; Location:
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX*
>-activeThreads=1, spinWaiting=1, fetchQueues.totalSize=1
>* queue: http://10.2.44.34
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 1000
>   minCrawlDelay = 0
>   nextFetchTime = 1252489345884
>   now           = 1252489345578
>   0. http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
>*fetching
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
>Pre-configured credentials with scope - host: 10.2.44.34; port: 8088; found
>for url:
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
>url: http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX;
>status code: 401; bytes received: 6739; Content-Length: 6739
>401 Authentication Required*
>-finishing thread FetcherThread, activeThreads=0
>-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>-activeThreads=0
>Fetcher: done
>
>
>
>*LOG FILE IS*
>
>
>2009-09-09 15:46:55,602 INFO  fetcher.Fetcher - fetching
>http://10.2.44.34:8088/xwiki/
>2009-09-09 15:46:55,657 INFO  fetcher.Fetcher - -finishing thread
>FetcherThread, activeThreads=1
>2009-09-09 15:46:55,657 INFO  fetcher.Fetcher - -finishing thread
>FetcherThread, activeThreads=1
>2009-09-09 15:46:55,691 INFO  httpclient.Http - http.proxy.host = null
>2009-09-09 15:46:55,691 INFO  httpclient.Http - http.proxy.port = 8080
>2009-09-09 15:46:55,691 INFO  httpclient.Http - http.timeout = 10000
>2009-09-09 15:46:55,691 INFO  httpclient.Http - http.content.limit = -1
>2009-09-09 15:46:55,691 INFO  httpclient.Http - http.agent = iiith/Nutch-1.0
>(kranthili2020@gmail.com)
>2009-09-09 15:46:55,691 INFO  httpclient.Http -
>protocol.plugin.check.blocking = false
>2009-09-09 15:46:55,691 INFO  httpclient.Http - protocol.plugin.check.robots
>= false
>2009-09-09 15:46:55,695 DEBUG httpclient.Http - Credentials - username:
>superadmin; set as default for realm: ; scheme:
>2009-09-09 15:46:55,697 DEBUG httpclient.Http - Credentials - username:
>superadmin; set for AuthScope - host: 10.2.44.34; port: 8088; realm: ;
>scheme:
>*2009-09-09 15:46:55,697 DEBUG httpclient.Http - Pre-configured credentials
>with scope - host: 10.2.44.34; port: 8088; found for url:
>http://10.2.44.34:8088/robots.txt
>2009-09-09 15:46:55,942 DEBUG httpclient.Http - url:
>http://10.2.44.34:8088/robots.txt; status code: 401; bytes received: 6739;
>Content-Length: 6739
>2009-09-09 15:46:55,943 DEBUG httpclient.Http - Pre-configured credentials
>with scope - host: 10.2.44.34; port: 8088; found for url:
>http://10.2.44.34:8088/xwiki/
>2009-09-09 15:46:55,946 INFO  httpclient.HttpMethodDirector - Redirect
>requested but followRedirects is disabled
>2009-09-09 15:46:55,946 DEBUG httpclient.Http - url:
>http://10.2.44.34:8088/xwiki/; status code: 302; bytes received: 0;
>Content-Length: 0; Location: http://10.2.44.34:8088/xwiki/bin/view/Main/*
>2009-09-09 15:46:56,657 INFO  fetcher.Fetcher - -activeThreads=1,
>spinWaiting=1, fetchQueues.totalSize=1
>2009-09-09 15:46:56,658 INFO  fetcher.Fetcher - * queue: http://10.2.44.34
>2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   maxThreads    = 1
>2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   inProgress    = 0
>2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   crawlDelay    = 1000
>2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   minCrawlDelay = 0
>2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   nextFetchTime =
>1252491417050
>2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   now           =
>1252491416658
>2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   0.
>http://10.2.44.34:8088/xwiki/bin/view/Main/
>2009-09-09 15:46:57,051 INFO  fetcher.Fetcher - fetching
>http://10.2.44.34:8088/xwiki/bin/view/Main/
>*2009-09-09 15:46:57,051 DEBUG httpclient.Http - Pre-configured credentials
>with scope - host: 10.2.44.34; port: 8088; found for url:
>http://10.2.44.34:8088/xwiki/bin/view/Main/
>2009-09-09 15:46:57,056 INFO  httpclient.HttpMethodDirector - Redirect
>requested but followRedirects is disabled
>2009-09-09 15:46:57,057 DEBUG httpclient.Http - url:
>http://10.2.44.34:8088/xwiki/bin/view/Main/; status code: 302; bytes
>received: 0; Content-Length: 0; Location:
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1*
>2009-09-09 15:46:57,658 INFO  fetcher.Fetcher - -activeThreads=1,
>spinWaiting=1, fetchQueues.totalSize=1
>2009-09-09 15:46:57,659 INFO  fetcher.Fetcher - * queue: http://10.2.44.34
>2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   maxThreads    = 1
>2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   inProgress    = 0
>2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   crawlDelay    = 1000
>2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   minCrawlDelay = 0
>2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   nextFetchTime =
>1252491418057
>2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   now           =
>1252491417659
>*2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   0.
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
>2009-09-09 15:46:58,058 INFO  fetcher.Fetcher - fetching
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
>2009-09-09 15:46:58,058 DEBUG httpclient.Http - Pre-configured credentials
>with scope - host: 10.2.44.34; port: 8088; found for url:
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
>2009-09-09 15:46:58,170 DEBUG httpclient.Http - url:
>http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1;
>status code: 401; bytes received: 6739; Content-Length: 6739
>2009-09-09 15:46:58,180 DEBUG httpclient.Http - 401 Authentication Required*
>2009-09-09 15:46:58,180 INFO  fetcher.Fetcher - -finishing thread
>FetcherThread, activeThreads=0
>2009-09-09 15:46:58,659 INFO  fetcher.Fetcher - -activeThreads=0,
>spinWaiting=0, fetchQueues.totalSize=0
>2009-09-09 15:46:58,659 INFO  fetcher.Fetcher - -activeThreads=0
>
>
>Thank you in advance,
>
>bye,
>Kranthi Reddy. B


-- 
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
    David M. Cole                                            dmc@colegroup.com
    Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650) 557-2993
    Consultant: The Cole Group <http://colegroup.com/>       F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
