nutch-user mailing list archives

From kranthi reddy <kranthili2...@gmail.com>
Subject Crawling Password Protected Pages
Date Wed, 09 Sep 2009 10:34:41 GMT
Hi all,

I am trying to crawl password-protected web pages on our intranet, but the
fetch keeps failing with a "401 Authentication Required" error and I cannot
work out why. I have gone through the earlier mails on this list about the
same problem, but none of the suggestions resolved it.

Below are the configuration files I have modified, following
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

My URL file contains a single URL, http://10.2.44.34:8088/xwiki/ (this URL
is actually redirected to
http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=CDsTIqqN).


"httpclient-auth.xml"

                 <credentials username="xyz" password="xyz">
                   <default/>
                   <authscope host="10.2.44.34" port="8088"/>
                 </credentials>
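
A quick way to double-check which credentials and scopes such a file declares is to parse it and list them. This is only a minimal sketch: it assumes the fragment sits inside an `<auth-configuration>` root element, and `AUTH_XML` and `list_scopes` are illustrative names, not part of Nutch.

```python
import xml.etree.ElementTree as ET

# Assumption: the <credentials> element shown above is wrapped in an
# <auth-configuration> root element in the real file.
AUTH_XML = """
<auth-configuration>
  <credentials username="xyz" password="xyz">
    <default/>
    <authscope host="10.2.44.34" port="8088"/>
  </credentials>
</auth-configuration>
"""


def list_scopes(xml_text):
    """Return (username, kind, host, port) for every declared scope."""
    root = ET.fromstring(xml_text)
    rows = []
    for cred in root.iter("credentials"):
        user = cred.get("username")
        # A <default/> child marks these credentials as the fallback.
        if cred.find("default") is not None:
            rows.append((user, "default", None, None))
        # Each <authscope> limits the credentials to one host and port.
        for scope in cred.iter("authscope"):
            rows.append((user, "authscope", scope.get("host"), scope.get("port")))
    return rows


if __name__ == "__main__":
    for row in list_scopes(AUTH_XML):
        print(row)
```

If the host or port listed here differs from the one in the seed URL, the scoped credentials would not be selected for that URL.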

"nutch-default.xml"

                 <property>
                   <name>plugin.includes</name>
                   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|zip)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
                 </property>

Output printed to terminal

Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawl/segments/20090909151219
Fetcher: threads: 10
QueueFeeder finished: total 1 records.
fetching http://10.2.44.34:8088/xwiki/
http.proxy.host = null
http.proxy.port = 8080
http.timeout = 10000
http.content.limit = -1
http.agent = iiith/Nutch-1.0 (kranthili2020@gmail.com)
protocol.plugin.check.blocking = false
protocol.plugin.check.robots = false
Credentials - username: superadmin; set as default for realm: ; scheme:
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Credentials - username: superadmin; set for AuthScope - host: 10.2.44.34;
port: 8088; realm: ; scheme:
Pre-configured credentials with scope - host: 10.2.44.34; port: 8088; found
for url: http://10.2.44.34:8088/robots.txt
url: http://10.2.44.34:8088/robots.txt; status code: 401; bytes received:
6739; Content-Length: 6739
Pre-configured credentials with scope - host: 10.2.44.34; port: 8088; found
for url: http://10.2.44.34:8088/xwiki/
url: http://10.2.44.34:8088/xwiki/; status code: 302; bytes received: 0;
Content-Length: 0; Location: http://10.2.44.34:8088/xwiki/bin/view/Main/
-activeThreads=1, spinWaiting=1, fetchQueues.totalSize=1
* queue: http://10.2.44.34
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 1000
  minCrawlDelay = 0
  nextFetchTime = 1252489344874
  now           = 1252489344577
  0. http://10.2.44.34:8088/xwiki/bin/view/Main/
fetching http://10.2.44.34:8088/xwiki/bin/view/Main/
Pre-configured credentials with scope - host: 10.2.44.34; port: 8088; found
for url: http://10.2.44.34:8088/xwiki/bin/view/Main/
url: http://10.2.44.34:8088/xwiki/bin/view/Main/; status code: 302; bytes
received: 0; Content-Length: 0; Location:
http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
-activeThreads=1, spinWaiting=1, fetchQueues.totalSize=1
* queue: http://10.2.44.34
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 1000
  minCrawlDelay = 0
  nextFetchTime = 1252489345884
  now           = 1252489345578
  0. http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
fetching
http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
Pre-configured credentials with scope - host: 10.2.44.34; port: 8088; found
for url:
http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
url: http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX;
status code: 401; bytes received: 6739; Content-Length: 6739
401 Authentication Required
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
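
The sequence of responses in the output above (302, 302, then 401) is easier to follow when pulled out of the surrounding fetcher chatter. The following is a minimal sketch that extracts each `url: ...; status code: ...` entry from pasted output, tolerating the line wrapping added by the mail client. `HOP`, `redirect_chain` and `SAMPLE` are illustrative names, and the pattern assumes the exact field order seen in this output.

```python
import re

# Matches one "url: ...; status code: ...; ..." entry, with an optional
# Location field. Field order is assumed to be as in the output above.
HOP = re.compile(
    r"url: (?P<url>\S+); status code: (?P<status>\d{3}); "
    r"bytes received: \d+; Content-Length: \d+"
    r"(?:; Location: (?P<location>\S+))?"
)

SAMPLE = """\
url: http://10.2.44.34:8088/xwiki/; status code: 302; bytes received: 0;
Content-Length: 0; Location: http://10.2.44.34:8088/xwiki/bin/view/Main/
url: http://10.2.44.34:8088/xwiki/bin/view/Main/; status code: 302; bytes
received: 0; Content-Length: 0; Location:
http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
url: http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX;
status code: 401; bytes received: 6739; Content-Length: 6739
401 Authentication Required
"""


def redirect_chain(log_text):
    """Return a list of (url, status, location_or_None) tuples."""
    # Collapse all whitespace so wrapped entries become one line each.
    flat = " ".join(log_text.split())
    return [
        (m["url"], int(m["status"]), m["location"])
        for m in HOP.finditer(flat)
    ]


if __name__ == "__main__":
    for url, status, location in redirect_chain(SAMPLE):
        print(status, url, "->", location)
```

Read this way, the seed URL redirects to the main page, which redirects to the login page, and it is the login page itself that answers with 401.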



Log file


2009-09-09 15:46:55,602 INFO  fetcher.Fetcher - fetching
http://10.2.44.34:8088/xwiki/
2009-09-09 15:46:55,657 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2009-09-09 15:46:55,657 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2009-09-09 15:46:55,691 INFO  httpclient.Http - http.proxy.host = null
2009-09-09 15:46:55,691 INFO  httpclient.Http - http.proxy.port = 8080
2009-09-09 15:46:55,691 INFO  httpclient.Http - http.timeout = 10000
2009-09-09 15:46:55,691 INFO  httpclient.Http - http.content.limit = -1
2009-09-09 15:46:55,691 INFO  httpclient.Http - http.agent = iiith/Nutch-1.0
(kranthili2020@gmail.com)
2009-09-09 15:46:55,691 INFO  httpclient.Http -
protocol.plugin.check.blocking = false
2009-09-09 15:46:55,691 INFO  httpclient.Http - protocol.plugin.check.robots
= false
2009-09-09 15:46:55,695 DEBUG httpclient.Http - Credentials - username:
superadmin; set as default for realm: ; scheme:
2009-09-09 15:46:55,697 DEBUG httpclient.Http - Credentials - username:
superadmin; set for AuthScope - host: 10.2.44.34; port: 8088; realm: ;
scheme:
2009-09-09 15:46:55,697 DEBUG httpclient.Http - Pre-configured credentials
with scope - host: 10.2.44.34; port: 8088; found for url:
http://10.2.44.34:8088/robots.txt
2009-09-09 15:46:55,942 DEBUG httpclient.Http - url:
http://10.2.44.34:8088/robots.txt; status code: 401; bytes received: 6739;
Content-Length: 6739
2009-09-09 15:46:55,943 DEBUG httpclient.Http - Pre-configured credentials
with scope - host: 10.2.44.34; port: 8088; found for url:
http://10.2.44.34:8088/xwiki/
2009-09-09 15:46:55,946 INFO  httpclient.HttpMethodDirector - Redirect
requested but followRedirects is disabled
2009-09-09 15:46:55,946 DEBUG httpclient.Http - url:
http://10.2.44.34:8088/xwiki/; status code: 302; bytes received: 0;
Content-Length: 0; Location: http://10.2.44.34:8088/xwiki/bin/view/Main/
2009-09-09 15:46:56,657 INFO  fetcher.Fetcher - -activeThreads=1,
spinWaiting=1, fetchQueues.totalSize=1
2009-09-09 15:46:56,658 INFO  fetcher.Fetcher - * queue: http://10.2.44.34
2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   maxThreads    = 1
2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   inProgress    = 0
2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   crawlDelay    = 1000
2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   minCrawlDelay = 0
2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   nextFetchTime =
1252491417050
2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   now           =
1252491416658
2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   0.
http://10.2.44.34:8088/xwiki/bin/view/Main/
2009-09-09 15:46:57,051 INFO  fetcher.Fetcher - fetching
http://10.2.44.34:8088/xwiki/bin/view/Main/
2009-09-09 15:46:57,051 DEBUG httpclient.Http - Pre-configured credentials
with scope - host: 10.2.44.34; port: 8088; found for url:
http://10.2.44.34:8088/xwiki/bin/view/Main/
2009-09-09 15:46:57,056 INFO  httpclient.HttpMethodDirector - Redirect
requested but followRedirects is disabled
2009-09-09 15:46:57,057 DEBUG httpclient.Http - url:
http://10.2.44.34:8088/xwiki/bin/view/Main/; status code: 302; bytes
received: 0; Content-Length: 0; Location:
http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
2009-09-09 15:46:57,658 INFO  fetcher.Fetcher - -activeThreads=1,
spinWaiting=1, fetchQueues.totalSize=1
2009-09-09 15:46:57,659 INFO  fetcher.Fetcher - * queue: http://10.2.44.34
2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   maxThreads    = 1
2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   inProgress    = 0
2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   crawlDelay    = 1000
2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   minCrawlDelay = 0
2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   nextFetchTime =
1252491418057
2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   now           =
1252491417659
2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   0.
http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
2009-09-09 15:46:58,058 INFO  fetcher.Fetcher - fetching
http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
2009-09-09 15:46:58,058 DEBUG httpclient.Http - Pre-configured credentials
with scope - host: 10.2.44.34; port: 8088; found for url:
http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
2009-09-09 15:46:58,170 DEBUG httpclient.Http - url:
http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1;
status code: 401; bytes received: 6739; Content-Length: 6739
2009-09-09 15:46:58,180 DEBUG httpclient.Http - 401 Authentication Required
2009-09-09 15:46:58,180 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=0
2009-09-09 15:46:58,659 INFO  fetcher.Fetcher - -activeThreads=0,
spinWaiting=0, fetchQueues.totalSize=0
2009-09-09 15:46:58,659 INFO  fetcher.Fetcher - -activeThreads=0
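
The log records the 401 status but not the response headers. A 401 response normally carries a `WWW-Authenticate` header naming the scheme the server expects, so printing it may help narrow things down. The sketch below is one possible way to look at that header from outside Nutch; `probe` and `challenge_schemes` are illustrative names, the parsing is deliberately simplistic, and `probe` has to be run from a machine that can reach the intranet host.

```python
import urllib.error
import urllib.request


def challenge_schemes(header_values):
    """Return the leading scheme token of each WWW-Authenticate value.

    Simplification: a single header value may list several challenges
    separated by commas; only the first token of each value is reported.
    """
    return [v.split(None, 1)[0] for v in header_values if v.strip()]


def probe(url, timeout=10):
    """Fetch url and return (status, schemes offered on an error response)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, []
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 401; its headers hold the challenge.
        values = err.headers.get_all("WWW-Authenticate") or []
        return err.code, challenge_schemes(values)


if __name__ == "__main__":
    # The realm below is a made-up example value, not taken from the server.
    print(challenge_schemes(['Basic realm="example"', "NTLM"]))
    # On the intranet: print(probe("http://10.2.44.34:8088/xwiki/"))
```

If the 401 arrives with no `WWW-Authenticate` header at all, one possibility (an assumption, not something the log confirms) is that the login page expects an HTML form login, which the credentials in httpclient-auth.xml would not satisfy.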


Thank you in advance,

bye,
Kranthi Reddy. B
