nutch-user mailing list archives

From kranthi reddy <kranthili2...@gmail.com>
Subject Re: Crawling Password Protected Pages
Date Fri, 11 Sep 2009 18:13:51 GMT
hi,

 I had tried removing the authscope tag as well, but that turned out not to
be the problem. The real cause is that the pages I am trying to crawl use
POST-based authentication. Any suggestions on how to crawl pages that use
POST-based authentication?

bye,
kranthi

On Wed, Sep 9, 2009 at 8:55 PM, David M. Cole <dmc@colegroup.com> wrote:

> kranthi:
>
> i would try removing the authscope tag from the httpclient-auth.xml. though
> in my case i'm not going to an alternate port and you are, my working file
> does not have an authscope tag.
>
> if that doesn't help, since you are crawling an intranet, do you have
> access to the http server's log? seeing that might help.
>
> \dmc
>
>
>
> At 4:04 PM +0530 9/9/09, kranthi reddy wrote:
>
>> Hi all,
>>
>>  I am trying to crawl password-protected web pages on our intranet.
>> I don't know why the "*401 Authentication Required*" error keeps coming up.
>> I have gone through the previous mails sent by others, but the problem is
>> still not resolved.
>>
>> Below are the configuration files I have modified as described in
>> "http://wiki.apache.org/nutch/HttpAuthenticationSchemes"
>>
>> My URL file contains a single URL, *"http://10.2.44.34:8088/xwiki/"* (this
>> URL is actually being redirected to
>> *"http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=CDsTIqqN"*)
>>
>> *"httpclient-auth.xml* "
>>
>>                 <credentials username="xyz" password="xyz">
>>                 <default/>
>>                 <authscope host="10.2.44.34" port="8088"/>
>>                 </credentials>
>>
>> *"nutch-default.xml"*
>>
>>                 <property>
>>                 <name>plugin.includes</name>
>>                 <value>*protocol-httpclient|*
>>
>> urlfilter-regex|parse-(text|html|js|zip)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|
>>
>> summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>                 </property>
>>
>> *Output printed to terminal*
>>
>> Fetcher: Your 'http.agent.name' value should be listed first in
>> 'http.robots.agents' property.
>> Fetcher: starting
>> Fetcher: segment: crawl/segments/20090909151219
>> Fetcher: threads: 10
>> QueueFeeder finished: total 1 records.
>> fetching http://10.2.44.34:8088/xwiki/
>> http.proxy.host = null
>> http.proxy.port = 8080
>> http.timeout = 10000
>> http.content.limit = -1
>> http.agent = iiith/Nutch-1.0 (kranthili2020@gmail.com)
>> protocol.plugin.check.blocking = false
>> protocol.plugin.check.robots = false
>> *Credentials - username: superadmin; set as default for realm: ; scheme:*
>> -finishing thread FetcherThread, activeThreads=1
>> -finishing thread FetcherThread, activeThreads=1
>> *Credentials - username: superadmin; set for AuthScope - host: 10.2.44.34;
>> port: 8088; realm: ; scheme:
>> Pre-configured credentials with scope - host: 10.2.44.34; port: 8088;
>> found
>> for url: http://10.2.44.34:8088/robots.txt
>> url: http://10.2.44.34:8088/robots.txt; status code: 401; bytes received:
>> 6739; Content-Length: 6739
>> Pre-configured credentials with scope - host: 10.2.44.34; port: 8088;
>> found
>> for url: http://10.2.44.34:8088/xwiki/
>> url: http://10.2.44.34:8088/xwiki/; status code: 302; bytes received: 0;
>> Content-Length: 0; Location: http://10.2.44.34:8088/xwiki/bin/view/Main/*
>> -activeThreads=1, spinWaiting=1, fetchQueues.totalSize=1
>> * queue: http://10.2.44.34
>>  maxThreads    = 1
>>  inProgress    = 0
>>  crawlDelay    = 1000
>>  minCrawlDelay = 0
>>  nextFetchTime = 1252489344874
>>  now           = 1252489344577
>>  0. http://10.2.44.34:8088/xwiki/bin/view/Main/
>> *fetching http://10.2.44.34:8088/xwiki/bin/view/Main/
>> Pre-configured credentials with scope - host: 10.2.44.34; port: 8088;
>> found
>> for url: http://10.2.44.34:8088/xwiki/bin/view/Main/
>> url: http://10.2.44.34:8088/xwiki/bin/view/Main/; status code: 302; bytes
>> received: 0; Content-Length: 0; Location:
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX*
>> -activeThreads=1, spinWaiting=1, fetchQueues.totalSize=1
>> * queue: http://10.2.44.34
>>  maxThreads    = 1
>>  inProgress    = 0
>>  crawlDelay    = 1000
>>  minCrawlDelay = 0
>>  nextFetchTime = 1252489345884
>>  now           = 1252489345578
>>  0. http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
>> *fetching
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
>> Pre-configured credentials with scope - host: 10.2.44.34; port: 8088;
>> found
>> for url:
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
>> url:
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX;
>> status code: 401; bytes received: 6739; Content-Length: 6739
>> 401 Authentication Required*
>> -finishing thread FetcherThread, activeThreads=0
>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> -activeThreads=0
>> Fetcher: done
>>
>>
>>
>> *LOG FILE IS*
>>
>>
>> 2009-09-09 15:46:55,602 INFO  fetcher.Fetcher - fetching
>> http://10.2.44.34:8088/xwiki/
>> 2009-09-09 15:46:55,657 INFO  fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2009-09-09 15:46:55,657 INFO  fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2009-09-09 15:46:55,691 INFO  httpclient.Http - http.proxy.host = null
>> 2009-09-09 15:46:55,691 INFO  httpclient.Http - http.proxy.port = 8080
>> 2009-09-09 15:46:55,691 INFO  httpclient.Http - http.timeout = 10000
>> 2009-09-09 15:46:55,691 INFO  httpclient.Http - http.content.limit = -1
>> 2009-09-09 15:46:55,691 INFO  httpclient.Http - http.agent =
>> iiith/Nutch-1.0
>> (kranthili2020@gmail.com)
>> 2009-09-09 15:46:55,691 INFO  httpclient.Http -
>> protocol.plugin.check.blocking = false
>> 2009-09-09 15:46:55,691 INFO  httpclient.Http -
>> protocol.plugin.check.robots
>> = false
>> 2009-09-09 15:46:55,695 DEBUG httpclient.Http - Credentials - username:
>> superadmin; set as default for realm: ; scheme:
>> 2009-09-09 15:46:55,697 DEBUG httpclient.Http - Credentials - username:
>> superadmin; set for AuthScope - host: 10.2.44.34; port: 8088; realm: ;
>> scheme:
>> *2009-09-09 15:46:55,697 DEBUG httpclient.Http - Pre-configured
>> credentials
>> with scope - host: 10.2.44.34; port: 8088; found for url:
>> http://10.2.44.34:8088/robots.txt
>> 2009-09-09 15:46:55,942 DEBUG httpclient.Http - url:
>> http://10.2.44.34:8088/robots.txt; status code: 401; bytes received:
>> 6739;
>> Content-Length: 6739
>> 2009-09-09 15:46:55,943 DEBUG httpclient.Http - Pre-configured credentials
>> with scope - host: 10.2.44.34; port: 8088; found for url:
>> http://10.2.44.34:8088/xwiki/
>> 2009-09-09 15:46:55,946 INFO  httpclient.HttpMethodDirector - Redirect
>> requested but followRedirects is disabled
>> 2009-09-09 15:46:55,946 DEBUG httpclient.Http - url:
>> http://10.2.44.34:8088/xwiki/; status code: 302; bytes received: 0;
>> Content-Length: 0; Location: http://10.2.44.34:8088/xwiki/bin/view/Main/*
>> 2009-09-09 15:46:56,657 INFO  fetcher.Fetcher - -activeThreads=1,
>> spinWaiting=1, fetchQueues.totalSize=1
>> 2009-09-09 15:46:56,658 INFO  fetcher.Fetcher - * queue:
>> http://10.2.44.34
>> 2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   maxThreads    = 1
>> 2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   inProgress    = 0
>> 2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   crawlDelay    = 1000
>> 2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   minCrawlDelay = 0
>> 2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   nextFetchTime =
>> 1252491417050
>> 2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   now           =
>> 1252491416658
>> 2009-09-09 15:46:56,658 INFO  fetcher.Fetcher -   0.
>> http://10.2.44.34:8088/xwiki/bin/view/Main/
>> 2009-09-09 15:46:57,051 INFO  fetcher.Fetcher - fetching
>> http://10.2.44.34:8088/xwiki/bin/view/Main/
>> *2009-09-09 15:46:57,051 DEBUG httpclient.Http - Pre-configured
>> credentials
>>
>> with scope - host: 10.2.44.34; port: 8088; found for url:
>> http://10.2.44.34:8088/xwiki/bin/view/Main/
>> 2009-09-09 15:46:57,056 INFO  httpclient.HttpMethodDirector - Redirect
>> requested but followRedirects is disabled
>> 2009-09-09 15:46:57,057 DEBUG httpclient.Http - url:
>> http://10.2.44.34:8088/xwiki/bin/view/Main/; status code: 302; bytes
>> received: 0; Content-Length: 0; Location:
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1*
>> 2009-09-09 15:46:57,658 INFO  fetcher.Fetcher - -activeThreads=1,
>> spinWaiting=1, fetchQueues.totalSize=1
>> 2009-09-09 15:46:57,659 INFO  fetcher.Fetcher - * queue:
>> http://10.2.44.34
>> 2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   maxThreads    = 1
>> 2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   inProgress    = 0
>> 2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   crawlDelay    = 1000
>> 2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   minCrawlDelay = 0
>> 2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   nextFetchTime =
>> 1252491418057
>> 2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   now           =
>> 1252491417659
>> *2009-09-09 15:46:57,659 INFO  fetcher.Fetcher -   0.
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
>> 2009-09-09 15:46:58,058 INFO  fetcher.Fetcher - fetching
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
>> 2009-09-09 15:46:58,058 DEBUG httpclient.Http - Pre-configured credentials
>> with scope - host: 10.2.44.34; port: 8088; found for url:
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
>> 2009-09-09 15:46:58,170 DEBUG httpclient.Http - url:
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1;
>> status code: 401; bytes received: 6739; Content-Length: 6739
>> 2009-09-09 15:46:58,180 DEBUG httpclient.Http - 401 Authentication
>> Required*
>> 2009-09-09 15:46:58,180 INFO  fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=0
>> 2009-09-09 15:46:58,659 INFO  fetcher.Fetcher - -activeThreads=0,
>> spinWaiting=0, fetchQueues.totalSize=0
>> 2009-09-09 15:46:58,659 INFO  fetcher.Fetcher - -activeThreads=0
>>
>>
>> Thank you in advance,
>>
>> bye,
>> Kranthi Reddy. B
>>
>
>
> --
>
> *+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
>   David M. Cole
> dmc@colegroup.com
>   Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650)
> 557-2993
>   Consultant: The Cole Group <http://colegroup.com/>       F: (650)
> 475-8479
>
> *+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
>
