nutch-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "HttpAuthenticationSchemes" by susam
Date Sun, 04 Nov 2007 09:59:49 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
removed conf/nutch-site.xml conf

------------------------------------------------------------------------------
  == Download ==
  Currently, these features are present in the form of a patch in JIRA. Download the patch
from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] and apply it to trunk.
  
+ == Configuration ==
+ This is an advanced feature that lets the user specify different credentials for different
authentication scopes. This section does not describe the default configuration. Some parts
of this section might be outdated; where they differ, follow the guidelines in 'conf/httpclient-auth.xml',
which are correct. This section will be improved later when time permits.
- == Common Credentials Configuration ==
- This is the simplest possible configuration which involves setting just one set of credentials.
It is useful in trusted Intranets where all sites are trusted and require the same username/password
for authentication.
- 
- === Quick Guide ===
-  1. Include 'protocol-httpclient' in 'plugin.includes'.
-  1. For basic or digest authentication in proxy server, set 'http.proxy.username' and 'http.proxy.password'.
Also, set 'http.proxy.realm' if you want to specify a realm as the authentication scope.
-  1. For NTLM authentication in proxy server, set 'http.proxy.username', 'http.proxy.password',
'http.proxy.realm' and 'http.auth.host'. 'http.proxy.realm' is the NTLM domain name. 'http.auth.host'
is the host where the crawler is running.
-  1. For basic or digest authentication in web servers, set 'http.auth.username' and 'http.auth.password'.
Also, set 'http.auth.realm' if you want to specify a realm as the authentication scope.
-  1. For NTLM authentication in web servers, set 'http.auth.username', 'http.auth.password',
'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' is the NTLM domain name. 'http.auth.host'
is the host where the crawler is running.
- 
- This is explained in detail in the following section.
- 
- === Details ===
- To use 'protocol-httpclient', 'conf/nutch-site.xml' has to be edited to include some properties,
which are explained in this section. First and foremost, to enable the plugin, it
must be added to the 'plugin.includes' property of 'nutch-site.xml'. This property would typically
look like:
- 
- {{{<property>
-   <name>plugin.includes</name>
-   <value>protocol-httpclient|urlfilter-regex|...</value>
-   <description>...</description>
- </property>}}}
- 
- (... indicates a long line truncated)
- 
- Next, if the proxy server requires authentication, the following properties need to be
set in 'conf/nutch-site.xml'.
- 
-  * http.proxy.username
-  * http.proxy.password
-  * http.proxy.realm (If a realm needs to be provided. In case of NTLM authentication, the
domain name should be provided as its value.)
-  * http.auth.host (This is required in case of NTLM authentication only. This is the host
where the crawler would be running.)
- 
- If the web servers of the intranet are in a particular domain or realm and require authentication,
these properties should be set in 'conf/nutch-site.xml'.
- 
-  * http.auth.username
-  * http.auth.password
-  * http.auth.realm
-  * http.auth.host
- 
- The explanations of these properties are similar to those of the proxy authentication properties.
As you might have noticed, 'http.auth.host' is used for proxy NTLM authentication as well
as web server NTLM authentication. Since the host from which the HTTP requests originate
is the same in both cases, a single property is used instead of two separate
ones.
- 
- Even though the 'http.auth.host' property is required only for NTLM authentication, it
is advisable to set it in all cases so that, if the crawler comes across a server
that requires NTLM authentication (which you might not have anticipated), it can
still fetch the page.
- 
- == Authentication Scope Specific Credentials ==
- This is an advanced feature that lets the user specify different credentials for different
authentication scopes.
  
  === Quick Guide ===
  An example of 'conf/httpclient-auth.xml' configuration is provided below:
@@ -98, +57 @@

  
  The 'realm' attribute is optional in the <authscope> tag and can be omitted if you
want the credentials to be used for all realms on a particular web server (or all remaining
realms as shown in the Quick Guide section above). An authentication scope should not be
defined twice, that is, as separate <authscope> tags under different <credentials> tags. However,
if this is done by mistake, the credentials for the last defined <authscope> tag will
be used. This is because the XML parsing code reads the file from top to bottom and sets
the credentials for each authentication scope as it goes. If the same authentication scope is encountered
again, it is overwritten with the new credentials. However, one should not rely
on this behavior, as it might change with further development.
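
As an illustration of the overwrite behavior described above, here is a minimal sketch of a 'conf/httpclient-auth.xml' in which one authentication scope is defined twice by mistake. Only the <credentials> and <authscope> tags and the 'realm' attribute are taken from the text; the root element name, the 'username', 'password', 'host' and 'port' attribute names, and every host name and credential shown are assumptions made for illustration, so check them against the comments in the file shipped with the patch.

{{{<auth-configuration>
  <!-- Hypothetical values. No 'realm' attribute here, so these credentials
       would be used for all remaining realms on this web server. -->
  <credentials username="alice" password="secret1">
    <authscope host="intranet.example.com" port="80"/>
  </credentials>
  <credentials username="bob" password="secret2">
    <authscope host="intranet.example.com" port="80" realm="reports"/>
  </credentials>
  <!-- Mistake: this repeats the scope defined for bob. With top-to-bottom
       parsing, carol's credentials would replace bob's for that scope. -->
  <credentials username="carol" password="secret3">
    <authscope host="intranet.example.com" port="80" realm="reports"/>
  </credentials>
</auth-configuration>}}}

Removing the repeated <authscope> entry is the safer fix, since the last-one-wins outcome is not guaranteed to stay the same.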
  
- 
  == Underlying HttpClient Library ==
  'protocol-httpclient' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/
Jakarta Commons HttpClient]. Some servers support multiple schemes for authenticating users.
Given that only one scheme may be used at a time for authenticating, HttpClient must choose which
scheme to use. To accomplish this, it uses an order of preference to select the authentication
scheme. By default this order is: NTLM, Digest, Basic. For more information on the behavior
during authentication, you might want to read the [http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html
HttpClient Authentication Guide].
  
