incubator-connectors-commits mailing list archives

Subject svn commit: r935779 [3/3] - in /incubator/lcf/site: publish/ publish/images/ src/documentation/content/xdocs/ src/documentation/resources/images/
Date Tue, 20 Apr 2010 00:56:25 GMT
Modified: incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml
--- incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml (original)
+++ incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml Tue Apr
20 00:56:24 2010
@@ -716,6 +716,128 @@
                 <p>In other words, the Web connection type is neither as easy to configure,
nor as well-targeted in its separation of links and data, as the RSS connection type.  For this
                        reason, we strongly encourage you to consider using the RSS connection
type for all applications where it might reasonably apply.</p>
+                <p>Many users of the Web connection type set up their jobs to run continuously,
configuring them either to refetch documents occasionally, or to never refetch documents
+                       and instead to expire them after some period of time.</p>
+                <p>A connection of the Web connection type has the following special
tabs: "Email", "Robots", "Bandwidth", "Access Credentials", and "Certificates".  The "Email"
+                       tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/web-configure-email.PNG" alt="Web Connection, Email
tab" width="80%"/>
+                <br/><br/>
+                <p>Enter an email address.  This email address will be included in
all requests made by the Web connection, so that webmasters can report any difficulties that their
+                       sites experience as the result of improper throttling, etc.</p>
+                <p>This field is mandatory.  While the Web connection type makes no
effort to validate the correctness of the email
+                       field, you will probably want to remain a good web citizen and provide
a valid email address.  Remember that it is very easy for a webmaster to block access to
+                       a crawler that does not seem to be behaving in a polite manner.</p>
+                <p>The "Robots" tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/web-configure-robots.PNG" alt="Web Connection, Robots
tab" width="80%"/>
+                <br/><br/>
+                <p>Select how the connection will interpret robots.txt.  Remember that
you have an interest in crawling people's sites as politely as possible.</p>
+                <p>The "Bandwidth" tab allows you to specify a list of bandwidth rules.
 Each rule has a regular expression matched against a URL's throttle bin.
+                       Throttle bins, in connections of the Web type, are simply the server
name part of the URL.  Each rule allows you to select a maximum bandwidth, number of
+                       connections, and fetch rate.  You can have as many rules as you like;
if a URL matches more than one rule, then the most conservative value will be used.</p>
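+                <p>As a rough illustration of the "most conservative value" behavior described
above, the following Python sketch keeps the smallest limit among all matching rules.  The class
and field names, the units, and the use of an unanchored regular expression search are assumptions
made for this example; they are not taken from the Web connection type's implementation.</p>

```python
import re
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class BandwidthRule:
    # Hypothetical field names and units, chosen only for this sketch.
    bin_regex: str
    max_connections: int
    max_kb_per_second: float
    max_fetches_per_minute: float


def throttle_bin(url):
    # For the Web connection type, the throttle bin is the server name part of the URL.
    return urlsplit(url).hostname or ""


def effective_limits(url, rules):
    # Find every rule whose regular expression matches the URL's throttle bin.
    bin_name = throttle_bin(url)
    matching = [r for r in rules if re.search(r.bin_regex, bin_name)]
    if not matching:
        return None  # no rule matched, so no throttling applies
    # Keep the smallest (most conservative) value of each limit.
    return {
        "max_connections": min(r.max_connections for r in matching),
        "max_kb_per_second": min(r.max_kb_per_second for r in matching),
        "max_fetches_per_minute": min(r.max_fetches_per_minute for r in matching),
    }
```

+                <p>In this sketch, a rule with an empty regular expression matches every bin, so it
can serve as a catch-all default alongside stricter rules for individual servers.</p>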
+                <p>This is what the "Bandwidth" tab looks like:</p>
+                <br/><br/>
+                <figure src="images/web-configure-bandwidth.PNG" alt="Web Connection,
Bandwidth tab" width="80%"/>
+                <br/><br/>
+                <p>The screen shot shows the tab configured with a setting that is
reasonably polite.  The default value for this tab is blank, meaning that, by default, there
is no throttling
+                       whatsoever!  Please do not make the mistake of crawling other people's
sites without adequate politeness parameters in place.</p>
+                <p>To add a rule, fill in the regular expression and the appropriate
rule limit values, and click the "Add" button.</p>
+                <p>The "Bandwidth" tab is related to the throttles that you can set
on the "Throttling" tab in the following ways:</p>
+                <br/>
+                <ul>
+                    <li>The "Bandwidth" tab sets the <b>maximum</b> values,
while the "Throttling" tab sets the <b>average</b> values.</li>
+                    <li>The "Bandwidth" tab does not affect how documents are scheduled
in the queue; it simply blocks documents until it is safe to go ahead, which will use up a
crawler thread
+                           for the entire period that both the wait and the fetch take place.
 The "Throttling" tab affects how often documents are scheduled, so it does not waste threads.</li>
+                </ul>
+                <br/>
+                <p>Because of the above, we suggest that you configure your Web connection
using <b>both</b> the "Bandwidth" <b>and</b> the "Throttling" tabs.
 Select maximum
+                       values on the "Bandwidth" tab, and corresponding average value estimates
on the "Throttling" tab.  Remember that a document identifier with the Web connection type
is the
+                       document's URL, and the bin name for that URL is the server name.
 Also, please note that the "Maximum number of connections per JVM" field's default value
of 10 is
+                       unlikely to be correct for connections of the Web type; you should
have at least one available connection per worker thread, for best performance.  Since the
+                       default number of worker threads is 30, you should set this parameter
to at least a value of 30 for normal operation.</p>
+                <p>The Web connection type's "Access Credentials" tab describes how
pages get authenticated.  There is support on this tab for both page-based authentication
+                       (basic auth or all forms of NTLM), as well as session-based authentication
(which involves the fetch of many pages to establish a logged-in session).  The initial
+                       appearance of the "Access Credentials" tab shows both kinds of authentication:</p>
+                <br/><br/>
+                <figure src="images/web-configure-access-credentials.PNG" alt="Web Connection,
Access Credentials tab" width="80%"/>
+                <br/><br/>
+                <p>Each kind of authentication has its own list of rules.</p>
+                <p>Specifying a page authentication rule requires simply knowing what
URLs are protected, and what the proper
+                       authentication method and credentials are for those URLs.  Enter a
regular expression describing the protected URLs, and select the proper authentication method.
+                       Fill in the credentials.  Click the "Add" button.</p>
+                <p>Specifying a correct session authentication rule usually requires
some research.  A single session-authentication rule usually corresponds to a single session-protected
+                       site.  For that site, you will need to be able to describe the following
for session authentication to function:</p>
+                <br/>
+                <ul>
+                    <li>The URLs of pages that are protected by this particular site's
session security</li>
+                    <li>How to detect when a page fetch is part of the login sequence</li>
+                    <li>How to fill in the appropriate forms within the login sequence
with appropriate login information</li>
+                </ul>
+                <br/>
+                <p>The Web connection type labels pages that are part of the login
sequence "login pages", and pages that are protected site content "content pages".  The Web
+                       connection type will not attempt to index login pages.  They are special
pages that have but one purpose: establishing an authenticated session.</p>
+                <p>If all this is not complicated enough, your research also has to
cover two very different cases: first, when you are entering the site anew, and second, when
you try to fetch
+                       a content page and you are no longer logged in, because your session
has expired.  In both cases, the session authentication rule must be able to properly log
in and
+                       fetch content, because you cannot control when a page will be fetched
or refetched by the Framework.</p>
+                <p>You declare a page to be a login page by identifying it both by
its URL, and by what the crawler finds on the page when it fetches it.  For example, some
+                       sites may redirect you to a login screen when your session expires.
 So, instead of fetching content, you would be fetching a redirection to a specific page.
 You do <b>not</b>
+                       want either the redirection, or the login screen, to be considered
content pages.  The correct way to handle such a setup would be to declare one kind of login
page to consist
+                       of a redirection to the login screen URL, and another kind of login
page to consist of the login screen URL with the appropriate form.  Furthermore, you would
want to supply
+                       the correct login data for the form, and allow the form to be submitted,
and so the login form's target may also need to be declared as a login page.</p>
+                <p>The kinds of content that the Web connection type can recognize
as a login page are the following:</p>
+                <br/>
+                <ul>
+                    <li>A redirection to a specific URL, as described by a regular expression</li>
+                    <li>A page that has a form of a particular name on it, as described
by a regular expression</li>
+                    <li>A page that has a link on it to a specific target, as described
by a regular expression</li>
+                </ul>
+                <br/>
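+                <p>The three kinds of login page listed above can be pictured with a small Python
sketch.  It is only a model of the idea: the LoginPageRule and FetchedPage classes, their fields,
and the matching order are hypothetical and do not describe the connector's actual code.</p>

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoginPageRule:
    url_regex: str    # which fetched URLs this description applies to
    page_type: str    # "redirection", "form", or "link"
    match_regex: str  # redirect target, form name, or link target


@dataclass(frozen=True)
class FetchedPage:
    url: str
    redirect_target: Optional[str] = None
    form_names: tuple = ()
    link_targets: tuple = ()


def candidates(page, page_type):
    # Return the strings that a rule of the given type is matched against.
    if page_type == "redirection":
        return (page.redirect_target,) if page.redirect_target else ()
    if page_type == "form":
        return page.form_names
    if page_type == "link":
        return page.link_targets
    return ()


def matching_login_rule(page, rules):
    # A page is a login page only if both its URL and its content match a rule.
    for rule in rules:
        if not re.search(rule.url_regex, page.url):
            continue
        if any(re.search(rule.match_regex, c) for c in candidates(page, rule.page_type)):
            return rule
    return None  # no rule matched: treat the page as a content page
```

+                <p>A page for which no description matches is treated as a content page in this
model, which mirrors the distinction between "login pages" and "content pages" made earlier.</p>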
+                <p>To add a session authentication rule, fill in a regular expression
describing the site pages that are being protected, and click the "Add" button:</p>
+                <br/><br/>
+                <figure src="images/web-configure-access-credentials-session.PNG" alt="Web
Connection, Access Credentials tab" width="80%"/>
+                <br/><br/>
+                <p>Note that you can now add login page descriptions to the newly-created
rule.  To add a login page description, enter a URL regular expression, a type of login page, and a
+                       target link or form name regular expression, and then click the "Add" button.</p>
+                <p>When you add a login page of the "form" type, you can then add form
fill-in information to the login page, as seen below:</p>
+                <br/><br/>
+                <figure src="images/web-configure-access-credentials-session-form.PNG"
alt="Web Connection, Access Credentials tab" width="80%"/>
+                <br/><br/>
+                <p>Supply a regular expression for the name of the form element you
want to set, and also provide a value.  If you want the value to not be visible in clear text,
fill in the
+                       "password" column instead of the "value" column.  You can usually
figure out the name of the form and its elements by viewing the source of the HTML page in
+                       a browser.  When you are done, click the "Add" button.</p>
+                <p>Form data that is not specified will be posted with the default
value determined by the HTML of the page.  The Web connection type is unable, at this time,
to execute
+                       Javascript, and therefore you may need to fill out some form values
that are filled in by Javascript in order to get the form to post in a useful way.  If you
have a form
+                       that relies heavily on Javascript to post properly, you may need considerable
effort and web programming skills to figure out how to get these forms to post properly
+                       with the Web Connector.  Luckily, such obfuscated login screens are
still rare.</p>
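+                <p>The way unspecified form data falls back to the page's defaults can be sketched
in Python as follows.  The function name and the data shapes are assumptions made for illustration,
as is the choice to apply overrides only to elements that already exist in the form.</p>

```python
import re


def build_form_post(html_defaults, overrides):
    """Merge the form's default values with user-supplied fill-in values.

    html_defaults: dict of element name to the default value found in the HTML.
    overrides: list of (name_regex, value) pairs, as entered on the tab.
    """
    posted = dict(html_defaults)
    for name_regex, value in overrides:
        for name in posted:
            # Any element whose name matches the regular expression gets the value.
            if re.search(name_regex, name):
                posted[name] = value
    return posted
```

+                <p>For example, a hidden token field that the page already fills in would be posted
unchanged, while the user name and password elements would receive the values supplied in the rule.</p>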
+                <p>A series of login pages forms a "login page sequence" for the site.
 For each login page, the Web connection decides what page to fetch next by what you specified
+                       in the login page criteria.  So, for a redirection to a specific URL,
the next page to be fetched will be that redirected URL.  For a form, the next page fetched
will be the
+                       action page indicated by the specified form.  For a link to a target,
the next page fetched will be the target URL.  When the login page sequence ends, the next
+                       page fetched after that will be the original content page that the Web
connection was trying to fetch when the login sequence started.</p>
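+                <p>The order of fetches in a login page sequence can be modeled with the short
Python sketch below.  The step dictionaries, the fetch callback, and the max_steps guard are
hypothetical devices for this example only, and real fetching is replaced by a lookup.</p>

```python
def next_url(step):
    # Choose the next page to fetch from the type of login page just seen.
    kind = step["type"]
    if kind == "redirection":
        return step["redirect_target"]
    if kind == "form":
        return step["form_action"]
    if kind == "link":
        return step["link_target"]
    raise ValueError("unknown login page type: " + kind)


def run_login_sequence(original_url, fetch, max_steps=10):
    """Return the ordered list of URLs fetched for one content page.

    fetch(url) returns a step dict when the fetched page is a login page,
    or None when it is not.
    """
    fetched = []
    url = original_url
    for _ in range(max_steps):
        fetched.append(url)
        step = fetch(url)
        if step is None:
            # The sequence has ended; go back to the original content page.
            if url != original_url:
                fetched.append(original_url)
            return fetched
        url = next_url(step)
    raise RuntimeError("login sequence did not terminate")
```

+                <p>With a redirection from the content page to a login screen, and a form on that
login screen, this model fetches the content URL, the login URL, the form's action URL, and then the
content URL again.</p>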
+                <p>Debugging session authentication problems is best done by looking
at a Simple History report for your Web connection.  The Web connection type records several
+                       types of events which, between them, can give a very clear picture
of what is happening.  These event types are as follows:</p>
+                <br/>
+                <table>
+                    <tr><td><b>Event type</b></td><td><b>Meaning</b></td></tr>
+                    <tr><td>Fetch</td><td>This event records the
fetch of a URL.  The HTTP response is recorded as the response code.  In addition, there are
several negative
+                        code values which the connector generates when the HTTP operation cannot
be done or does not complete.</td></tr>
+                    <tr><td>Begin login</td><td>This event occurs
when the connection detects the transition to a login page sequence.  When a login sequence
is entered, no other
+                        pages from that protected site will be fetched until the login sequence
is completed.</td></tr>
+                    <tr><td>End login</td><td>This event occurs when
the connection detects the transition from a login page sequence back to normal content fetching.
 When this
+                        occurs, simultaneous fetching of pages from the site is re-enabled.</td></tr>
+                </table>
+                <br/>
+                <p>The "Certificates" tab is used in conjunction with SSL, and permits
you to define independent trust certificate stores for URLs matching specified regular expressions.
+                       You can also allow the connection to trust all certificates it sees,
if you so choose.  The "Certificates" tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/web-configure-certificates.PNG" alt="Web Connection,
Certificates tab" width="80%"/>
+                <br/><br/>
+                <p>Type in a URL regular expression, and either check the "Trust everything"
box, or browse for the appropriate certificate authority certificate that you wish to trust.
 (It will
+                       also work to simply trust a server's certificate, but that certificate
may change from time to time, as it expires.)  Click "Add" to add the certificate rule to
the list.</p>
+                <p>When you are done, and you click the "Save" button, you will see
a summary page looking something like this:</p>
+                <br/><br/>
+                <figure src="images/web-status.PNG" alt="Web Status" width="80%"/>
+                <br/><br/>
                 <p>More here later</p>

Added: incubator/lcf/site/src/documentation/resources/images/web-configure-access-credentials-session-form.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-configure-access-credentials-session-form.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-configure-access-credentials-session.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-configure-access-credentials-session.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-configure-access-credentials.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-configure-access-credentials.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-configure-bandwidth.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-configure-bandwidth.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-configure-certificates.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-configure-certificates.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-configure-email.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-configure-email.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-configure-robots.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-configure-robots.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-status.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-status.PNG
    svn:mime-type = application/octet-stream
