incubator-connectors-commits mailing list archives

Subject svn commit: r935779 [1/3] - in /incubator/lcf/site: publish/ publish/images/ src/documentation/content/xdocs/ src/documentation/resources/images/
Date Tue, 20 Apr 2010 00:56:25 GMT
Author: kwright
Date: Tue Apr 20 00:56:24 2010
New Revision: 935779

Add description of web repository connection.

(with props)
    incubator/lcf/site/publish/images/web-configure-access-credentials-session.PNG   (with
    incubator/lcf/site/publish/images/web-configure-access-credentials.PNG   (with props)
    incubator/lcf/site/publish/images/web-configure-bandwidth.PNG   (with props)
    incubator/lcf/site/publish/images/web-configure-certificates.PNG   (with props)
    incubator/lcf/site/publish/images/web-configure-email.PNG   (with props)
    incubator/lcf/site/publish/images/web-configure-robots.PNG   (with props)
    incubator/lcf/site/publish/images/web-status.PNG   (with props)
  (with props)
  (with props)
  (with props)
    incubator/lcf/site/src/documentation/resources/images/web-configure-bandwidth.PNG   (with
 (with props)
    incubator/lcf/site/src/documentation/resources/images/web-configure-email.PNG   (with
    incubator/lcf/site/src/documentation/resources/images/web-configure-robots.PNG   (with
    incubator/lcf/site/src/documentation/resources/images/web-status.PNG   (with props)

Modified: incubator/lcf/site/publish/end-user-documentation.html
--- incubator/lcf/site/publish/end-user-documentation.html (original)
+++ incubator/lcf/site/publish/end-user-documentation.html Tue Apr 20 00:56:24 2010
@@ -1222,8 +1222,178 @@ document.write("Last Published: " + docu
 <p>In other words, the Web connection type is neither as easy to configure, nor as
well-targeted in its separation of links and data, as the RSS connection type.  For that
                        reason, we strongly encourage you to consider using the RSS connection
type for all applications where it might reasonably apply.</p>
+<p>Many users of the Web connection type set up their jobs to run continuously, configuring
them either to refetch documents occasionally, or to never refetch documents
+                       and instead expire them after some period of time.</p>
+<p>A connection of the Web connection type has the following special tabs: "Email",
"Robots", "Bandwidth", "Access Credentials", and "Certificates".  The "Email" tab
+                       looks like this:</p>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Web Connection, Email tab" src="images/web-configure-email.PNG"
+<p>Enter an email address.  This email address will be included in all requests made
by the Web connection, so that webmasters can report any difficulties that their
+                       sites experience as the result of improper throttling, etc.</p>
+<p>This field is mandatory.  While the Web connection type makes no effort to validate
the correctness of the email
+                       field, you will probably want to remain a good web citizen and provide
a valid email address.  Remember that it is very easy for a webmaster to block access to
+                       a crawler that does not seem to be behaving in a polite manner.</p>
+<p>The "Robots" tab looks like this:</p>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Web Connection, Robots tab" src="images/web-configure-robots.PNG"
+<p>Select how the connection will interpret robots.txt.  Remember that you have an
interest in crawling people's sites as politely as possible.</p>
+<p>The "Bandwidth" tab allows you to specify a list of bandwidth rules.  Each rule
has a regular expression matched against a URL's throttle bin.
+                       Throttle bins, in connections of the Web type, are simply the server
name part of the URL.  Each rule allows you to select a maximum bandwidth, number of
+                       connections, and fetch rate.  You can have as many rules as you like;
if a URL matches more than one rule, then the most conservative value will be used.</p>
+<p>This is what the "Bandwidth" tab looks like:</p>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Web Connection, Bandwidth tab" src="images/web-configure-bandwidth.PNG"
+<p>The screen shot shows the tab configured with a setting that is reasonably polite.
 The default value for this tab is blank, meaning that, by default, there is no throttling
+                       whatsoever!  Please do not make the mistake of crawling other people's
sites without adequate politeness parameters in place.</p>
+<p>To add a rule, fill in the regular expression and the appropriate rule limit values,
and click the "Add" button.</p>
+<p>The "Bandwidth" tab is related to the throttles that you can set on the "Throttling"
tab in the following ways:</p>
+<li>The "Bandwidth" tab sets the <b>maximum</b> values, while the "Throttling"
tab sets the <b>average</b> values.</li>
+<li>The "Bandwidth" tab does not affect how documents are scheduled in the queue; it
simply blocks documents until it is safe to go ahead, which will use up a crawler thread
+                           for the entire period that both the wait and the fetch take place.
 The "Throttling" tab affects how often documents are scheduled, so it does not waste threads.</li>
+<p>Because of the above, we suggest that you configure your Web connection using <b>both</b>
the "Bandwidth" <b>and</b> the "Throttling" tabs.  Select maximum
+                       values on the "Bandwidth" tab, and corresponding average value estimates
on the "Throttling" tab.  Remember that a document identifier with the Web connection type
is the
+                       document's URL, and the bin name for that URL is the server name.
 Also, please note that the "Maximum number of connections per JVM" field's default value
of 10 is
+                       unlikely to be correct for connections of the Web type; you should
have at least one available connection per worker thread, for best performance.  Since the
+                       default number of worker threads is 30, you should set this parameter
to at least a value of 30 for normal operation.</p>
+<p>The Web connection type's "Access Credentials" tab describes how pages get authenticated.
 There is support on this tab for both page-based authentication (e.g.
+                       basic auth or all forms of NTLM) and session-based authentication
(which involves the fetch of many pages to establish a logged-in session).  The initial
+                       appearance of the "Access Credentials" tab shows both kinds of authentication:</p>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Web Connection, Access Credentials tab" src="images/web-configure-access-credentials.PNG"
+<p>Each kind of authentication has its own list of rules.</p>
+<p>Specifying a page authentication rule requires simply knowing what URLs are protected,
and what the proper
+                       authentication method and credentials are for those URLs.  Enter a
regular expression describing the protected URLs, and select the proper authentication method.
+                       Fill in the credentials.  Click the "Add" button.</p>
+<p>Specifying a correct session authentication rule usually requires some research.
 A single session-authentication rule usually corresponds to a single session-protected
+                       site.  For that site, you will need to be able to describe the following
for session authentication to function:</p>
+<li>The URLs of pages that are protected by this particular site session security</li>
+<li>How to detect when a page fetch is part of the login sequence</li>
+<li>How to fill in the appropriate forms within the login sequence with appropriate
login information</li>
+<p>The Web connection type labels pages that are part of the login sequence "login
pages", and pages that are protected site content "content pages".  The Web
+                       connection type will not attempt to index login pages.  They are special
pages that have but one purpose: establishing an authenticated session.</p>
+<p>If all this is not complicated enough, your research also has to cover two very
different cases: first, when you are entering the site anew, and second, when you try to fetch
+                       a content page and you are no longer logged in, because your session
has expired.  In both cases, the session authentication rule must be able to properly log
in and
+                       fetch content, because you cannot control when a page will be fetched
or refetched by the Framework.</p>
+<p>You declare a page to be a login page by identifying it both by its URL, and by
what the crawler finds on the page when it fetches it.  For example, some session-protected
+                       sites may redirect you to a login screen when your session expires.
 So, instead of fetching content, you would be fetching a redirection to a specific page.
 You do <b>not</b>
+                       want either the redirection, or the login screen, to be considered
content pages.  The correct way to handle such a setup would be to declare one kind of login
page to consist
+                       of a redirection to the login screen URL, and another kind of login
page to consist of the login screen URL with the appropriate form.  Furthermore, you would
want to supply
+                       the correct login data for the form, and allow the form to be submitted,
and so the login form's target may also need to be declared as a login page.</p>
+<p>The kinds of content that the Web connection type can recognize as a login page
are the following:</p>
+<li>A redirection to a specific URL, as described by a regular expression</li>
+<li>A page that has a form of a particular name on it, as described by a regular expression</li>
+<li>A page that has a link on it to a specific target, as described by a regular expression</li>
+<p>To add a session authentication rule, fill in a regular expression describing the
site pages that are being protected, and click the "Add" button:</p>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Web Connection, Access Credentials tab" src="images/web-configure-access-credentials-session.PNG"
+<p>Note that you can now add login page descriptions to the newly-created rule.  To
add a login page description, enter a URL regular expression, a type of login page, a
+                       target link or form name regular expression, and click the "Add" button.</p>
+<p>When you add a login page of the "form" type, you can then add form fill-in information
to the login page, as seen below:</p>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Web Connection, Access Credentials tab" src="images/web-configure-access-credentials-session-form.PNG"
+<p>Supply a regular expression for the name of the form element you want to set, and
also provide a value.  If you want the value to not be visible in clear text, fill in the
+                       "password" column instead of the "value" column.  You can usually
figure out the name of the form and its elements by viewing the source of the HTML page in
+                       a browser.  When you are done, click the "Add" button.</p>
+<p>Form data that is not specified will be posted with the default value determined
by the HTML of the page.  The Web connection type is unable, at this time, to execute
+                       Javascript, and therefore you may need to fill out some form values
that are filled in by Javascript in order to get the form to post in a useful way.  If you
have a form
+                       that relies heavily on Javascript to post properly, you may need considerable
effort and web programming skills to figure out how to get these forms to post properly
+                       with the Web Connector.  Luckily, such obfuscated login screens are
still rare.</p>
+<p>A series of login pages form a "login page sequence" for the site.  For each login
page, the Web connection decides what page to fetch next by what you specified for
+                       the login page criteria.  So, for a redirection to a specific URL,
the next page to be fetched will be that redirected URL.  For a form, the next page fetched
will be the
+                       action page indicated by the specified form.  For a link to a target,
the next page fetched will be the target URL.  When the login page sequence ends, the next
+                       page fetched will be the original content page that the Web
connection was trying to fetch when the login sequence started.</p>
+<p>Debugging session authentication problems is best done by looking at a Simple History
report for your Web connection.  The Web connection type records several
+                       types of events which, between them, can give a very strong picture
of what is happening.  These event types are as follows:</p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+<td><b>Event type</b></td><td><b>Meaning</b></td>
+<td>Fetch</td><td>This event records the fetch of a URL.  The HTTP response
is recorded as the response code.  In addition, there are several negative
+                        code values which the connection generates when the HTTP operation cannot
be done or does not complete.</td>
+<td>Begin login</td><td>This event occurs when the connection detects the
transition to a login page sequence.  When a login sequence is entered, no other
+                        pages from that protected site will be fetched until the login sequence
is completed.</td>
+<td>End login</td><td>This event occurs when the connection detects the
transition from a login page sequence back to normal content fetching.  When this
+                        occurs, simultaneous fetching of pages from the site is re-enabled.</td>
+<p>The "Certificates" tab is used in conjunction with SSL, and permits you to define
independent trust certificate stores for URLs matching specified regular expressions.
+                       You can also allow the connection to trust all certificates it sees,
if you so choose.  The "Certificates" tab looks like this:</p>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Web Connection, Certificates tab" src="images/web-configure-certificates.PNG"
+<p>Type in a URL regular expression, and either check the "Trust everything" box, or
browse for the appropriate certificate authority certificate that you wish to trust.  (It
+                       also works to simply trust a server's certificate, but that certificate
may change from time to time, as it expires.)  Click "Add" to add the certificate rule to
the list.</p>
+<p>When you are done, and you click the "Save" button, you will see a summary page
looking something like this:</p>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Web Status" src="images/web-status.PNG" width="80%"></div>
 <p>More here later</p>
-<a name="N10645"></a><a name="jcifsrepository"></a>
+<a name="N10754"></a><a name="jcifsrepository"></a>
 <h3 class="h4">Windows Share/DFS Repository Connection</h3>
 <p>The Windows Share connection type allows you to access content stored on Windows
shares, even from non-Windows systems.  Also supported are Samba and various
                        third-party Network Attached Storage servers.</p>
@@ -1345,7 +1515,7 @@ document.write("Last Published: " + docu
 <p>The mappings specified here are similar in all respects to the path attribute mapping
setup described above.  If no mappings are present, the file path is converted
                        to a canonical file IRI.  If mappings are present, the conversion
is presumed to produce a valid URL, which can be used to access the document via some
                        variety of Windows Share http server.</p>
-<a name="N10714"></a><a name="jdbcrepository"></a>
+<a name="N10823"></a><a name="jdbcrepository"></a>
 <h3 class="h4">Generic Database Repository Connection</h3>
 <p>The generic database connection type allows you to index content from a database
table, served by one of the following databases:</p>
@@ -1531,22 +1701,22 @@ document.write("Last Published: " + docu
 <p>Enter a desired access token, and click the "Add" button.  You may enter multiple
access tokens.</p>
-<a name="N10844"></a><a name="filenetrepository"></a>
+<a name="N10953"></a><a name="filenetrepository"></a>
 <h3 class="h4">IBM FileNet P8 Repository Connection</h3>
 <p>More here later</p>
-<a name="N1084E"></a><a name="documentumrepository"></a>
+<a name="N1095D"></a><a name="documentumrepository"></a>
 <h3 class="h4">EMC Documentum Repository Connection</h3>
 <p>More here later</p>
-<a name="N10858"></a><a name="livelinkrepository"></a>
+<a name="N10967"></a><a name="livelinkrepository"></a>
 <h3 class="h4">OpenText LiveLink Repository Connection</h3>
 <p>More here later</p>
-<a name="N10862"></a><a name="mexexrepository"></a>
+<a name="N10971"></a><a name="mexexrepository"></a>
 <h3 class="h4">Memex Patriarch Repository Connection</h3>
 <p>More here later</p>
-<a name="N1086C"></a><a name="meridiorepository"></a>
+<a name="N1097B"></a><a name="meridiorepository"></a>
 <h3 class="h4">Autonomy Meridio Repository Connection</h3>
 <p>More here later</p>
-<a name="N10876"></a><a name="sharepointrepository"></a>
+<a name="N10985"></a><a name="sharepointrepository"></a>
 <h3 class="h4">Microsoft SharePoint Repository Connection</h3>
 <p>More here later</p>
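
The "Bandwidth" tab text added in this commit says that each rule's regular expression is matched against a URL's throttle bin (the server name), and that when several rules match, the most conservative value is used. The sketch below is one possible reading of that behavior, not the connector's actual implementation. The names `BandwidthRule`, `Limits`, `throttle_bin`, and `effective_limits` are hypothetical, as are the units. It also assumes that "most conservative" means the smallest value for each limit, and that a rule matches if its expression is found anywhere in the bin name.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class BandwidthRule:
    """Hypothetical model of one row on the "Bandwidth" tab."""
    bin_regex: str                                # matched against the throttle bin
    max_kb_per_sec: Optional[float] = None        # None means "no limit given"
    max_connections: Optional[int] = None
    max_fetches_per_min: Optional[float] = None


@dataclass(frozen=True)
class Limits:
    """Effective limits for one URL; None means unthrottled."""
    max_kb_per_sec: Optional[float]
    max_connections: Optional[int]
    max_fetches_per_min: Optional[float]


def throttle_bin(url: str) -> str:
    """Return the server-name part of the URL, used as the throttle bin."""
    return (urlsplit(url).hostname or "").lower()


def _most_conservative(values: Iterable[Optional[float]]) -> Optional[float]:
    """Assumption: the most conservative limit is the smallest one set."""
    defined = [v for v in values if v is not None]
    return min(defined) if defined else None


def effective_limits(url: str, rules: Iterable[BandwidthRule]) -> Limits:
    """Combine every rule whose regex matches the URL's throttle bin."""
    bin_name = throttle_bin(url)
    # Assumption: an unanchored search counts as a match.
    matching = [r for r in rules if re.search(r.bin_regex, bin_name)]
    return Limits(
        _most_conservative(r.max_kb_per_sec for r in matching),
        _most_conservative(r.max_connections for r in matching),
        _most_conservative(r.max_fetches_per_min for r in matching),
    )
```

With a catch-all rule plus a stricter site-specific rule, a URL on that site gets the lower value of each limit, while a URL with no matching rule comes back with every limit set to `None`. That mirrors the documented default of no throttling at all when the tab is left blank.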
