incubator-connectors-commits mailing list archives

From kwri...@apache.org
Subject svn commit: r935248 [1/3] - in /incubator/lcf/site: publish/ publish/images/ src/documentation/content/xdocs/ src/documentation/resources/images/
Date Sat, 17 Apr 2010 20:28:20 GMT
Author: kwright
Date: Sat Apr 17 20:28:19 2010
New Revision: 935248

URL: http://svn.apache.org/viewvc?rev=935248&view=rev
Log:
Add missing Security tab description from jdbc connector, and add RSS connector description.

Added:
    incubator/lcf/site/publish/images/jdbc-job-security.PNG   (with props)
    incubator/lcf/site/publish/images/rss-configure-bandwidth.PNG   (with props)
    incubator/lcf/site/publish/images/rss-configure-email.PNG   (with props)
    incubator/lcf/site/publish/images/rss-configure-proxy.PNG   (with props)
    incubator/lcf/site/publish/images/rss-configure-robots.PNG   (with props)
    incubator/lcf/site/publish/images/rss-job-canonicalization.PNG   (with props)
    incubator/lcf/site/publish/images/rss-job-dechromed-content.PNG   (with props)
    incubator/lcf/site/publish/images/rss-job-mappings.PNG   (with props)
    incubator/lcf/site/publish/images/rss-job-metadata.PNG   (with props)
    incubator/lcf/site/publish/images/rss-job-security.PNG   (with props)
    incubator/lcf/site/publish/images/rss-job-time-values.PNG   (with props)
    incubator/lcf/site/publish/images/rss-job-urls.PNG   (with props)
    incubator/lcf/site/publish/images/rss-status.PNG   (with props)
    incubator/lcf/site/src/documentation/resources/images/jdbc-job-security.PNG   (with props)
    incubator/lcf/site/src/documentation/resources/images/rss-configure-bandwidth.PNG   (with props)
    incubator/lcf/site/src/documentation/resources/images/rss-configure-email.PNG   (with props)
    incubator/lcf/site/src/documentation/resources/images/rss-configure-proxy.PNG   (with props)
    incubator/lcf/site/src/documentation/resources/images/rss-configure-robots.PNG   (with props)
    incubator/lcf/site/src/documentation/resources/images/rss-job-canonicalization.PNG   (with props)
    incubator/lcf/site/src/documentation/resources/images/rss-job-dechromed-content.PNG   (with props)
    incubator/lcf/site/src/documentation/resources/images/rss-job-mappings.PNG   (with props)
    incubator/lcf/site/src/documentation/resources/images/rss-job-metadata.PNG   (with props)
    incubator/lcf/site/src/documentation/resources/images/rss-job-security.PNG   (with props)
    incubator/lcf/site/src/documentation/resources/images/rss-job-time-values.PNG   (with props)
    incubator/lcf/site/src/documentation/resources/images/rss-job-urls.PNG   (with props)
    incubator/lcf/site/src/documentation/resources/images/rss-status.PNG   (with props)
Modified:
    incubator/lcf/site/publish/end-user-documentation.html
    incubator/lcf/site/publish/end-user-documentation.pdf
    incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml

Modified: incubator/lcf/site/publish/end-user-documentation.html
URL: http://svn.apache.org/viewvc/incubator/lcf/site/publish/end-user-documentation.html?rev=935248&r1=935247&r2=935248&view=diff
==============================================================================
--- incubator/lcf/site/publish/end-user-documentation.html (original)
+++ incubator/lcf/site/publish/end-user-documentation.html Sat Apr 17 20:28:19 2010
@@ -941,7 +941,9 @@ document.write("Last Published: " + docu
 <p>The generic file system repository connection type was developed primarily as an
example, demonstration, and testing tool, although it can potentially be useful for indexing
local
                        files that exist on the same machine that Lucene Connectors Framework
is running on.  Bear in mind that there is no support in this connection type for any kind
of
                        security, and the options are somewhat limited.</p>
-<p>The file system repository connection type provides no configuration tabs beyond
the standard ones.  However, jobs created using a file-system-type repository connection
+<p>The file system repository connection type provides no configuration tabs beyond
the standard ones.  However, please consider setting a "Maximum connections per
+                       JVM" value on the "Throttling" tab to at least one per worker thread,
or 30, for best performance.</p>
+<p>Jobs created using a file-system-type repository connection
                        have two tabs in addition to the standard repertoire: the "Hop Filters"
tab, and the "Paths" tab.</p>
 <p>The "Hop Filters" tab allows you to restrict the document set by the number of child
hops from the path root.  While this is not terribly interesting in the case of a file
                        system, the same basic functionality is also used in the web connector,
where it is a more important feature.  The file system connection type gives you a way to
see
@@ -973,13 +975,255 @@ document.write("Last Published: " + docu
                        may then add rules to it.  Each rule has a match expression, an indication
of whether the rule is intended to match files or directories, and an action (include or exclude).
                        Rules are evaluated from top to bottom, and the first rule that matches
the file name is the one that is chosen.  To add a rule, select the desired pulldowns, type
in 
                        a match file specification (e.g. "*.txt"), and click the "Add" button.</p>
-<a name="N10485"></a><a name="rssrepository"></a>
+<a name="N10488"></a><a name="rssrepository"></a>
 <h3 class="h4">Generic RSS Repository Connection</h3>
-<p>More here later</p>
-<a name="N1048F"></a><a name="webrepository"></a>
+<p>The RSS connection type is specifically designed to crawl RSS feeds.  While the
web connection type can also extract links from RSS feeds, the RSS connection type
+                       differs in the following ways:</p>
+<br>
+<ul>
+                    
+<li>Links are <b>only</b> extracted from feeds</li>
+                    
+<li>Feeds themselves are not indexed</li>
+                    
+<li>There is fine-grained control over how often feeds are refetched, and they are
treated distinctly from documents in this regard</li>
+                    
+<li>The RSS connection type knows how to carry certain data down from the feeds to
individual documents, as metadata</li>
+                
+</ul>
+<br>
+<p>Many users of the RSS connection type set up their jobs to run continuously, configuring
their jobs to never refetch documents, but rather to expire them after some 30 days.
+                       This model works reasonably well for news, which is what RSS is often
used for.</p>
+<p>A connection of the RSS connection type has the following special tabs: "Email",
"Robots", "Bandwidth", and "Proxy".  The "Email" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS Connection, Email tab" src="images/rss-configure-email.PNG"
width="80%"></div>
+<br>
+<br>
+<p>Enter an email address.  This email address will be included in all requests made
by the RSS connection, so that webmasters can report any difficulties that their
+                       sites experience as the result of improper throttling, etc.</p>
+<p>This field is mandatory.  While the RSS connection type makes no effort to validate
the correctness of the email
+                       field, you will probably want to remain a good web citizen and provide
a valid email address.  Remember that it is very easy for a webmaster to block access to
+                       a crawler that does not seem to be behaving in a polite manner.</p>
+<p>The "Robots" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS Connection, Robots tab" src="images/rss-configure-robots.PNG"
width="80%"></div>
+<br>
+<br>
+<p>Select how the connection will interpret robots.txt.  Remember that you have an
interest in crawling people's sites as politely as is possible.</p>
+<p>The "Bandwidth" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS Connection, Bandwidth tab" src="images/rss-configure-bandwidth.PNG"
width="80%"></div>
+<br>
+<br>
+<p>This tab allows you to control the <b>maximum</b> rate at which the
connection fetches data, on a per-server basis, as well as the <b>maximum</b>
fetches per minute,
+                       also per-server.  Finally, the maximum number of socket connections
made per server at any one time is also controllable by this tab.</p>
+<p>The screen shot displays parameters that are
+                       considered reasonably polite.  The default values for this table are
all blank, meaning that, by default, there is no throttling whatsoever!  Please do not make
the mistake
+                       of crawling other people's sites without adequate politeness parameters
in place.</p>
+<p>The "Throttle group" parameter allows you to treat multiple RSS-type connections
together, for the purposes of throttling.  All RSS-type connections that have the same
+                       throttle group name will use the same pool for throttling purposes.</p>
+<p>The "Bandwidth" tab is related to the throttles that you can set on the "Throttling"
tab in the following ways:</p>
+<br>
+<ul>
+                    
+<li>The "Bandwidth" tab sets the <b>maximum</b> values, while the "Throttling"
tab sets the <b>average</b> values.</li>
+                    
+<li>The "Bandwidth" tab does not affect how documents are scheduled in the queue; it
simply blocks documents until it is safe to go ahead, which will use up a crawler thread
+                           for the entire period that both the wait and the fetch take place.
 The "Throttling" tab affects how often documents are scheduled, so it does not waste threads.</li>
+                
+</ul>
+<br>
+<p>Because of the above, we suggest that you configure your RSS connection using <b>both</b>
the "Bandwidth" <b>and</b> the "Throttling" tabs.  Select maximum
+                       values on the "Bandwidth" tab, and corresponding average value estimates
on the "Throttling" tab.  Remember that a document identifier with the RSS connection type
is the
+                       document's URL, and the bin name for that URL is the server name.
 Also, please note that the "Maximum number of connections per JVM" field's default value
of 10 is
+                       unlikely to be correct for connections of the RSS type; you should
have at least one available connection per worker thread, for best performance.  Since the
+                       default number of worker threads is 30, you should set this parameter
to at least a value of 30 for normal operation.</p>
+<p>The "Proxy" tab allows you to specify a proxy that you want to crawl through.  The
RSS connection type supports proxies that are secured with all forms of the NTLM
+                       authentication method.  This is quite typical of large organizations.
 The tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS Connection, Proxy tab" src="images/rss-configure-proxy.PNG"
width="80%"></div>
+<br>
+<br>
+<p>Enter the proxy server you will be proxying through in the "Proxy host" field. 
Enter the proxy port in the "Proxy port" field.  If your proxy server requires authentication, enter the
+                       domain, username, and password in the corresponding fields.  Leave
all fields blank if you want to use no proxy whatsoever.</p>
+<p>When you save your RSS connection, you should see a status screen that looks something
like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS Status" src="images/rss-status.PNG" width="80%"></div>
+<br>
+<br>
+<p>Jobs created using connections of the RSS type have the following additional tabs:
"URLs", "Canonicalization", "Mappings", "Time Values", "Security", "Metadata", and
+                       "Dechromed Content".  The URLs tab is where you describe the feeds
that are part of the job.  It looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS job, URLs tab" src="images/rss-job-urls.PNG" width="80%"></div>
+<br>
+<br>
+<p>Enter the list of feed URLs you want to crawl, separated by newlines.  You may also
include comments by starting a line with the "#" character.</p>
+<p>The "Canonicalization" tab controls how the job handles URL canonicalization.  Canonicalization
refers to the fact that many different URLs may all refer to the
+                       same actual resource.  For example, arguments in URLs can often be
reordered, so that <span class="codefrag">a=1&amp;b=2</span> is in fact the
same as
+                       <span class="codefrag">b=2&amp;a=1</span>.  Other
canonical operations include removal of session cookies, which some dynamic web sites include
in the URL.</p>
+<p>The "Canonicalization" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS job, Canonicalization tab" src="images/rss-job-canonicalization.PNG"
width="80%"></div>
+<br>
+<br>
+<p>The tab displays a list of canonicalization rules.  Each rule consists of a regular
expression (which is matched against a document's URL), and some switch selections.
+                       The switch selections allow you to specify whether arguments are reordered,
or whether certain specific kinds of session cookies are removed.  Specific kinds of
+                       session cookies that are recognized and can be removed are: JSP (Java
application servers), ASP (.NET), PHP, and Broadvision (BV).</p>
+<p>If a URL matches more than one rule, the first matching rule is the one selected.</p>
+<p>To add a rule, enter an appropriate regular expression, and make your checkbox selections,
then click the "Add" button.</p>
+<p>The "Mappings" tab permits you to change the URL under which documents that are
fetched will get indexed.  This is sometimes useful in an intranet setting because
+                       the crawling server might have open access to content, while the users
may have restricted access through a somewhat different URL.  The tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS job, Mappings tab" src="images/rss-job-mappings.PNG"
width="80%"></div>
+<br>
+<br>
+<p>The "Mappings" tab uses the same regular expression/replacement string paradigm
as is used by many connection types running under the Framework.
+                       The mappings consist of a list of rules.  Each rule has a match expression,
which is a regular expression where parentheses ("("
+                       and ")") mark sections you are interested in.  These sections are
called "groups" in regular expression parlance.  The replace string consists of constant text
plus
+                       substitutions of the groups from the match, perhaps modified.  For
example, "$(1)" refers to the first group within the match, while "$(1l)" refers to the first
match group
+                       mapped to lower case.  Similarly, "$(1u)" refers to the same characters,
but mapped to upper case.</p>
+<p>For example, suppose you had a rule which had "http://(.*)/(.*)/" as a match expression,
and "http://$(2)/" as the replace string.  If presented with the path
+                       <span class="codefrag">http://Server/Folder_1/Filename</span>,
it would output the string <span class="codefrag">http://Folder_1/Filename</span>.</p>
+<p>If more than one rule is present, the rules are all executed in sequence.  That
is, the output of the first rule is modified by the second rule, etc.</p>
+<p>To add a rule, fill in the match expression and output string, and click the "Add"
button.</p>
+<p>The "Time Values" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS job, Time Values tab" src="images/rss-job-time-values.PNG"
width="80%"></div>
+<br>
+<br>
+<p>Fill in the desired time values.  A description of each value is below.</p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+                    
+<tr>
+<td><b>Value</b></td><td><b>Description</b></td>
+</tr>
+                    
+<tr>
+<td>Feed connect timeout</td><td>How long to wait, in seconds, before giving
up, when trying to connect to a server</td>
+</tr>
+                    
+<tr>
+<td>Default feed refetch time</td><td>If a feed specifies no refetch time,
this is the time to use instead (in minutes)</td>
+</tr>
+                    
+<tr>
+<td>Minimum feed refetch time</td><td>Never refetch feeds faster than this
specified time, regardless of what the feed says (in minutes)</td>
+</tr>
+                    
+<tr>
+<td>Bad feed refetch time</td><td>How long to wait before trying to refetch
a feed that contains parsing errors (in minutes, empty is infinity)</td>
+</tr>
+                
+</table>
+<p>The "Security" tab allows you to assign access tokens to the documents indexed with
this job.  In order to use it, you must first decide what authority connection to use
+                       to secure these documents, and what the access tokens from that authority
connection type look like.  The tab itself looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS job, Security tab" src="images/rss-job-security.PNG"
width="80%"></div>
+<br>
+<br>
+<p>To add an access token, fill in the text box with the access token value, and click
the "Add" button.  If there are no access tokens, security will be considered
+                       to be "off" for the job.</p>
+<p>The "Metadata" tab allows you to specify arbitrary metadata to be indexed along
with every document from this job.  Documents from connections of the RSS type
+                       already receive some metadata having to do with the feed that referenced
them.  Specifically:</p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+                    
+<tr>
+<td><b>Name</b></td><td><b>Meaning</b></td>
+</tr>
+                    
+<tr>
+<td>PubDate</td><td>This contains the document origination time, in milliseconds
since Jan 1, 1970.  The date is either obtained from the feed, or if it is
+                                                                absent, the date of fetch
is included instead.</td>
+</tr>
+                    
+<tr>
+<td>Source</td><td>This is the name of the feed that referred to the document.</td>
+</tr>
+                    
+<tr>
+<td>Title</td><td>This is the title of the document within the feed.</td>
+</tr>
+                    
+<tr>
+<td>Category</td><td>This is the category of the document within the feed.</td>
+</tr>
+                
+</table>
+<p>You can add additional metadata to each document using the "Metadata" tab.  The
tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS job, Metadata tab" src="images/rss-job-metadata.PNG"
width="80%"></div>
+<br>
+<br>
+<p>Enter the name of the metadata item you want on the left, and its desired value
on the right, and click the "Add" button to add it to the metadata list.</p>
+<p>The "Dechromed Content" tab allows you to index the description of the content from
the feed, instead of the document's contents.  This is helpful when the
+                       description of the documents in the feeds you are crawling is sufficient
for indexing purposes, and the actual documents are full of navigation clutter or "chrome".
+                       The tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS job, Dechromed Content tab" src="images/rss-job-dechromed-content.PNG"
width="80%"></div>
+<br>
+<br>
+<p>Select the mode you want the connection to operate in.</p>
+<a name="N10609"></a><a name="webrepository"></a>
 <h3 class="h4">Generic Web Repository Connection</h3>
+<p>The Web connection type is effectively a reasonably full-featured web crawler. 
It is capable of handling most kinds of authentication (basic, all forms of NTLM,
+                       and session-based), and can extract links from the following kinds
of documents:</p>
+<br>
+<ul>
+                    
+<li>Text</li>
+                    
+<li>HTML</li>
+                    
+<li>Generic XML</li>
+                    
+<li>RSS feeds</li>
+                
+</ul>
+<br>
+<p>The Web connection type differs from the RSS connection type in the following respects:</p>
+<br>
+<ul>
+                    
+<li>Feeds are indexed, if the output connection accepts them</li>
+                    
+<li>Links are extracted from all documents, not just feeds</li>
+                    
+<li>Feeds are treated just like any other kind of document - you cannot control how
often they refetch independently</li>
+                    
+<li>There is support for limiting crawls based on hop count</li>
+                    
+<li>There is support for controlling exactly what URLs are considered part of the set,
and which are excluded</li>
+                
+</ul>
+<br>
+<p>In other words, the Web connection type is neither as easy to configure, nor as
well-targeted in its separation of links and data, as the RSS connection type.  For that
+                       reason, we strongly encourage you to consider using the RSS connection
type for all applications where it might reasonably apply.</p>
 <p>More here later</p>
-<a name="N10499"></a><a name="jcifsrepository"></a>
+<a name="N10645"></a><a name="jcifsrepository"></a>
 <h3 class="h4">Windows Share/DFS Repository Connection</h3>
 <p>The Windows Share connection type allows you to access content stored on Windows
shares, even from non-Windows systems.  Also supported are Samba and various
                        third-party Network Attached Storage servers.</p>
@@ -1007,6 +1251,9 @@ document.write("Last Published: " + docu
                        form, and provide a fully-qualified domain name in the "Domain name"
field.  The user name also should usually be unqualified, e.g. "Administrator" rather than
                        "Administrator@mydomain.com".  Sometimes it may work to leave the
"Domain name" field blank, and instead supply a fully-qualified machine name in the "Server"
                        field.  It never works to supply both a domain name <b>and</b>
a fully-qualified server name.</p>
+<p>Please note that you should probably set the "Maximum number of connections per
JVM" field, on the "Throttling" tab, to a number smaller than the default value of
+                       10, because Windows is not especially good at handling multithreaded
file requests.  A number less than 5 is likely to perform as well with less chance of causing
+                       server-side problems.</p>
 <p>After you click the "Save" button, you will see a connection summary screen, which
might look something like this:</p>
 <br>
 <br>
@@ -1031,7 +1278,9 @@ document.write("Last Published: " + docu
 <p>For each included path, a list of rules is displayed which determines what folders
and documents get included with the job.  These rules
                        will be evaluated from top to bottom, in order.  Whichever rule first
matches a given path is the one that will be used for that path.</p>
 <p>Each rule describes the path matching criteria.  This consists of the file specification
(e.g. "*.txt"), whether the path is a file or folder name, and whether a file is
-                       considered indexable or not by the output connection.  The rule also
describes the action to take should the rule be matched: include or exclude.</p>
+                       considered indexable or not by the output connection.  The rule also
describes the action to take should the rule be matched: include or exclude.  The file specification
+                       character "*" is a wildcard which matches zero or more characters,
while the character "?" matches exactly one character.  All other characters must match 
+                       exactly.</p>
 <p>To add a rule for a starting path, select the desired values of all the pulldowns,
type in the desired file criteria, and click the "Add" button.  You may also insert
                        a new rule above any existing rule, by using one of the "Insert" buttons.</p>
 <p>The "Security" tab looks like this:</p>
@@ -1096,7 +1345,7 @@ document.write("Last Published: " + docu
 <p>The mappings specified here are similar in all respects to the path attribute mapping
setup described above.  If no mappings are present, the file path is converted
                        to a canonical file IRI.  If mappings are present, the conversion
is presumed to produce a valid URL, which can be used to access the document via some
                        variety of Windows Share http server.</p>
-<a name="N10565"></a><a name="jdbcrepository"></a>
+<a name="N10714"></a><a name="jdbcrepository"></a>
 <h3 class="h4">Generic Database Repository Connection</h3>
 <p>The generic database connection type allows you to index content from a database
table, served by one of the following databases:</p>
 <br>
@@ -1272,22 +1521,32 @@ document.write("Last Published: " + docu
                 
 </table>
 <br>
-<a name="N10683"></a><a name="filenetrepository"></a>
+<p>The "Security" tab simply allows you to add specific access tokens to all documents
indexed with a general database job.  In order for you to know what tokens
+                       to add, you must decide with what authority connection these documents
will be secured, and understand the form of the access tokens used by that authority connection
+                       type.  This is what the "Security" tab looks like:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Generic Database Job, Security tab" src="images/jdbc-job-security.PNG"
width="80%"></div>
+<br>
+<br>
+<p>Enter a desired access token, and click the "Add" button.  You may enter multiple
access tokens.</p>
+<a name="N10844"></a><a name="filenetrepository"></a>
 <h3 class="h4">IBM FileNet P8 Repository Connection</h3>
 <p>More here later</p>
-<a name="N1068D"></a><a name="documentumrepository"></a>
+<a name="N1084E"></a><a name="documentumrepository"></a>
 <h3 class="h4">EMC Documentum Repository Connection</h3>
 <p>More here later</p>
-<a name="N10697"></a><a name="livelinkrepository"></a>
+<a name="N10858"></a><a name="livelinkrepository"></a>
 <h3 class="h4">OpenText LiveLink Repository Connection</h3>
 <p>More here later</p>
-<a name="N106A1"></a><a name="mexexrepository"></a>
+<a name="N10862"></a><a name="mexexrepository"></a>
 <h3 class="h4">Memex Patriarch Repository Connection</h3>
 <p>More here later</p>
-<a name="N106AB"></a><a name="meridiorepository"></a>
+<a name="N1086C"></a><a name="meridiorepository"></a>
 <h3 class="h4">Autonomy Meridio Repository Connection</h3>
 <p>More here later</p>
-<a name="N106B5"></a><a name="sharepointrepository"></a>
+<a name="N10876"></a><a name="sharepointrepository"></a>
 <h3 class="h4">Microsoft SharePoint Repository Connection</h3>
 <p>More here later</p>
 </div>
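
Editorial note on the "Mappings" tab text added in this commit: the match expression / replace string behaviour it describes can be sketched roughly as below. This is only an illustration in Python, not the Framework's code. The function names apply_mapping and apply_rules are hypothetical, Python's re module stands in for whatever regular expression engine the Framework uses, and two behaviours are assumptions: that only the matched portion of the value is replaced (which is consistent with the documented example output), and that a value which does not match a rule is passed through unchanged.

```python
import re

# Recognizes $(1), $(1l) and $(1u) style group references in a replace string.
_GROUP_REF = re.compile(r"\$\((\d+)([lu]?)\)")


def apply_mapping(match_expr, replace_str, value):
    """Apply one mapping rule to value (hypothetical helper).

    Assumption: only the matched portion of value is replaced, and a
    non-matching value is returned unchanged.
    """
    m = re.search(match_expr, value)
    if m is None:
        return value

    def expand(ref):
        # Look up the referenced group; treat an unmatched group as empty.
        text = m.group(int(ref.group(1))) or ""
        if ref.group(2) == "l":
            return text.lower()
        if ref.group(2) == "u":
            return text.upper()
        return text

    replaced = _GROUP_REF.sub(expand, replace_str)
    return value[:m.start()] + replaced + value[m.end():]


def apply_rules(rules, value):
    """Run every (match_expr, replace_str) rule in sequence, feeding the
    output of each rule into the next one."""
    for match_expr, replace_str in rules:
        value = apply_mapping(match_expr, replace_str, value)
    return value
```

Under these assumptions, apply_mapping("http://(.*)/(.*)/", "http://$(2)/", "http://Server/Folder_1/Filename") reproduces the documented result, because the greedy groups capture "Server" and "Folder_1" and the trailing "Filename" lies outside the matched portion.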

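Editorial note on the "Canonicalization" tab text added in this commit: the argument reordering it mentions (treating a=1&b=2 and b=2&a=1 as the same resource) can be illustrated with a small Python sketch. The function name reorder_arguments and the example.com URLs are hypothetical, and sorting arguments alphabetically is just one assumed way of picking a canonical order; the commit does not say how the connector orders them. Session cookie removal is not covered here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def reorder_arguments(url):
    """Return url with its query arguments in sorted order (hypothetical helper).

    Assumption: alphabetical order is used as the canonical order.
    Note that urlencode may normalize percent-encoding of the values.
    """
    parts = urlsplit(url)
    args = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(args)))
```

Two URLs that differ only in argument order then compare equal after canonicalization, which is what lets a crawler avoid fetching the same document twice.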

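Editorial note on the file specification text added to the Windows Share section in this commit: the wildcard rules ("*" matches zero or more characters, "?" matches exactly one, everything else must match exactly) together with top-to-bottom, first-match-wins rule evaluation can be sketched as follows. This is an illustration only. The names matches and choose_action are hypothetical, the rules are simplified to a (file specification, action) pair without the file/folder and indexability criteria, matching is assumed to be case-sensitive, and returning None when no rule matches is an assumption, since the commit does not state the default.

```python
import re


def matches(spec, name):
    """Return True if name matches the file specification spec.

    "*" matches zero or more characters, "?" matches exactly one
    character, and every other character must match literally.
    """
    pieces = []
    for ch in spec:
        if ch == "*":
            pieces.append(".*")
        elif ch == "?":
            pieces.append(".")
        else:
            pieces.append(re.escape(ch))
    return re.fullmatch("".join(pieces), name, re.DOTALL) is not None


def choose_action(rules, name):
    """Evaluate (spec, action) rules from top to bottom and return the
    action of the first rule that matches, or None if none match
    (the no-match default is an assumption)."""
    for spec, action in rules:
        if matches(spec, name):
            return action
    return None
```

For instance, with the rules [("*.tmp", "exclude"), ("*.t?t", "include"), ("*", "exclude")], the name "notes.txt" is included by the second rule, and the catch-all third rule is never consulted for it.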
