manifoldcf-commits mailing list archives

Subject svn commit: r935248 [3/3] - in /incubator/lcf/site: publish/ publish/images/ src/documentation/content/xdocs/ src/documentation/resources/images/
Date Sat, 17 Apr 2010 20:28:20 GMT
Modified: incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml
--- incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml (original)
+++ incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml Sat Apr
17 20:28:19 2010
@@ -523,7 +523,9 @@
                 <p>The generic file system repository connection type was developed
primarily as an example, demonstration, and testing tool, although it can potentially be useful
for indexing local
                        files that exist on the same machine that Lucene Connectors Framework
is running on.  Bear in mind that there is no support in this connection type for any kind of
                        security, and the options are somewhat limited.</p>
-                <p>The file system repository connection type provides no configuration
tabs beyond the standard ones.  However, jobs created using a file-system-type repository
+                <p>The file system repository connection type provides no configuration
tabs beyond the standard ones.  However, please consider setting a "Maximum connections per
+                       JVM" value on the "Throttling" tab to at least one per worker thread,
or 30, for best performance.</p>
+                <p>Jobs created using a file-system-type repository connection
                        have two tabs in addition to the standard repertoire: the "Hop Filters"
tab, and the "Paths" tab.</p>
                 <p>The "Hop Filters" tab allows you to restrict the document set by
the number of child hops from the path root.  While this is not terribly interesting in the
case of a file
                        system, the same basic functionality is also used in the web connector,
where it is a more important feature.  The file system connection type gives you a way to
@@ -554,12 +556,169 @@
             <section id="rssrepository">
                 <title>Generic RSS Repository Connection</title>
-                <p>More here later</p>
+                <p>The RSS connection type is specifically designed to crawl RSS feeds.
 While the web connection type can also extract links from RSS feeds, the RSS connection type
+                       differs in the following ways:</p>
+                <br/>
+                <ul>
+                    <li>Links are <b>only</b> extracted from feeds</li>
+                    <li>Feeds themselves are not indexed</li>
+                    <li>There is fine-grained control over how often feeds are refetched,
and they are treated distinctly from documents in this regard</li>
+                    <li>The RSS connection type knows how to carry certain data down
from the feeds to individual documents, as metadata</li>
+                </ul>
+                <br/>
+                <p>Many users of the RSS connection type set up their jobs to run continuously,
configuring their jobs to never refetch documents, but rather to expire them after some interval, such as 30
+                       days.  This model works reasonably well for news, which is what RSS is often
used for.</p>
+                <p>A connection of the RSS connection type has the following special
tabs: "Email", "Robots", "Bandwidth", and "Proxy".  The "Email" tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/rss-configure-email.PNG" alt="RSS Connection, Email
tab" width="80%"/>
+                <br/><br/>
+                <p>Enter an email address.  This email address will be included in
all requests made by the RSS connection, so that webmasters can report any difficulties that
+                       sites experience as the result of improper throttling, etc.</p>
+                <p>This field is mandatory.  While the RSS connection type makes no
effort to validate the correctness of the email
+                       field, you will probably want to remain a good web citizen and provide
a valid email address.  Remember that it is very easy for a webmaster to block access to
+                       a crawler that does not seem to be behaving in a polite manner.</p>
+                <p>The "Robots" tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/rss-configure-robots.PNG" alt="RSS Connection, Robots
tab" width="80%"/>
+                <br/><br/>
+                <p>Select how the connection will interpret robots.txt.  Remember that
you have an interest in crawling people's sites as politely as possible.</p>
+                <p>The "Bandwidth" tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/rss-configure-bandwidth.PNG" alt="RSS Connection,
Bandwidth tab" width="80%"/>
+                <br/><br/>
+                <p>This tab allows you to control the <b>maximum</b> rate
at which the connection fetches data, on a per-server basis, as well as the <b>maximum</b>
fetches per minute,
+                       also per-server.  Finally, the maximum number of socket connections
made per server at any one time is also controllable by this tab.</p>
+                <p>The screen shot displays parameters that are
+                       considered reasonably polite.  The default values for this table are
all blank, meaning that, by default, there is no throttling whatsoever!  Please do not make
the mistake
+                       of crawling other people's sites without adequate politeness parameters
in place.</p>
+                <p>The "Throttle group" parameter allows you to treat multiple RSS-type
connections together, for the purposes of throttling.  All RSS-type connections that have
the same
+                       throttle group name will use the same pool for throttling purposes.</p>
+                <p>The "Bandwidth" tab is related to the throttles that you can set
on the "Throttling" tab in the following ways:</p>
+                <br/>
+                <ul>
+                    <li>The "Bandwidth" tab sets the <b>maximum</b> values,
while the "Throttling" tab sets the <b>average</b> values.</li>
+                    <li>The "Bandwidth" tab does not affect how documents are scheduled
in the queue; it simply blocks documents until it is safe to go ahead, which will use up a
crawler thread
+                           for the entire period that both the wait and the fetch take place.
 The "Throttling" tab affects how often documents are scheduled, so it does not waste threads.</li>
+                </ul>
+                <br/>
+                <p>Because of the above, we suggest that you configure your RSS connection
using <b>both</b> the "Bandwidth" <b>and</b> the "Throttling" tabs.
 Select maximum
+                       values on the "Bandwidth" tab, and corresponding average-value estimates
on the "Throttling" tab.  Remember that a document identifier with the RSS connection type
is the
+                       document's URL, and the bin name for that URL is the server name.
 Also, please note that the "Maximum number of connections per JVM" field's default value
of 10 is
+                       unlikely to be correct for connections of the RSS type; you should
have at least one available connection per worker thread, for best performance.  Since the
+                       default number of worker threads is 30, you should set this parameter
to at least a value of 30 for normal operation.</p>
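The blocking "maximum" semantics of the "Bandwidth" tab can be sketched as follows.  This is an illustrative Python sketch, not ManifoldCF code; the class and method names are hypothetical, but it shows why a worker thread is tied up while it waits for a per-server fetch slot:

```python
from collections import defaultdict

class ServerThrottle:
    """Illustrative sketch (not ManifoldCF internals): enforce a *maximum*
    per-server fetch rate by blocking, the way the "Bandwidth" tab does.
    A worker thread must sleep for wait_time() seconds before fetching."""

    def __init__(self, max_fetches_per_minute):
        self.min_interval = 60.0 / max_fetches_per_minute  # seconds between fetches
        self.last_fetch = defaultdict(lambda: 0.0)         # server -> last fetch time

    def wait_time(self, server, now):
        """Seconds a worker thread must block before fetching from this server."""
        earliest = self.last_fetch[server] + self.min_interval
        return max(0.0, earliest - now)

    def record_fetch(self, server, now):
        self.last_fetch[server] = now

throttle = ServerThrottle(max_fetches_per_minute=12)   # one fetch every 5 seconds
throttle.record_fetch("news.example.com", now=100.0)
# A second fetch attempted 2 seconds later must block for the remaining 3 seconds:
print(throttle.wait_time("news.example.com", now=102.0))  # 3.0
```

The "Throttling" tab, by contrast, spaces out when documents are *scheduled*, so no thread ever has to block in this way.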
+                <p>The "Proxy" tab allows you to specify a proxy that you want to crawl
through.  The RSS connection type supports proxies that are secured with all forms of the NTLM
+                       authentication method.  This is quite typical of large organizations.
 The tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/rss-configure-proxy.PNG" alt="RSS Connection, Proxy
tab" width="80%"/>
+                <br/><br/>
+                <p>Enter the proxy server you will be proxying through in the "Proxy
host" field.  Enter the proxy port in the "Proxy port" field.  If your server is authenticated,
enter the
+                       domain, username, and password in the corresponding fields.  Leave
all fields blank if you want to use no proxy whatsoever.</p>
+                <p>When you save your RSS connection, you should see a status screen
that looks something like this:</p>
+                <br/><br/>
+                <figure src="images/rss-status.PNG" alt="RSS Status" width="80%"/>
+                <br/><br/>
+                <p>Jobs created using connections of the RSS type have the following
additional tabs: "URLs", "Canonicalization", "Mappings", "Time Values", "Security", "Metadata",
+                       and "Dechromed Content".  The "URLs" tab is where you describe the feeds
that are part of the job.  It looks like this:</p>
+                <br/><br/>
+                <figure src="images/rss-job-urls.PNG" alt="RSS job, URLs tab" width="80%"/>
+                <br/><br/>
+                <p>Enter the list of feed URLs you want to crawl, separated by newlines.
 You may also include comments by starting lines with the "#" character.</p>
+                <p>The "Canonicalization" tab controls how the job handles URL canonicalization.
 Canonicalization refers to the fact that many different URLs may all refer to the
+                       same actual resource.  For example, arguments in URLs can often be
reordered, so that <code>a=1&amp;b=2</code> is in fact the same as
+                       <code>b=2&amp;a=1</code>.  Other canonical operations
include removal of session cookies, which some dynamic web sites include in the URL.</p>
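The two canonical operations just described, argument reordering and session-cookie removal, can be sketched in Python.  This is an illustration of the concept, not ManifoldCF's implementation, and it handles only one session-cookie style (JSP's ";jsessionid=") for brevity:

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative sketch of URL canonicalization (not ManifoldCF code):
# reorder query arguments and strip the JSP-style session cookie.
JSP_SESSION = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)

def canonicalize(url, reorder=True, strip_jsp_session=True):
    if strip_jsp_session:
        url = JSP_SESSION.sub("", url)   # remove ";jsessionid=..." path suffix
    scheme, netloc, path, query, frag = urlsplit(url)
    if reorder and query:
        query = urlencode(sorted(parse_qsl(query)))  # a=1&b=2 == b=2&a=1
    return urlunsplit((scheme, netloc, path, query, frag))

print(canonicalize("http://example.com/page;jsessionid=ABC123?b=2&a=1"))
# http://example.com/page?a=1&b=2
```

Under this scheme, the two URL variants from the example above canonicalize to the same string, so the crawler sees them as one resource.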
+                <p>The "Canonicalization" tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/rss-job-canonicalization.PNG" alt="RSS job, Canonicalization
tab" width="80%"/>
+                <br/><br/>
+                <p>The tab displays a list of canonicalization rules.  Each rule consists
of a regular expression (which is matched against a document's URL), and some switch selections.
+                       The switch selections allow you to specify whether arguments are reordered,
or whether certain specific kinds of session cookies are removed.  Specific kinds of
+                       session cookies that are recognized and can be removed are: JSP (Java
application servers), ASP (.NET), PHP, and Broadvision (BV).</p>
+                <p>If a URL matches more than one rule, the first matching rule is
the one selected.</p>
+                <p>To add a rule, enter an appropriate regular expression, and make
your checkbox selections, then click the "Add" button.</p>
+                <p>The "Mappings" tab permits you to change the URL under which documents
that are fetched will get indexed.  This is sometimes useful in an intranet setting because
+                       the crawling server might have open access to content, while the users
may have restricted access through a somewhat different URL.  The tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/rss-job-mappings.PNG" alt="RSS job, Mappings tab" width="80%"/>
+                <br/><br/>
+                <p>The "Mappings" tab uses the same regular expression/replacement
string paradigm as is used by many connection types running under the Framework.
+                       The mappings consist of a list of rules.  Each rule has a match expression,
which is a regular expression where parentheses ("("
+                       and ")") mark sections you are interested in.  These sections are
called "groups" in regular expression parlance.  The replace string consists of constant text
+                       substitutions of the groups from the match, perhaps modified.  For
example, "$(1)" refers to the first group within the match, while "$(1l)" refers to the first
match group
+                       mapped to lower case.  Similarly, "$(1u)" refers to the same characters,
but mapped to upper case.</p>
+                <p>For example, suppose you had a rule which had "http://(.*)/(.*)/"
as a match expression, and "http://$(2)/" as the replace string.  If presented with the path
+                       <code>http://Server/Folder_1/Filename</code>, it would
output the string <code>http://Folder_1/Filename</code>.</p>
+                <p>If more than one rule is present, the rules are all executed in
sequence.  That is, the output of the first rule is modified by the second rule, etc.</p>
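The "$(1)"/"$(1l)"/"$(1u)" replacement paradigm described above can be sketched as follows.  This is an illustrative Python sketch, not the Framework's own code, and the function name is hypothetical:

```python
import re

# Illustrative sketch (not ManifoldCF code) of the mapping-rule paradigm:
# "$(1)" is match group 1, "$(1l)" is group 1 lower-cased, "$(1u)" upper-cased.
GROUP_REF = re.compile(r"\$\((\d+)([lu]?)\)")

def apply_rule(url, match_expr, replace_str):
    m = re.match(match_expr, url)
    if m is None:
        return url                       # a non-matching rule leaves the URL alone
    def expand(ref):
        text = m.group(int(ref.group(1)))
        if ref.group(2) == "l":
            return text.lower()
        if ref.group(2) == "u":
            return text.upper()
        return text
    # Expand group references, then append whatever the match did not consume.
    return GROUP_REF.sub(expand, replace_str) + url[m.end():]

# The example from the text: match "http://(.*)/(.*)/", replace "http://$(2)/"
print(apply_rule("http://Server/Folder_1/Filename", r"http://(.*)/(.*)/", "http://$(2)/"))
# http://Folder_1/Filename
```

Chaining several calls to a function like this, each rule consuming the previous rule's output, gives the in-sequence behavior described above.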
+                <p>To add a rule, fill in the match expression and output string, and
click the "Add" button.</p>
+                <p>The "Time Values" tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/rss-job-time-values.PNG" alt="RSS job, Time Values
tab" width="80%"/>
+                <br/><br/>
+                <p>Fill in the desired time values.  A description of each value is given in the table below:</p>
+                <table>
+                    <tr><td><b>Value</b></td><td><b>Description</b></td></tr>
+                    <tr><td>Feed connect timeout</td><td>How long
to wait, in seconds, before giving up, when trying to connect to a server</td></tr>
+                    <tr><td>Default feed refetch time</td><td>If
a feed specifies no refetch time, this is the time to use instead (in minutes)</td></tr>
+                    <tr><td>Minimum feed refetch time</td><td>Never
refetch feeds faster than this specified time, regardless of what the feed says (in minutes)</td></tr>
+                    <tr><td>Bad feed refetch time</td><td>How long
to wait before trying to refetch a feed that contains parsing errors (in minutes)</td></tr>
+                </table>
+                <p>The "Security" tab allows you to assign access tokens to the documents
indexed with this job.  In order to use it, you must first decide what authority connection
to use
+                       to secure these documents, and what the access tokens from that authority
connection type look like.  The tab itself looks like this:</p>
+                <br/><br/>
+                <figure src="images/rss-job-security.PNG" alt="RSS job, Security tab" width="80%"/>
+                <br/><br/>
+                <p>To add an access token, fill in the text box with the access token
value, and click the "Add" button.  If there are no access tokens, security will be considered
+                       to be "off" for the job.</p>
+                <p>The "Metadata" tab allows you to specify arbitrary metadata to be
indexed along with every document from this job.  Documents from connections of the RSS type
+                       already receive some metadata having to do with the feed that referenced
them.  Specifically:</p>
+                <table>
+                    <tr><td><b>Name</b></td><td><b>Meaning</b></td></tr>
+                    <tr><td>PubDate</td><td>This contains the document
origination time, in milliseconds since Jan 1, 1970.  The date is either obtained from the
feed, or if it is
+                                                                absent, the date of fetch
is included instead.</td></tr>
+                    <tr><td>Source</td><td>This is the name of the
feed that referred to the document.</td></tr>
+                    <tr><td>Title</td><td>This is the title of the
document within the feed.</td></tr>
+                    <tr><td>Category</td><td>This is the category
of the document within the feed.</td></tr>
+                </table>
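Because PubDate is expressed in milliseconds since Jan 1, 1970, downstream indexing code that wants a human-readable timestamp must divide by 1000 before converting.  A minimal illustration (the sample value is arbitrary, not taken from a real feed):

```python
from datetime import datetime, timezone

# PubDate metadata is milliseconds since Jan 1, 1970 (UTC); convert to a
# timestamp by scaling to seconds first.
pubdate_ms = 1271535600000
dt = datetime.fromtimestamp(pubdate_ms / 1000.0, tz=timezone.utc)
print(dt.isoformat())  # 2010-04-17T20:20:00+00:00
```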
+                <p>You can add additional metadata to each document using the "Metadata"
tab.  The tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/rss-job-metadata.PNG" alt="RSS job, Metadata tab" width="80%"/>
+                <br/><br/>
+                <p>Enter the name of the metadata item you want on the left, and its
desired value on the right, and click the "Add" button to add it to the metadata list.</p>
+                <p>The "Dechromed Content" tab allows you to index the description
of the content from the feed, instead of the document's contents.  This is helpful when the
+                       description of the documents in the feeds you are crawling is sufficient
for indexing purposes, and the actual documents are full of navigation clutter or "chrome".
+                       The tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/rss-job-dechromed-content.PNG" alt="RSS job, Dechromed
Content tab" width="80%"/>
+                <br/><br/>
+                <p>Select the mode you want the connection to operate in.</p>
             <section id="webrepository">
                 <title>Generic Web Repository Connection</title>
+                <p>The Web connection type is effectively a reasonably full-featured
web crawler.  It is capable of handling most kinds of authentication (basic, all forms of NTLM,
+                       and session-based), and can extract links from the following kinds
of documents:</p>
+                <br/>
+                <ul>
+                    <li>Text</li>
+                    <li>HTML</li>
+                    <li>Generic XML</li>
+                    <li>RSS feeds</li>
+                </ul>
+                <br/>
+                <p>The Web connection type differs from the RSS connection type in
the following respects:</p>
+                <br/>
+                <ul>
+                    <li>Feeds are indexed, if the output connection accepts them</li>
+                    <li>Links are extracted from all documents, not just feeds</li>
+                    <li>Feeds are treated just like any other kind of document - you
cannot control how often they refetch independently</li>
+                    <li>There is support for limiting crawls based on hop count</li>
+                    <li>There is support for controlling exactly what URLs are considered
part of the set, and which are excluded</li>
+                </ul>
+                <br/>
+                <p>In other words, the Web connection type is neither as easy to configure,
nor as well-targeted in its separation of links and data, as the RSS connection type.  For
+                       this reason, we strongly encourage you to consider using the RSS connection
type for all applications where it might reasonably apply.</p>
                 <p>More here later</p>
             <section id="jcifsrepository">
@@ -587,6 +746,9 @@
                        form, and provide a fully-qualified domain name in the "Domain name"
field.  The user name also should usually be unqualified, e.g. "Administrator" rather than
                        a fully-qualified name.  Sometimes it may work to leave the
"Domain name" field blank, and instead supply a fully-qualified machine name in the "Server"
                        field.  It never works to supply both a domain name <b>and</b>
a fully-qualified server name.</p>
+                <p>Please note that you should probably set the "Maximum number of
connections per JVM" field, on the "Throttling" tab, to a number smaller than the default
value of
+                       10, because Windows is not especially good at handling multithreaded
file requests.  A number less than 5 is likely to perform just as well, with less chance of causing
+                       server-side problems.</p>
                 <p>After you click the "Save" button, you will see a connection summary
screen, which might look something like this:</p>
                 <figure src="images/jcifs-status.PNG" alt="Windows Share Status" width="80%"/>
@@ -605,7 +767,9 @@
                 <p>For each included path, a list of rules is displayed which determines
what folders and documents get included with the job.  These rules
                        will be evaluated from top to bottom, in order.  Whichever rule first
matches a given path is the one that will be used for that path.</p>
                 <p>Each rule describes the path matching criteria.  This consists of
the file specification (e.g. "*.txt"), whether the path is a file or folder name, and whether
a file is
-                       considered indexable or not by the output connection.  The rule also
describes the action to take should the rule be matched: include or exclude.</p>
+                       considered indexable or not by the output connection.  The rule also
describes the action to take should the rule be matched: include or exclude.  The file specification
+                       character "*" is a wildcard which matches zero or more characters,
while the character "?" matches exactly one character.  All other characters must match 
+                       exactly.</p>
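The "*" and "?" wildcard semantics described above behave like classic glob-style file specifications.  As an illustration (this uses Python's fnmatch module, not the connector's own matcher, so treat it as an approximation):

```python
from fnmatch import fnmatchcase

# "*" matches zero or more characters; "?" matches exactly one; everything
# else must match literally.
print(fnmatchcase("report.txt", "*.txt"))     # True
print(fnmatchcase("report.text", "*.txt"))    # False  (".txt" must match literally)
print(fnmatchcase("doc1.txt", "doc?.txt"))    # True   ("?" consumes the "1")
print(fnmatchcase("doc12.txt", "doc?.txt"))   # False  ("?" matches exactly one character)
```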
                 <p>To add a rule for a starting path, select the desired values of
all the pulldowns, type in the desired file criteria, and click the "Add" button.  You may
also insert
                        a new rule above any existing rule, by using one of the "Insert" buttons.</p>
                 <p>The "Security" tab looks like this:</p>
@@ -764,6 +928,14 @@
                     <tr><td>Sybase (10+)</td><td>datetime</td><td><code>DATEADD(ms,
$(STARTTIME), '19700101')</code></td></tr>
+                <p>The "Security" tab simply allows you to add specific access tokens
to all documents indexed with a general database job.  In order for you to know what tokens
+                       to add, you must decide with what authority connection these documents
will be secured, and understand the form of the access tokens used by that authority connection
+                       type.  This is what the "Security" tab looks like:</p>
+                <br/>
+                <br/>
+                <figure src="images/jdbc-job-security.PNG" alt="Generic Database Job,
Security tab" width="80%"/>
+                <br/><br/>
+                <p>Enter a desired access token, and click the "Add" button.  You may
enter multiple access tokens.</p>
             <section id="filenetrepository">

Added: incubator/lcf/site/src/documentation/resources/images/jdbc-job-security.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/jdbc-job-security.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/rss-configure-bandwidth.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/rss-configure-bandwidth.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/rss-configure-email.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/rss-configure-email.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/rss-configure-proxy.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/rss-configure-proxy.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/rss-configure-robots.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/rss-configure-robots.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/rss-job-canonicalization.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/rss-job-canonicalization.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/rss-job-dechromed-content.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/rss-job-dechromed-content.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/rss-job-mappings.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/rss-job-mappings.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/rss-job-metadata.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/rss-job-metadata.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/rss-job-security.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/rss-job-security.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/rss-job-time-values.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/rss-job-time-values.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/rss-job-urls.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/rss-job-urls.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/rss-status.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/rss-status.PNG
    svn:mime-type = application/octet-stream
