incubator-connectors-commits mailing list archives

Subject svn commit: r934994 [2/2] - in /incubator/lcf/site: publish/ publish/images/ src/documentation/content/xdocs/ src/documentation/resources/images/
Date Fri, 16 Apr 2010 16:50:25 GMT
Modified: incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml
--- incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml (original)
+++ incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml Fri Apr 16 16:50:24 2010
@@ -518,6 +518,50 @@
         <section id="repositoryconnectiontypes">
             <title>Repository Connection Types</title>
+            <section id="filesystemrepository">
+                <title>Generic File System Repository Connection</title>
+                <p>The generic file system repository connection type was developed
primarily as an example, demonstration, and testing tool, although it can potentially be useful
for indexing local
+                       files that exist on the same machine that Lucene Connectors Framework
is running on.  Bear in mind that there is no support in this connection type for any kind
+                       of security, and the options are somewhat limited.</p>
+                <p>The file system repository connection type provides no configuration
tabs beyond the standard ones.  However, jobs created using a file-system-type repository
+                       have two tabs in addition to the standard repertoire: the "Hop Filters"
tab, and the "Paths" tab.</p>
+                <p>The "Hop Filters" tab allows you to restrict the document set by
the number of child hops from the path root.  While this is not terribly interesting in the
case of a file
+                       system, the same basic functionality is also used in the web connector,
where it is a more important feature.  The file system connection type gives you a way to
+                       see how this feature works, in a more predictable environment:</p>
+                <br/><br/>
+                <figure src="images/filesystem-job-hopcount.PNG" alt="File System Connection,
Hop Filters tab" width="80%"/>
+                <br/><br/>
+                <p>In the case of the file system connection type, there is only one
variety of relationship between documents, which is called a "child" relationship.  If you
want to
+                       restrict the document set by how far away a document is from the path
root, enter the maximum allowed number of hops in the text box.  Leaving the box blank
+                       indicates that no such filtering will take place.</p>
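The hop filter described above can be sketched in a few lines of Python. This is only an illustration, not the Framework's implementation: the function names `child_hops` and `within_hop_limit` are hypothetical, and the sketch assumes POSIX-style paths and that each directory level below the path root counts as one "child" hop.

```python
from pathlib import PurePosixPath
from typing import Optional


def child_hops(root: str, document: str) -> int:
    """Count "child" hops from the path root down to a document.

    Assumption: the root itself is zero hops away and each path
    component below it adds one hop.  Raises ValueError if the
    document is not located under the root.
    """
    relative = PurePosixPath(document).relative_to(PurePosixPath(root))
    return len(relative.parts)


def within_hop_limit(root: str, document: str, max_hops: Optional[int]) -> bool:
    """Return True if the document passes the hop filter.

    max_hops=None models a blank text box: no filtering takes place.
    """
    if max_hops is None:
        return True
    return max_hops >= child_hops(root, document)
```

Under these assumptions, with a root of /data/docs the document /data/docs/a/b.txt is two hops away, so it passes a limit of 2 but not a limit of 1.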
+                <p>On this same tab, you can tell the Framework what to do should there
be changes in the distance from the root to a document.  The choice "Delete unreachable
+                       documents" requires the Framework to recalculate the distance to every
potentially affected document whenever a change takes place.  This may require
+                       expensive bookkeeping, however, so you also have the option of ignoring
such changes.  There are two varieties of this latter option - you can ignore the changes
+                       for now, with the option of turning back on the aggressive bookkeeping
at a later time, or you can decide not to ever allow changes to propagate, in which case
+                       the Framework will discard the necessary bookkeeping information permanently.</p>
+                <p>The "Paths" tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/filesystem-job-paths.PNG" alt="File System Connection,
Paths tab" width="80%"/>
+                <br/><br/>
+                <p>This tab allows you to type in a set of paths which function as
the roots of the crawl.  For each desired path, type in the path and click the "Add" button
to add it to
+                       the list.  The form of the path you type in obviously needs to be
meaningful for the operating system the Framework is running on.</p>
+                <p>Each root path has a set of rules which determines whether a document
is included or not in the set for the job.  Once you have added the root path to the list,
+                       you may then add rules to it.  Each rule has a match expression, an indication
of whether the rule is intended to match files or directories, and an action (include or exclude).
+                       Rules are evaluated from top to bottom, and the first rule that matches
the file name is the one that is chosen.  To add a rule, select the desired pulldowns, type
+                       a match file specification (e.g. "*.txt"), and click the "Add" button.</p>
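The first-match rule evaluation described above might look like the following Python sketch. The `Rule` type and `is_included` function are hypothetical names, glob-style case-sensitive matching is an assumption, and so is the choice to leave out a name that matches no rule; the text above does not say what the Framework does in that case.

```python
from fnmatch import fnmatchcase
from typing import List, NamedTuple


class Rule(NamedTuple):
    pattern: str  # match expression, e.g. "*.txt"
    kind: str     # "file" or "directory"
    action: str   # "include" or "exclude"


def is_included(name: str, kind: str, rules: List[Rule]) -> bool:
    """Apply rules from top to bottom; the first rule that matches decides."""
    for rule in rules:
        if rule.kind == kind and fnmatchcase(name, rule.pattern):
            return rule.action == "include"
    # Assumption: a name that matches no rule is not part of the job.
    return False


# Example rule list for one root path (illustrative values only).
example_rules = [
    Rule("*.tmp", "file", "exclude"),
    Rule("*", "file", "include"),
    Rule("*", "directory", "include"),
]
```

Because the first match wins, order matters: with `example_rules`, "scratch.tmp" is excluded, but if the first two rules were swapped the catch-all include rule would match first and the file would be included.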
+            </section>
+            <section id="rssrepository">
+                <title>Generic RSS Repository Connection</title>
+                <p>More here later</p>
+            </section>
+            <section id="webrepository">
+                <title>Generic Web Repository Connection</title>
+                <p>More here later</p>
+            </section>
             <section id="jcifsrepository">
                 <title>Windows Share/DFS Repository Connection</title>
                 <p>The Windows Share connection type allows you to access content stored
on Windows shares, even from non-Windows systems.  Also supported are Samba and various
@@ -612,6 +656,86 @@
+            <section id="jdbcrepository">
+                <title>Generic Database Repository Connection</title>
+                <p>The generic database connection type allows you to index content
from a database table, served by one of the following databases:</p>
+                <br/>
+                <ul>
+                    <li>Postgresql (via a Postgresql JDBC driver)</li>
+                    <li>SQL Server (via the JTDS JDBC driver)</li>
+                    <li>Oracle (via the Oracle JDBC driver)</li>
+                    <li>Sybase (via the JTDS JDBC driver)</li>
+                </ul>
+                <br/>
+                <p>This connection type <b>cannot</b> be configured to
work with other databases without software changes.  Depending on your particular
+                       installation, some of these options may not be available.</p>
+                <p>The generic database connection type currently has no per-document
notion of security.  It is possible to set document security for all documents specified by a
+                       given job.  Since this form of security requires you to know what
the actual access tokens are, you must have detailed knowledge of the authority connection you
+                       intend to use, and what sorts of access tokens it produces.</p>
+                <p>The generic database connection type provides three additional tabs
to the repository connection editing screen: the "Database Type" tab, the "Server" tab, and the
+                       "Credentials" tab.  The "Database Type" tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/jdbc-configure-database-type.PNG" alt="Generic Database
Connection, Database Type tab" width="80%"/>
+                <br/><br/>
+                <p>Select the kind of database you want to connect to, from the pulldown.</p>
+                <p>The "Server" tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/jdbc-configure-server.PNG" alt="Generic Database Connection,
Server tab" width="80%"/>
+                <br/><br/>
+                <p>The server name and port must be provided in the "Database host
and port" field.  For example, for Oracle, the standard Oracle installation uses port 1521,
so you would
+                       enter something like "my-oracle-server:1521" for this field.  Postgresql
uses port 5432 by default, so "my-postgresql-server:5432" would be required.  SQL Server's
+                       standard port is 1433, so use "my-sql-server:1433".</p>
+                <p>The service name or instance name field describes which instance
and database to connect to.  For Oracle or Postgresql, provide just the database name.  For
SQL Server, use
+                       "my-instance-name/my-database-name".  For SQL Server using the default
instance, use just the database name.</p>
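To make the two field formats concrete, here is a small Python sketch that splits them into their parts. It is not how the connector parses these fields (that is not shown here); `split_host_port` and `split_instance_database` are hypothetical helper names used only to illustrate the expected shapes of the values.

```python
from typing import Optional, Tuple


def split_host_port(value: str) -> Tuple[str, int]:
    """Split a "Database host and port" value such as "my-oracle-server:1521"."""
    host, _, port = value.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError("expected a value of the form host:port")
    return host, int(port)


def split_instance_database(value: str) -> Tuple[Optional[str], str]:
    """Split an instance/database value.

    "my-instance-name/my-database-name" yields both parts; a bare
    database name (default instance, or Oracle/Postgresql) yields
    None for the instance.
    """
    instance, _, database = value.rpartition("/")
    return (instance or None), database
```

For example, "my-sql-server:1433" splits into the host "my-sql-server" and the port 1433, while "my-database-name" on its own is treated as a database on the default instance.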
+                <p>The "Credentials" tab is straightforward:</p>
+                <br/><br/>
+                <figure src="images/jdbc-configure-credentials.PNG" alt="Generic Database
Connection, Credentials tab" width="80%"/>
+                <br/><br/>
+                <p>Enter the database user credentials.</p>
+                <p>After you click the "Save" button, you will see a connection summary
screen, which might look something like this:</p>
+                <br/><br/>
+                <figure src="images/jdbc-status.PNG" alt="Generic Database Status" width="80%"/>
+                <br/><br/>
+                <p>Note that in this example, the generic database connection is not
properly authenticated, which is leading to an error status message instead of "Connection [...]</p>
+                <p>When you configure a job to use a repository connection of the generic
database type, several additional tabs are presented.  These are, in order, "Queries", and [...]</p>
+                <p>The "Queries" tab looks something like this:</p>
+                <br/><br/>
+                <figure src="images/jdbc-job-queries.PNG" alt="Generic Database Job, Queries
tab" width="80%"/>
+                <br/><br/>
+                <p>You must supply at least two queries.  (A third query is optional.)
 The purpose of these queries is to obtain the data needed for the database to be properly
+                       crawled.  But in order for you to write these queries, you must make some decisions
first.  Basically, you need to figure out how best to map the constructs within your database
+                       to the requirements of the Framework.</p>
+                <br/>
+                <ul>
+                    <li>Obtain a list of document identifiers corresponding to changes
and additions that occurred within a specified time window (see below)</li>
+                    <li>Given a set of document identifiers, find the corresponding
version strings (see below)</li>
+                    <li>Given a set of document identifiers and version strings, find
information about the document, consisting of the document's data and access URL</li>
+                </ul>
+                <br/>
+                <p>The Framework uses a unique document identifier to describe every
document within the confines of a defined repository connection.  This document identifier
is used
+                       as a primary key to locate information about the document.  When you
set up a generic-database-type job, the database you are connecting to must have a similar
+                       concept.  If you pick the wrong thing for a document identifier, at
the very least you could find that the crawler runs very slowly.</p>
+                <p>The query that obtains the list of document identifiers for the changes
that occurred over the given time frame must return <b>at least</b> all such changes.  It is
+                        acceptable (although not ideal) for the returned list to be bigger
than that.</p>
+                <p>If you want your database connection to function in an incremental
manner, you must also come up with the format of a "version string".  This string is used
by the 
+                       Framework to determine if a document has changed.  It must change
whenever anything that might affect the document's indexing changes.  (It is not a problem
+                       if it changes for other reasons, as long as it fulfills that principle.)</p>
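One way to satisfy the version string requirement is to derive the string from every field that affects indexing. The Python sketch below only illustrates that property; in a real job the version string would come from the database query itself, and the function name `version_string` and the example fields (a timestamp and a title) are assumptions.

```python
import hashlib


def version_string(*fields: str) -> str:
    """Derive a version string that changes whenever any field changes.

    Each field is length-prefixed before hashing so that the field
    boundaries are unambiguous, e.g. ("ab", "c") and ("a", "bc")
    produce different strings.
    """
    digest = hashlib.sha1()
    for field in fields:
        encoded = field.encode("utf-8")
        digest.update(str(len(encoded)).encode("ascii") + b":" + encoded)
    return digest.hexdigest()


# Hypothetical fields: a last-modified timestamp and a title.
before = version_string("1271436624000", "Quarterly report")
after = version_string("1271436625000", "Quarterly report")
```

Here `before` and `after` differ because the timestamp changed, which is what would signal to the Framework that the document needs to be processed again.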
+                <p>The queries you provide have substitutions applied before they are used by
the connector.  The example queries, which are present when the queries tab is first opened
for a
+                       new job, show many of these substitutions in roughly the manner in
which they are intended to be used.  For example, "$(IDCOLUMN)" is replaced in the query by
+                       the column name that the connector expects to contain the document
identifier.  The list of substitution strings is as follows:</p>
+                <br/>
+                <table>
+                    <tr><td><b>String name</b></td><td><b>Meaning/use</b></td></tr>
+                    <tr><td>IDCOLUMN</td><td>The name of an expected
resultset column containing a document identifier</td></tr>
+                    <tr><td>VERSIONCOLUMN</td><td>The name of an
expected resultset column containing a version string</td></tr>
+                    <tr><td>URLCOLUMN</td><td>The name of an expected
resultset column containing a URL</td></tr>
+                    <tr><td>DATACOLUMN</td><td>The name of an expected
resultset column containing document data</td></tr>
+                    <tr><td>STARTTIME</td><td>A query string value
containing a start time in milliseconds since epoch</td></tr>
+                    <tr><td>ENDTIME</td><td>A query string value
containing an end time in milliseconds since epoch</td></tr>
+                    <tr><td>IDLIST</td><td>A query string value containing
a parenthesized list of document identifier values</td></tr>
+                </table>
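The substitution mechanism in the table above can be sketched in Python as follows. This is a minimal illustration, not the connector's code: the `substitute` function, the `documents` table, its `id` and `version` columns, and the substituted values `doc_id` and `doc_version` are all hypothetical.

```python
import re
from typing import Mapping

# Matches substitution strings of the form $(NAME).
_TOKEN = re.compile(r"\$\(([A-Z]+)\)")


def substitute(template: str, values: Mapping[str, str]) -> str:
    """Replace each $(NAME) in the template with values[NAME].

    Raises KeyError if the template uses a name that has no value.
    """
    return _TOKEN.sub(lambda match: values[match.group(1)], template)


# Hypothetical version query: given a set of identifiers, return
# each identifier together with its version string.
template = (
    "SELECT id AS $(IDCOLUMN), version AS $(VERSIONCOLUMN) "
    "FROM documents WHERE id IN $(IDLIST)"
)
query = substitute(
    template,
    {"IDCOLUMN": "doc_id", "VERSIONCOLUMN": "doc_version", "IDLIST": "(1, 2)"},
)
```

Because IDLIST is already a parenthesized list, it can follow `IN` directly in the template. The time-window strings STARTTIME and ENDTIME would be used the same way in the query that finds changed documents.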
+            </section>
             <section id="filenetrepository">
                 <title>IBM FileNet P8 Repository Connection</title>
                 <p>More here later</p>
@@ -622,16 +746,6 @@
                 <p>More here later</p>
-            <section id="filesystemrepository">
-                <title>Generic File System Repository Connection</title>
-                <p>More here later</p>
-            </section>
-            <section id="jdbcrepository">
-                <title>Generic Database Repository Connection</title>
-                <p>More here later</p>
-            </section>
             <section id="livelinkrepository">
                 <title>OpenText LiveLink Repository Connection</title>
                 <p>More here later</p>
@@ -652,16 +766,6 @@
                 <p>More here later</p>
-            <section id="rssrepository">
-                <title>Generic RSS Repository Connection</title>
-                <p>More here later</p>
-            </section>
-            <section id="webrepository">
-                <title>Generic Web Repository Connection</title>
-                <p>More here later</p>
-            </section>

Added: incubator/lcf/site/src/documentation/resources/images/filesystem-job-hopcount.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/filesystem-job-hopcount.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/filesystem-job-paths.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/filesystem-job-paths.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/jdbc-configure-credentials.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/jdbc-configure-credentials.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/jdbc-configure-database-type.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/jdbc-configure-database-type.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/jdbc-configure-server.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/jdbc-configure-server.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/jdbc-job-queries.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/jdbc-job-queries.PNG
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/jdbc-status.PNG
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/jdbc-status.PNG
    svn:mime-type = application/octet-stream
