incubator-connectors-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Lucene Connector Framework > How to Write a Repository Connector
Date Fri, 20 Aug 2010 08:28:00 GMT
Space: Lucene Connector Framework (https://cwiki.apache.org/confluence/display/CONNECTORS)
Page: How to Write a Repository Connector (https://cwiki.apache.org/confluence/display/CONNECTORS/How+to+Write+a+Repository+Connector)


Edited by Karl Wright:
---------------------------------------------------------------------
h1. Writing a Repository Connector

A repository connector furnishes the mechanism for obtaining documents, metadata, and authority
tokens from a repository.  The documents are expected to be handed to an output connector
(described elsewhere) for ingestion into some other back-end repository.

As is the case with all connectors under the ACF umbrella, a repository connector consists of
only one part:

* A class implementing an interface (in this case, _org.apache.lcf.crawler.interfaces.IRepositoryConnector_)

h3. Key concepts

The repository connector abstraction makes use of, or introduces, the following concepts:

|| Concept || What it is ||
| Configuration parameters | A hierarchical structure, internally represented as an XML document,
which describes a specific configuration of a specific repository connector, i.e. *how* the
connector should do its job; see _org.apache.lcf.core.interfaces.ConfigParams_ |
| Repository connection | A repository connector instance that has been furnished with configuration
data |
| Document identifier | An arbitrary identifier, whose meaning is determined only within the
context of a specific repository connector, which the connector uses to describe a document
within a repository |
| Document URI | The unique URI (or, in some cases, file IRI) of a document, which is meant
to be displayed in search engine results as the link to the document |
| Repository document | An object that describes a document's contents, including raw document
data (as a stream), metadata (as either strings or streams), and access tokens; see _org.apache.lcf.agents.interfaces.RepositoryDocument_
|
| Access token | A string, which is only meaningful in the context of a specific authority,
that describes a quantum of authorization for a user |
| Connection management/threading/pooling model | How an individual repository connector class
instance is managed and used |
| Activity infrastructure | The framework API provided to specific methods allowing those
methods to perform specific actions within the framework, e.g. recording the activity history;
see _org.apache.lcf.crawler.interfaces.IVersionActivity_, _org.apache.lcf.crawler.interfaces.IProcessActivity_,
and _org.apache.lcf.crawler.interfaces.ISeedingActivity_ |
| Document specification | A hierarchical structure, internally represented as an XML document,
which describes *what* a specific repository connector should do in the context of a specific
job; see _org.apache.lcf.crawler.interfaces.DocumentSpecification_ |
| Document version string | A simple string, used for comparison purposes, that allows ACF
to figure out if a fetch or ingestion operation needs to be repeated as a result of changes
to the document specification in effect for a document, or because of changes to the document
itself |
| Service interruption | A specific kind of exception that signals ACF that the source repository
is unavailable, and gives a best estimate of when it might become available again; see _org.apache.lcf.agents.interfaces.ServiceInterruption_
|

h3. Implementing the Repository Connector class

A very good place to start is to read the javadoc for the repository connector interface.
 You will note that the javadoc describes the usage and pooling model for a connector class
pretty thoroughly.  It is very important to understand the model thoroughly in order to write
reliable connectors!  Use of static variables, for one thing, must be done in a very careful
way, to avoid issues that would be hard to detect with a cursory test.

The second thing to do is to examine some of the provided repository connector implementations.
 A wide variety of connectors is included with ACF, and together they exercise just about every aspect
of the repository connector interface.  These are:

* Documentum (uses RMI to segregate native code, etc.)
* FileNet (also uses RMI, but because it is picky about its open-source jar versions)
* File system (a good, but simple, example)
* LiveLink (demonstrates use of local keystore infrastructure)
* Memex
* Meridio (local keystore, web services, result sets)
* SharePoint (local keystore, web services)
* RSS (local keystore, binning)
* Web (local database schema, local keystore, binning, events and prerequisites, cache management)

You will also note that all of these connectors extend a framework-provided repository connector
base class, found at _org.apache.lcf.crawler.connectors.BaseRepositoryConnector_.  This base
class furnishes some basic bookkeeping logic for managing the connector pool, as well as default
implementations of some of the less typical functionality a connector may have.  For example,
connectors are allowed to have database tables of their own, which are instantiated when the
connector is registered, and are torn down when the connector is removed.  This is, however,
not very typical, and the base implementation reflects that.

h5. Principal methods

The principal methods an implementer should be concerned with for creating a repository connector
are the following:

|| Method || What it should do ||
| *addSeedDocuments()* | Use the supplied document specification to come up with an initial
set of document identifiers |
| *getDocumentVersions()* | Come up with a version string for each of the documents described
by the supplied set of document identifiers, or signal if the document is no longer present
|
| *processDocuments()* | Take the appropriate action (e.g. ingest, or extract references from,
or whatever) for a given set of documents described by document identifier and version string
|
| *outputConfigurationHeader()* | Output the head-section part of a repository connection
_ConfigParams_ editing page |
| *outputConfigurationBody()* | Output the body-section part of a repository connection _ConfigParams_
editing page |
| *processConfigurationPost()* | Receive and process form data from a repository connection
_ConfigParams_ editing page |
| *viewConfiguration()* | Output the viewing HTML for a repository connection _ConfigParams_
object |
| *outputSpecificationHeader()* | Output the head-section part of a _DocumentSpecification_
editing page |
| *outputSpecificationBody()* | Output the body-section part of a _DocumentSpecification_
editing page |
| *processSpecificationPost()* | Receive and process form data from a _DocumentSpecification_
editing page |
| *viewSpecification()* | Output the viewing page for a _DocumentSpecification_ object |

These methods come in three broad classes: (a) functional methods for doing the work of the
connector; (b) UI methods for configuring a connection; and (c) UI methods for editing the
document specification for a job.  Together they do the heavy lifting of your connector. 
But before you can write any code at all, you need to plan things out a bit.

h5. Model

Each connector must declare a specific model which it adheres to.  These models basically
describe what the *addSeedDocuments()* method actually does, and are described below.

|| Model || Description ||
| _MODEL_ADD_ | The *addSeedDocuments()* method supplies at least all the matching documents
that have been added to the repository, within the specified time interval |
| _MODEL_ADD_CHANGE_ | The *addSeedDocuments()* method supplies at least those matching documents
that have been added or changed in the repository, within the specified time interval |
| _MODEL_ADD_CHANGE_DELETE_ | The *addSeedDocuments()* method supplies at least those matching
documents that have been added, changed, or removed in the repository, within the specified
time interval |
| _MODEL_PARTIAL_ | The *addSeedDocuments()* method does not return a complete list of documents
that match the criteria and time interval, because some of those documents are no longer discoverable
|

Note that the choice of model is actually much more subtle than the above description might
indicate.  It may, for one thing, be affected by characteristics of the repository, such as
whether the repository considers a document to have been changed if its security information
was changed.  If it does not, then even though most document changes are picked up, and
one might therefore be tempted to declare the connector to be _MODEL_ADD_CHANGE_, the correct choice
would in fact be _MODEL_ADD_.

Another subtle point is what documents the connector is actually supposed to return by means
of the *addSeedDocuments()* method.  The start time and end time parameters handed to the
method do not have to be strictly adhered to, for instance; it is always okay to return more
documents.  It is never okay for the connector to return fewer documents than were requested,
on the other hand.
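
The "more is fine, fewer is not" rule can be illustrated with a minimal sketch.  Everything
here is hypothetical: the _SeedingSketch_ class, its _Entry_ type, and the _slackMillis_ parameter
are stand-ins invented for illustration and are not part of the ACF API.  The sketch assumes a
repository whose modification timestamps may lag slightly, so it widens the requested interval
rather than risk missing a document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a repository; not part of ACF. */
final class SeedingSketch {

    /** A repository entry: a document identifier plus its last-modified time. */
    static final class Entry {
        final String id;
        final long modifiedMillis;

        Entry(String id, long modifiedMillis) {
            this.id = id;
            this.modifiedMillis = modifiedMillis;
        }
    }

    /**
     * Returns the identifiers of entries modified in [startMillis, endMillis),
     * with the start of the interval widened by slackMillis.  Widening can only
     * add identifiers, which is acceptable; it can never drop one.
     */
    static List<String> seed(List<Entry> repository, long startMillis,
                             long endMillis, long slackMillis) {
        long from = Math.max(0L, startMillis - slackMillis);
        List<String> ids = new ArrayList<>();
        for (Entry e : repository) {
            if (e.modifiedMillis >= from && e.modifiedMillis < endMillis) {
                ids.add(e.id);
            }
        }
        return ids;
    }
}
```

With a slack of 200 ms, a request for the interval 1000 to 2000 also returns a document
modified at 900.  The framework would then ask for that document's version string and find it
unchanged, so the cost of the extra identifier is small.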

h5. Choosing a document identifier format

In order to decide on the format for a document identifier, you need to understand what this
identifier is used for, and what it represents.  A document identifier usually corresponds
to some entity within the source repository, such as a document or a folder.  Note that there
is *no* requirement that the identifier represent indexable content.

The document identifier must be capable of furnishing enough information to:

* Calculate a version string for the document
* Find child references for the document
* Get the document's content, metadata, and access tokens

We highly recommend that no additional information be included in the document identifier,
other than what is needed for the above, as that will almost certainly cause problems.
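
As one possible illustration, assuming a repository made up of folders and documents addressed
by path, an identifier could carry just a kind marker and the path.  The _DocIdSketch_ class and
its "F:"/"D:" prefixes are hypothetical choices made for this example, not a format prescribed
by ACF.

```java
/** Hypothetical identifier scheme: a one-letter kind marker, a colon, then the path. */
final class DocIdSketch {
    static final char FOLDER = 'F';
    static final char DOCUMENT = 'D';

    /** Builds an identifier such as "F:/docs/reports". */
    static String encode(char kind, String path) {
        if (kind != FOLDER && kind != DOCUMENT) {
            throw new IllegalArgumentException("Unknown kind: " + kind);
        }
        return kind + ":" + path;
    }

    /** True if the identifier names a folder, which has children but no indexable content. */
    static boolean isFolder(String id) {
        check(id);
        return id.charAt(0) == FOLDER;
    }

    /** Recovers the repository path from the identifier. */
    static String pathOf(String id) {
        check(id);
        return id.substring(2);
    }

    /** Rejects anything that was not produced by encode(). */
    private static void check(String id) {
        if (id == null || id.length() < 2 || id.charAt(1) != ':'
                || (id.charAt(0) != FOLDER && id.charAt(0) != DOCUMENT)) {
            throw new IllegalArgumentException("Malformed identifier: " + id);
        }
    }
}
```

The identifier holds only what is needed to locate the entity and decide how to treat it.
Anything that can change over time, such as a revision number, belongs in the version string
instead.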

h5.  Choosing the form of the document version string

The document version string is used by ACF to determine whether or not the document or configuration
changed in such a way as to require that the document be reprocessed.  ACF therefore requests
the version string for any document that is ready for processing, and usually does not process
the document again if the returned version string agrees with the version string it has stored.

Thinking about it more carefully, it is clear that what a connector writer needs to do is
include everything in the version string that could potentially affect how the document gets
processed.  That may include the version of the document in the repository, bits of configuration
information, metadata, and even access tokens (if the underlying repository versions these
things independently from the document itself).  Storing all of that information in the version
string seems like a lot - but the string is unlimited in length, and it actually serves another
useful purpose to do it that way.  Specifically, when it comes time to do the actual processing,
it's often the correct thing to do to obtain the necessary data out of the version string,
rather than calculating it or fetching it anew.  That way of working guarantees that the document
processing was done in a manner that agrees with its recorded version string, thus eliminating
any chance of ACF getting confused.

For longer data that needs to persist between the *getDocumentVersions()* method call and
the *processDocuments()* method call, the connector is welcome to save this information in
a temporary disk file.  To help make sure nothing leaks when this approach is used, the IRepositoryConnector
interface has a method that will be called to clean up any temporary files that might have
been created in the handling of a given document identifier.
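
One way to build such a version string, shown here only as a sketch, is to pack each
contributing value with a length prefix so that the values can be recovered unambiguously at
processing time.  The _VersionStringSketch_ class and its _pack()_ and _unpack()_ methods are
hypothetical helpers written for this example; they are not ACF framework methods.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: packs fields as "<length>:<value>" so any character may appear in a value. */
final class VersionStringSketch {

    /** Concatenates the fields, each preceded by its length and a colon. */
    static String pack(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (String f : fields) {
            sb.append(f.length()).append(':').append(f);
        }
        return sb.toString();
    }

    /** Recovers the fields from a string produced by pack(). */
    static List<String> unpack(String version) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        while (pos < version.length()) {
            int colon = version.indexOf(':', pos);
            if (colon < 0) {
                throw new IllegalArgumentException("Malformed version string");
            }
            int len = Integer.parseInt(version.substring(pos, colon));
            int start = colon + 1;
            if (len < 0 || start + len > version.length()) {
                throw new IllegalArgumentException("Malformed version string");
            }
            fields.add(version.substring(start, start + len));
            pos = start + len;
        }
        return fields;
    }
}
```

A connector following this approach might pack the repository revision, the access tokens, and
the relevant pieces of the document specification when versions are requested, and then unpack
the same values during processing instead of fetching them again.  Any change to any field
yields a different string, which is what triggers reprocessing.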

h5. Notes on connector UI methods

The crawler UI uses a tabbed layout structure, and thus each of these elements must properly
implement the tabbed model.  This means that the "header" methods above must add the desired
tab names to a specified array, and the "body" methods must provide appropriate HTML which
handles both the case where a tab is displayed, and where it is not displayed.  Also, it makes
sense to use the appropriate css definitions, so that the connector UI pages have a similar
look-and-feel to the rest of ACF's crawler UI.  We strongly suggest starting with the UI code
of one of the supplied connectors, both for a description of the arguments to each page, and for
some decent ideas of ways to organize your connector's UI code.  
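
The tabbed model described above can be sketched roughly as follows.  The method signatures
here are deliberately simplified and hypothetical; the real ones are defined by the repository
connector interface and its javadoc, and the "Server" tab and "servername" field are invented
for this example.  The point is only the shape of the logic: the header contributes a tab name,
and the body emits editable fields when its tab is selected and hidden fields otherwise, so that
values survive a switch between tabs.

```java
import java.util.List;

/** Hypothetical, simplified illustration of the tabbed UI model; not the ACF signatures. */
final class TabbedUiSketch {
    static final String TAB_NAME = "Server";

    /** "Header" step: register the tab this connector contributes. */
    static void outputHeader(List<String> tabs) {
        tabs.add(TAB_NAME);
    }

    /** "Body" step: editable field if our tab is showing, hidden field if not. */
    static String outputBody(String selectedTab, String serverName) {
        String value = escape(serverName);
        if (TAB_NAME.equals(selectedTab)) {
            return "<input type=\"text\" name=\"servername\" value=\"" + value + "\"/>";
        }
        return "<input type=\"hidden\" name=\"servername\" value=\"" + value + "\"/>";
    }

    /** Minimal attribute escaping for the sketch; a real connector would use the framework's encoder. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }
}
```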

Please also note that it is good practice to name the form fields in your HTML in such a way
that they cannot collide with form fields that may come from the framework's HTML or any specific
output connector's HTML.  The _DocumentSpecification_ editing HTML especially may be prone
to collisions, because within any given job, this HTML is included in the same page as HTML
from the chosen output connector.
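
A simple way to avoid such collisions, offered as a sketch rather than a framework convention,
is to prefix every field name with a tag unique to the connector.  The _FieldNameSketch_ class
and the "myrepo" tag below are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that namespaces form field names with a connector-specific prefix. */
final class FieldNameSketch {
    private final String prefix;

    FieldNameSketch(String connectorTag) {
        this.prefix = connectorTag + "_";
    }

    /** Name to use in the HTML for one of this connector's fields. */
    String field(String localName) {
        return prefix + localName;
    }

    /** Picks this connector's fields out of posted form data, stripping the prefix. */
    Map<String, String> ownFields(Map<String, String> posted) {
        Map<String, String> own = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : posted.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                own.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return own;
    }
}
```

With the tag "myrepo", a field named "path" is rendered as "myrepo_path", so a "path" field
coming from the framework or from an output connector on the same page is left alone when the
post is processed.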


h3. Implementation support provided by the framework

ACF's framework provides a number of helpful services designed to make the creation of a connector
easier.  These services are summarized below.  (This is not an exhaustive list, by any means.)

* Lock management and synchronization (see _org.apache.lcf.core.interfaces.LockManagerFactory_)
* Cache management (see _org.apache.lcf.core.interfaces.CacheManagerFactory_)
* Local keystore management (see _org.apache.lcf.core.KeystoreManagerFactory_)
* Database management (see _org.apache.lcf.core.DBInterfaceFactory_)

For UI method support, these too are very useful:

* Multipart form processing (see _org.apache.lcf.ui.multipart.MultipartWrapper_)
* HTML encoding (see _org.apache.lcf.ui.util.Encoder_)
* HTML formatting (see _org.apache.lcf.ui.util.Formatter_)

h3. DO's and DON'T DO's

It's always a good idea to make use of an existing infrastructure component, if it's meant
for that purpose, rather than inventing your own.  There are, however, some limitations we
recommend you adhere to.

* DO make use of infrastructure components described in the section above
* DON'T make use of infrastructure components that aren't mentioned, without checking first
* NEVER write connector code that directly uses framework database tables, other than the
ones installed and managed by your connector

If you are tempted to violate these rules, it may well mean you don't understand something
important.  At the very least, we'd like to know why.  Send email to connectors-dev@incubator.apache.org
with a description of your problem and how you are tempted to solve it.

