Writing repository connectors

Writing repository connectors
- Key concepts
- Implementing the Repository Connector class
  - Principal methods
  - Model
  - Choosing a document identifier format
  - Choosing the form of the document version string
  - Notes on connector UI methods
- Implementation support provided by the framework
- DO's and DON'T DO's

Writing repository connectors

A repository connector furnishes the mechanism for obtaining documents, metadata, and authority tokens from a repository. The documents are expected to be handed to an output connector (described elsewhere) for ingestion into some other back-end repository.

As is the case with all connectors under the ManifoldCF umbrella, a repository connector consists of only one part:

A class implementing an interface (in this case, org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector)

Key concepts

The repository connector abstraction makes use of, or introduces, the following concepts:

Concept	What it is
Configuration parameters	A hierarchical structure, internally represented as an XML document, which describes a specific configuration of a specific repository connector, i.e. how the connector should do its job; see org.apache.manifoldcf.core.interfaces.ConfigParams
Repository connection	A repository connector instance that has been furnished with configuration data
Document identifier	An arbitrary identifier, whose meaning is determined only within the context of a specific repository connector, which the connector uses to describe a document within a repository
Document URI	The unique URI (or, in some cases, file IRI) of a document, which is meant to be displayed in search engine results as the link to the document
Repository document	An object that describes a document's contents, including raw document data (as a stream), metadata (as either strings or streams), and access tokens; see org.apache.manifoldcf.agents.interfaces.RepositoryDocument
Access token	A string, which is only meaningful in the context of a specific authority, that describes a quantum of authorization for a user
Connection management/threading/pooling model	How an individual repository connector class instance is managed and used
Activity infrastructure	The framework API provided to specific methods, allowing those methods to perform specific actions within the framework, e.g. recording the activity history; see org.apache.manifoldcf.crawler.interfaces.IVersionActivity, org.apache.manifoldcf.crawler.interfaces.IProcessActivity, and org.apache.manifoldcf.crawler.interfaces.ISeedingActivity
Document specification	A hierarchical structure, internally represented as an XML document, which describes what a specific repository connector should do in the context of a specific job; see org.apache.manifoldcf.crawler.interfaces.DocumentSpecification
Document version string	A simple string, used for comparison purposes, that allows ManifoldCF to figure out if a fetch or ingestion operation needs to be repeated as a result of changes to the document specification in effect for a document, or because of changes to the document itself
Service interruption	A specific kind of exception that signals ManifoldCF that the repository is unavailable, and gives a best estimate of when it might become available again; see org.apache.manifoldcf.agents.interfaces.ServiceInterruption

Implementing the Repository Connector class

A very good place to start is the javadoc for the repository connector interface, which describes the usage and pooling model for a connector class quite thoroughly. It is very important to understand this model in order to write reliable connectors! Static variables in particular must be used very carefully, to avoid problems that a cursory test would not reveal.

The second thing to do is to examine some of the provided repository connector implementations. A wide variety of connectors are included with ManifoldCF, and together they exercise just about every aspect of the repository connector interface. These are:

Documentum (uses RMI to segregate native code, etc.)
FileNet (also uses RMI, but because it is picky about its open-source jar versions)
File system (a good, but simple, example)
LiveLink (demonstrates use of local keystore infrastructure)
Meridio (local keystore, web services, result sets)
SharePoint (local keystore, web services)
RSS (local keystore, binning)
Web (local database schema, local keystore, binning, events and prerequisites, cache management)

You will also note that all of these connectors extend a framework-provided repository connector base class, found at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector. This base class furnishes some basic bookkeeping logic for managing the connector pool, as well as default implementations of some of the less typical functionality a connector may have. For example, connectors are allowed to have database tables of their own, which are instantiated when the connector is registered, and are torn down when the connector is removed. This is, however, not very typical, and the base implementation reflects that.

Principal methods

The principal methods an implementer should be concerned with when creating a repository connector are the following:

Method	What it should do
addSeedDocuments()	Use the supplied document specification to come up with an initial set of document identifiers
getDocumentVersions()	Come up with a version string for each of the documents described by the supplied set of document identifiers, or signal if the document is no longer present
processDocuments()	Take the appropriate action (e.g. ingest the document, or extract references from it) for a given set of documents described by document identifier and version string
outputConfigurationHeader()	Output the head-section part of a repository connection ConfigParams editing page
outputConfigurationBody()	Output the body-section part of a repository connection ConfigParams editing page
processConfigurationPost()	Receive and process form data from a repository connection ConfigParams editing page
viewConfiguration()	Output the viewing HTML for a repository connection ConfigParams object
outputSpecificationHeader()	Output the head-section part of a DocumentSpecification editing page
outputSpecificationBody()	Output the body-section part of a DocumentSpecification editing page
processSpecificationPost()	Receive and process form data from a DocumentSpecification editing page
viewSpecification()	Output the viewing page for a DocumentSpecification object

These methods come in three broad classes: (a) functional methods for doing the work of the connector; (b) UI methods for configuring a connection; and (c) UI methods for editing the document specification for a job. Together they do the heavy lifting of your connector. But before you can write any code at all, you need to plan things out a bit.
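To make the division of labor among the three functional methods concrete, here is a minimal, self-contained sketch of the seed / version / process cycle. It does not use the ManifoldCF interfaces: `CrawlLoopSketch`, `ToyConnector`, and their simplified signatures are hypothetical stand-ins, and the real methods take additional arguments (specifications, activity objects, and so on) that are described in the javadoc.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CrawlLoopSketch {
    /** Hypothetical, heavily simplified stand-in for a repository connector. */
    public interface ToyConnector {
        /** Returns an initial set of document identifiers. */
        List<String> addSeedDocuments();

        /** Returns a version string, or null if the document no longer exists. */
        String getDocumentVersion(String documentIdentifier);

        /** Ingests the document, extracts references from it, etc. */
        void processDocument(String documentIdentifier, String version);
    }

    /** Version strings recorded by earlier passes, keyed by document identifier. */
    private final Map<String, String> storedVersions = new LinkedHashMap<>();

    /** Runs one pass and returns the identifiers that were (re)processed. */
    public List<String> crawl(ToyConnector connector) {
        List<String> processed = new ArrayList<>();
        for (String id : connector.addSeedDocuments()) {
            String version = connector.getDocumentVersion(id);
            if (version == null) {
                // The connector signalled that the document is gone.
                storedVersions.remove(id);
                continue;
            }
            if (version.equals(storedVersions.get(id))) {
                // Same version string as last time: nothing to do.
                continue;
            }
            connector.processDocument(id, version);
            storedVersions.put(id, version);
            processed.add(id);
        }
        return processed;
    }
}
```

In this toy loop a document is reprocessed only when its version string differs from the stored one, which is why the content of the version string matters so much.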

Model

Each connector must declare a specific model which it adheres to. These models basically describe what the addSeedDocuments() method actually does, and are described below.

Model	Description
MODEL_ADD	The addSeedDocuments() method supplies at least all the matching documents that have been added to the repository, within the specified time interval
MODEL_ADD_CHANGE	The addSeedDocuments() method supplies at least those matching documents that have been added or changed in the repository, within the specified time interval
MODEL_ADD_CHANGE_DELETE	The addSeedDocuments() method supplies at least those matching documents that have been added, changed, or removed in the repository, within the specified time interval
MODEL_PARTIAL	The addSeedDocuments() method does not return a complete list of documents that match the criteria and time interval, because some of those documents are no longer discoverable

Note that the choice of model is more subtle than the above descriptions might indicate. It may, for one thing, be affected by characteristics of the repository, such as whether the repository considers a document to have been changed if its security information was changed. If the repository does not, then some changes will go undetected at seeding time: even though most document changes are picked up, and one might thus be tempted to declare the connector to be MODEL_ADD_CHANGE, the correct choice would in fact be MODEL_ADD.

Another subtle point is what documents the connector is actually supposed to return by means of the addSeedDocuments() method. The start time and end time parameters handed to the method do not have to be strictly adhered to, for instance; it is always okay to return more documents. It is never okay for the connector to return fewer documents than were requested, on the other hand.
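The "more is fine, fewer is not" rule can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: the `SeedWindow` class, its `Doc` type, the half-open interval, and the one-minute slack are not part of ManifoldCF.

```java
import java.util.ArrayList;
import java.util.List;

public class SeedWindow {
    /** Hypothetical repository entry: an identifier plus a last-modified time. */
    public static final class Doc {
        final String id;
        final long modifiedMillis;

        public Doc(String id, long modifiedMillis) {
            this.id = id;
            this.modifiedMillis = modifiedMillis;
        }
    }

    /** Assumed allowance for clock skew between crawler and repository. */
    static final long SLACK_MILLIS = 60_000L;

    /**
     * Returns identifiers of documents modified in [startMillis, endMillis),
     * deliberately widening the lower bound. Returning a few extra documents
     * is harmless; missing one that falls inside the interval is not.
     */
    public static List<String> seed(List<Doc> repository, long startMillis, long endMillis) {
        long widenedStart = Math.max(0L, startMillis - SLACK_MILLIS);
        List<String> ids = new ArrayList<>();
        for (Doc d : repository) {
            if (d.modifiedMillis >= widenedStart && d.modifiedMillis < endMillis) {
                ids.add(d.id);
            }
        }
        return ids;
    }
}
```

The extra documents cost only a version check each, since an unchanged version string normally means no reprocessing.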

Choosing a document identifier format

In order to decide on the format for a document identifier, you need to understand what this identifier is used for, and what it represents. A document identifier usually corresponds to some entity within the source repository, such as a document or a folder. Note that there is no requirement that the identifier represent indexable content.

The document identifier must be capable of furnishing enough information to:

Calculate a version string for the document
Find child references for the document
Get the document's content, metadata, and access tokens

We highly recommend that no additional information be included in the document identifier, other than what is needed for the above, as that will almost certainly cause problems.
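One possible identifier scheme, shown here purely as a sketch, is a one-character type prefix followed by the repository path. The `DocumentIdentifiers` class and the `F`/`D` prefixes are hypothetical; the point is only that the identifier carries just enough to locate the entity and decide how to treat it.

```java
public class DocumentIdentifiers {
    public enum Kind { FOLDER, DOCUMENT }

    /** Builds an identifier for a folder, which is crawled for child references. */
    public static String forFolder(String path) {
        return "F" + path;
    }

    /** Builds an identifier for a document, which carries indexable content. */
    public static String forDocument(String path) {
        return "D" + path;
    }

    /** Decodes the entity type from the leading prefix character. */
    public static Kind kindOf(String identifier) {
        if (identifier.isEmpty()) {
            throw new IllegalArgumentException("Empty identifier");
        }
        switch (identifier.charAt(0)) {
            case 'F': return Kind.FOLDER;
            case 'D': return Kind.DOCUMENT;
            default:
                throw new IllegalArgumentException("Unknown identifier type: " + identifier);
        }
    }

    /** Returns the repository path, after validating the prefix. */
    public static String pathOf(String identifier) {
        kindOf(identifier);
        return identifier.substring(1);
    }
}
```

For example, `DocumentIdentifiers.forFolder("/reports")` yields `F/reports`, from which both the entity type and its path can be recovered without any other state.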

Choosing the form of the document version string

The document version string is used by ManifoldCF to determine whether or not the document or configuration changed in such a way as to require that the document be reprocessed. ManifoldCF therefore requests the version string for any document that is ready for processing, and usually does not process the document again if the returned version string agrees with the version string it has stored.

Thinking about it more carefully, it is clear that a connector writer needs to include in the version string everything that could potentially affect how the document gets processed. That may include the version of the document in the repository, bits of configuration information, metadata, and even access tokens (if the underlying repository versions these things independently from the document itself). Storing all of that information in the version string seems like a lot, but the string is unlimited in length, and doing so serves another useful purpose. Specifically, when it comes time to do the actual processing, it is often best to obtain the necessary data from the version string, rather than calculating it or fetching it anew. Working that way guarantees that the document is processed in a manner that agrees with its recorded version string, which eliminates any chance of ManifoldCF getting confused.
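As a rough illustration of packing several values into one version string and reading them back at processing time, the following sketch uses a simple length-prefixed encoding. `VersionStrings` is a hypothetical helper written for this example, not a ManifoldCF class, and the encoding is only one of many that would work.

```java
import java.util.ArrayList;
import java.util.List;

public class VersionStrings {
    /** Packs each field as "<length>:<value>", so values may contain any character. */
    public static String pack(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            sb.append(field.length()).append(':').append(field);
        }
        return sb.toString();
    }

    /** Reverses pack(), returning the fields in their original order. */
    public static List<String> unpack(String versionString) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        while (pos < versionString.length()) {
            int colon = versionString.indexOf(':', pos);
            if (colon < 0) {
                throw new IllegalArgumentException("Malformed version string");
            }
            int length = Integer.parseInt(versionString.substring(pos, colon));
            int start = colon + 1;
            fields.add(versionString.substring(start, start + length));
            pos = start + length;
        }
        return fields;
    }
}
```

A connector following this approach would pack, say, the repository revision and the access tokens when computing the version, then unpack the same values when processing, so that processing always agrees with the recorded string.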

For longer data that needs to persist between the getDocumentVersions() method call and the processDocuments() method call, the connector is welcome to save this information in a temporary disk file. To help make sure nothing leaks when this approach is used, the IRepositoryConnector interface has a method that will be called to clean up any temporary files that might have been created in the handling of a given document identifier.
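The bookkeeping behind such temporary files might look like the sketch below. `TempDataStore` and its method names are hypothetical; in a real connector, `release` would be invoked from the interface's cleanup method mentioned above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TempDataStore {
    /** Temporary files currently held, keyed by document identifier. */
    private final Map<String, Path> files = new ConcurrentHashMap<>();

    /** Saves data gathered at versioning time for later use during processing. */
    public Path save(String documentIdentifier, byte[] data) {
        try {
            Path file = Files.createTempFile("connector-", ".tmp");
            Files.write(file, data);
            Path previous = files.put(documentIdentifier, file);
            if (previous != null) {
                Files.deleteIfExists(previous); // do not leak an older file
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the saved data, or null if nothing is held for the identifier. */
    public byte[] load(String documentIdentifier) {
        Path file = files.get(documentIdentifier);
        if (file == null) {
            return null;
        }
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes whatever was saved for the identifier; safe to call repeatedly. */
    public void release(String documentIdentifier) {
        Path file = files.remove(documentIdentifier);
        if (file != null) {
            try {
                Files.deleteIfExists(file);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

Keying the files by document identifier matches the granularity at which cleanup is requested, so a release call never needs to know what was stored or why.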

Notes on connector UI methods

The crawler UI uses a tabbed layout structure, and thus each of these elements must properly implement the tabbed model. This means that the "header" methods above must add the desired tab names to a specified array, and the "body" methods must provide appropriate HTML which handles both the case where a tab is displayed and the case where it is not. It also makes sense to use the appropriate CSS definitions, so that the connector UI pages have a look and feel similar to the rest of ManifoldCF's crawler UI. We strongly suggest starting with the UI code of one of the supplied connectors, both for a description of the arguments to each page, and for some decent ideas of ways to organize your connector's UI code.

Please also note that it is good practice to name the form fields in your HTML in such a way that they cannot collide with form fields that may come from the framework's HTML or any specific output connector's HTML. The DocumentSpecification editing HTML especially may be prone to collisions, because within any given job, this HTML is included in the same page as HTML from the chosen output connector.
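One simple way to avoid such collisions is to give every form field a connector-specific prefix. The sketch below is illustrative only: `FormFields` is a hypothetical helper, and its `escape` method is a stand-in for the framework's own HTML encoding support, which a real connector should prefer.

```java
public class FormFields {
    private final String prefix;

    /** @param connectorPrefix a short name unique to this connector, e.g. "myrepo" */
    public FormFields(String connectorPrefix) {
        this.prefix = connectorPrefix + "_";
    }

    /** Returns the collision-resistant form field name. */
    public String name(String field) {
        return prefix + field;
    }

    /** Minimal escaping for text placed inside a double-quoted HTML attribute. */
    public static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Renders a hidden input, as a "body" method might for a tab that is not displayed. */
    public String hiddenInput(String field, String value) {
        return "<input type=\"hidden\" name=\"" + escape(name(field))
            + "\" value=\"" + escape(value) + "\"/>";
    }
}
```

The same `name()` call would be used when reading the posted form data back, so the prefix is defined in exactly one place.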

Implementation support provided by the framework

ManifoldCF's framework provides a number of helpful services designed to make the creation of a connector easier. These services are summarized below. (This is not an exhaustive list, by any means.)

Lock management and synchronization (see org.apache.manifoldcf.core.interfaces.LockManagerFactory)
Cache management (see org.apache.manifoldcf.core.interfaces.CacheManagerFactory)
Local keystore management (see org.apache.manifoldcf.core.interfaces.KeystoreManagerFactory)
Database management (see org.apache.manifoldcf.core.interfaces.DBInterfaceFactory)

For UI method support, these too are very useful:

Multipart form processing (see org.apache.manifoldcf.ui.multipart.MultipartWrapper)
HTML encoding (see org.apache.manifoldcf.ui.util.Encoder)
HTML formatting (see org.apache.manifoldcf.ui.util.Formatter)

DO's and DON'T DO's

It's always a good idea to make use of an existing infrastructure component, if it's meant for that purpose, rather than inventing your own. There are, however, some limitations we recommend you adhere to.

DO make use of infrastructure components described in the section above
DON'T make use of infrastructure components that aren't mentioned, without checking first
NEVER write connector code that directly uses framework database tables, other than the ones installed and managed by your connector

If you are tempted to violate these rules, it may well mean you don't understand something important. At the very least, we'd like to know why. Send email to connectors-dev@incubator.apache.org with a description of your problem and how you are tempted to solve it.

Connector type	Function
Authority connector	Furnishes a standard way of mapping a user name to access tokens that are meaningful for a given type of repository
Repository connector	Fetches documents from a specific kind of repository, such as SharePoint or off the web
Output connector	Pushes document ingestion requests and deletion requests to a specific kind of back end search engine or other entity, such as Lucene

Writing connectors for ManifoldCF is a great way to learn about the project and contribute something useful! Read about how to do that by navigating to the links provided below.

Concept	What it is
Configuration parameters	A hierarchical structure, internally represented as an XML document, which describes a specific configuration of a specific authority connector, i.e. how the connector should do its job; see org.apache.manifoldcf.core.interfaces.ConfigParams
Authority connection	An authority connector instance that has been furnished with configuration data
User name	The name of a user, which is often a Kerberos principal name, e.g. john@apache.org
Access token	An arbitrary string, which is only meaningful within the context of a specific authority connector, that describes a quantum of authorization
Connection management/threading/pooling model	How an individual authority connector class instance is managed and used
Service interruption	A specific kind of exception that signals ManifoldCF that the output repository is unavailable, and gives a best estimate of when it might become available again; see org.apache.manifoldcf.agents.interfaces.ServiceInterruption

Method	What it should do
getAuthorizationResponse()	Obtain the authorization response, given a user name
outputConfigurationHeader()	Output the head-section part of an authority connection ConfigParams editing page
outputConfigurationBody()	Output the body-section part of an authority connection ConfigParams editing page
processConfigurationPost()	Receive and process form data from an authority connection ConfigParams editing page
viewConfiguration()	Output the viewing HTML for an authority connection ConfigParams object

Condition	Meaning
RESPONSE_OK	The access tokens for the user were successfully obtained from the repository, and are being returned
RESPONSE_UNREACHABLE	The repository is currently unreachable, and appropriate disabling tokens are being returned
RESPONSE_USERNOTFOUND	The user was not found within the repository, and appropriate disabling tokens are being returned
RESPONSE_USERUNAUTHORIZED	The user was found, but was in some way disabled, and appropriate disabling tokens are being returned

Concept	What it is
Configuration parameters	A hierarchical structure, internally represented as an XML document, which describes a specific configuration of a specific output connector, i.e. how the connector should do its job; see org.apache.manifoldcf.core.interfaces.ConfigParams
Output connection	An output connector instance that has been furnished with configuration data
Document URI	The unique URI (or, in some cases, file IRI) of a document, which is meant to be displayed in search engine results as the link to the document
Repository document	An object that describes a document's contents, including raw document data (as a stream), metadata (as either strings or streams), and access tokens; see org.apache.manifoldcf.agents.interfaces.RepositoryDocument
Connection management/threading/pooling model	How an individual output connector class instance is managed and used
Activity infrastructure	The framework API provided to specific methods allowing those methods to perform specific actions within the framework, e.g. recording activities; see org.apache.manifoldcf.agents.interfaces.IOutputAddActivity and org.apache.manifoldcf.agents.interfaces.IOutputRemoveActivity
Output specification	A hierarchical structure, internally represented as an XML document, which describes what a specific output connector should do in the context of a specific job; see org.apache.manifoldcf.agents.interfaces.OutputSpecification
Output version string	A simple string, used for comparison purposes, that allows ManifoldCF to figure out if an ingestion operation needs to be repeated as a result of changes to the output specification in effect for a document
Service interruption	A specific kind of exception that signals ManifoldCF that the output repository is unavailable, and gives a best estimate of when it might become available again; see org.apache.manifoldcf.agents.interfaces.ServiceInterruption

Method	What it should do
checkDocumentIndexable()	Decide whether a file is indexable or not
getOutputDescription()	Use the supplied output specification to come up with an output version string
addOrReplaceDocument()	Add or replace the specified document within the target repository, or signal if the document cannot be handled
removeDocument()	Remove the specified document from the target repository
outputConfigurationHeader()	Output the head-section part of an output connection ConfigParams editing page
outputConfigurationBody()	Output the body-section part of an output connection ConfigParams editing page
processConfigurationPost()	Receive and process form data from an output connection ConfigParams editing page
viewConfiguration()	Output the viewing HTML for an output connection ConfigParams object
outputSpecificationHeader()	Output the head-section part of an OutputSpecification editing page
outputSpecificationBody()	Output the body-section part of an OutputSpecification editing page
processSpecificationPost()	Receive and process form data from an OutputSpecification editing page
viewSpecification()	Output the viewing page for an OutputSpecification object
