Date: Tue, 9 Mar 2010 16:39:00 +0000 (UTC)
From: confluence@apache.org
To: connectors-commits@incubator.apache.org
Subject: [CONF] Lucene Connector Framework > How to Write a Repository Connector

Space: Lucene Connector Framework (http://cwiki.apache.org/confluence/display/CONNECTORS)
Page: How to Write a Repository Connector (http://cwiki.apache.org/confluence/display/CONNECTORS/How+to+Write+a+Repository+Connector)

Edited by Karl Wright:
---------------------------------------------------------------------
h1. Writing a Repository Connector

A repository connector furnishes the mechanism for obtaining documents, metadata, and authority tokens from a repository. The documents are expected to be handed to an output connector (described elsewhere) for ingestion into some other back-end repository.

As is the case with all connectors under the LCF umbrella, a repository connector consists of two parts:

* A class implementing an interface (in this case, _org.apache.lcf.crawler.interfaces.IRepositoryConnector_)
* A set of JSPs that implement the crawler UI for the connector

h3. Key concepts

The repository connector abstraction makes use of, or introduces, the following concepts:

|| Concept || What it is ||
| Configuration parameters | A hierarchical structure, internally represented as an XML document, which describes a specific configuration of a specific repository connector, i.e. *how* the connector should do its job; see _org.apache.lcf.core.interfaces.ConfigParams_ |
| Repository connection | A repository connector instance that has been furnished with configuration data |
| Document identifier | An arbitrary identifier, whose meaning is determined only within the context of a specific repository connector, which the connector uses to describe a document within a repository |
| Document URI | The unique URI (or, in some cases, file IRI) of a document, which is meant to be displayed in search engine results as the link to the document |
| Repository document | An object that describes a document's contents, including raw document data (as a stream), metadata (as either strings or streams), and access tokens; see _org.apache.lcf.agents.interfaces.RepositoryDocument_ |
| Access token | A string, which is only meaningful in the context of a specific authority, that describes a quantum of authorization for a user |
| Connection management/threading/pooling model | How an individual repository connector class instance is managed and used |
| Activity infrastructure | The framework API provided to specific methods, allowing those methods to perform specific actions within the framework, e.g. recording the activity history; see _org.apache.lcf.crawler.interfaces.IVersionActivity_, _org.apache.lcf.crawler.interfaces.IProcessActivity_, and _org.apache.lcf.crawler.interfaces.ISeedingActivity_ |
| Document specification | A hierarchical structure, internally represented as an XML document, which describes *what* a specific repository connector should do in the context of a specific job; see _org.apache.lcf.crawler.interfaces.DocumentSpecification_ |
| Document version string | A simple string, used for comparison purposes, that allows LCF to figure out whether a fetch or ingestion operation needs to be repeated, either because the document specification in effect for a document has changed or because the document itself has changed |
| Service interruption | A specific kind of exception that signals to LCF that the repository is unavailable, and gives a best estimate of when it might become available again; see _org.apache.lcf.agents.interfaces.ServiceInterruption_ |

h3. Implementing the Repository Connector class

A very good place to start is to read the javadoc for the repository connector interface. You will note that the javadoc describes the usage and pooling model for a connector class quite thoroughly. It is very important to understand this model in order to write reliable connectors! Static variables, for one thing, must be used very carefully, to avoid issues that a cursory test would be unlikely to detect.

The second thing to do is to examine some of the provided repository connector implementations. A wide variety of connectors are included with LCF, and together they exercise just about every aspect of the repository connector interface. These are:

* Documentum (uses RMI to segregate native code, etc.)
* FileNet (also uses RMI, but because it is picky about its open-source jar versions)
* File system (a good, but simple, example)
* LiveLink (demonstrates use of local keystore infrastructure)
* Memex
* Meridio (local keystore, web services, result sets)
* SharePoint (local keystore, web services)
* RSS (local keystore, binning)
* Web (local database schema, local keystore, binning, events and prerequisites, cache management)

You will also note that all of these connectors extend a framework-provided repository connector base class, found at _org.apache.lcf.crawler.connectors.BaseRepositoryConnector_. This base class furnishes some basic bookkeeping logic for managing the connector pool, as well as default implementations of some of the less typical functionality a connector may have. For example, connectors are allowed to have database tables of their own, which are instantiated when the connector is registered, and are torn down when the connector is removed. This is, however, not very typical, and the base implementation reflects that.

h5. Principal methods

The principal methods an implementer should be concerned with when creating a repository connector are the following:

|| Method || What it should do ||
| *addSeedDocuments()* | Use the supplied document specification to come up with an initial set of document identifiers |
| *getDocumentVersions()* | Come up with a version string for each of the documents described by the supplied set of document identifiers, or signal that a document is no longer present |
| *processDocuments()* | Take the appropriate action (e.g. ingest the document, or extract references from it) for a given set of documents described by document identifier and version string |

These methods will do the heavy lifting of your connector. But before you can write any code at all, you need to plan things out a bit.

h5. Model

Each connector must declare a specific model which it adheres to.
These models describe what the *addSeedDocuments()* method actually does, and are listed below.

|| Model || Description ||
| _MODEL_ADD_ | The *addSeedDocuments()* method supplies at least all the matching documents that have been added to the repository, within the specified time interval |
| _MODEL_ADD_CHANGE_ | The *addSeedDocuments()* method supplies at least those matching documents that have been added or changed in the repository, within the specified time interval |
| _MODEL_ADD_CHANGE_DELETE_ | The *addSeedDocuments()* method supplies at least those matching documents that have been added, changed, or removed in the repository, within the specified time interval |
| _MODEL_PARTIAL_ | The *addSeedDocuments()* method does not return a complete list of documents that match the criteria and time interval, because some of those documents are no longer discoverable |

Note that the choice of model is actually much more subtle than the above description might indicate. It may, for one thing, be affected by characteristics of the repository, such as whether the repository considers a document to have been changed if its security information was changed. If the repository does not, then even though most document changes are picked up, and one might therefore be tempted to declare the connector to be _MODEL_ADD_CHANGE_, the correct choice would in fact be _MODEL_ADD_.

Another subtle point is which documents the connector is actually supposed to return by means of the *addSeedDocuments()* method. The start time and end time parameters handed to the method do not have to be strictly adhered to, for instance; it is always okay to return more documents. It is never okay, on the other hand, for the connector to return fewer documents than were requested.

h5. Choosing a document identifier

In order to decide on the format for a document identifier, you need to understand what this identifier is used for, and what it represents.
A document identifier usually corresponds to some entity within the source repository, such as a document or a folder. Note that there is *no* requirement that the identifier represent indexable content.

The document identifier must be capable of furnishing enough information to:

* Calculate a version string for the document
* Find child references for the document
* Get the document's content, metadata, and access tokens

We highly recommend that no additional information be included in the document identifier, other than what is needed for the above, as that will almost certainly cause problems.

h5. Choosing the form of the document version string

TODO: More implementation details

h3. Implementing a set of Repository Connector JSPs

The repository connector class you write provides, through one of its methods, a symbolic name where the crawler UI will look for repository connector UI components. Your components will therefore have the following path, relative to the crawler UI web application:

_connectors/_

For a repository connector, you need to furnish the following JSPs:

|| JSP name || Where it fits ||
| headerconfig.jsp | Called during the header section of the repository connector configuration editing page |
| editconfig.jsp | Called during the body section of the repository connector configuration editing page |
| postconfig.jsp | Called when the configuration editing page is posted, either on a repost or on a save |
| viewconfig.jsp | Called when the connection configuration is being viewed |
| headerspec.jsp | Called during the header section of a job definition editing page, for which this repository connector has been selected |
| editspec.jsp | Called during the body section of a job definition editing page, for which this repository connector has been selected |
| postspec.jsp | Called whenever a job definition that uses this repository connector is posted, either for a repost or a save |
| viewspec.jsp | Called when a job definition that uses this repository connector is viewed |

TODO: More implementation details

h3. Implementation support provided by the framework

LCF's framework provides a number of helpful services designed to make the creation of a connector easier. These services are summarized below. (This is not an exhaustive list, by any means.)

* Lock management and synchronization (see _org.apache.lcf.core.interfaces.LockManagerFactory_)
* Cache management (see _org.apache.lcf.core.interfaces.CacheManagerFactory_)
* Local keystore management (see _org.apache.lcf.core.KeystoreManagerFactory_)
* Database management (see _org.apache.lcf.core.DBInterfaceFactory_)

For JSP UI component support, these too are very useful:

* Multipart form processing (see _org.apache.lcf.ui.multipart.MultipartWrapper_)
* HTML encoding (see _org.apache.lcf.ui.util.Encoder_)
* HTML formatting (see _org.apache.lcf.ui.util.Formatter_)

h3. DOs and DON'Ts

It's always a good idea to make use of an existing infrastructure component, if it's meant for that purpose, rather than inventing your own. There are, however, some limitations we recommend you adhere to.

* DO make use of infrastructure components described in the section above
* DON'T make use of infrastructure components that aren't mentioned, without checking first
* NEVER write connector code that directly uses framework database tables, other than the ones installed and managed by your connector

If you are tempted to violate these rules, it may well mean you don't understand something important. At the very least, we'd like to know why. Send email to connectors-dev@incubator.apache.org with a description of your problem and how you are tempted to solve it.
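
As a recap of the three principal methods described earlier (*addSeedDocuments()*, *getDocumentVersions()*, and *processDocuments()*), the following is a minimal, self-contained Java sketch of how they relate to one another. It is an illustration only and does not use the real _IRepositoryConnector_ interface: the class _SketchConnector_, the nested _StoredDoc_ type, the in-memory map standing in for a repository, and all of the method signatures are assumptions made up for this example. Consult the javadoc for the actual signatures. The sketch also assumes a version string made up of a last-modified timestamp plus a job setting, which is just one plausible form.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; NOT the real IRepositoryConnector interface. */
final class SketchConnector {

    /** Hypothetical stand-in for one repository entry. */
    static final class StoredDoc {
        final long modified;   // last-modified time, arbitrary units
        final String content;  // raw document content

        StoredDoc(long modified, String content) {
            this.modified = modified;
            this.content = content;
        }
    }

    /** Hypothetical in-memory "repository", keyed by document identifier. */
    final Map<String, StoredDoc> repo = new LinkedHashMap<>();

    /**
     * Seeding: returns the identifiers of documents modified in
     * [startTime, endTime). Returning extra identifiers would be acceptable;
     * omitting a matching one would not.
     */
    List<String> addSeedDocuments(long startTime, long endTime) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, StoredDoc> e : repo.entrySet()) {
            long m = e.getValue().modified;
            if (m >= startTime && m < endTime) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }

    /**
     * Versioning: one version string per identifier, or null when the
     * document is no longer present. The string changes if either the
     * document or the (assumed) job setting changes.
     */
    String[] getDocumentVersions(String[] ids, String specSetting) {
        String[] versions = new String[ids.length];
        for (int i = 0; i < ids.length; i++) {
            StoredDoc d = repo.get(ids[i]);
            versions[i] = (d == null) ? null : d.modified + ":" + specSetting;
        }
        return versions;
    }

    /**
     * Processing: here "processing" just builds a description of what would
     * be handed on for ingestion; documents with a null version are skipped.
     */
    List<String> processDocuments(String[] ids, String[] versions) {
        List<String> ingested = new ArrayList<>();
        for (int i = 0; i < ids.length; i++) {
            StoredDoc d = repo.get(ids[i]);
            if (versions[i] == null || d == null) {
                continue;
            }
            ingested.add(ids[i] + "@" + versions[i] + " -> " + d.content);
        }
        return ingested;
    }

    public static void main(String[] args) {
        SketchConnector c = new SketchConnector();
        c.repo.put("doc-1", new StoredDoc(100L, "alpha"));
        c.repo.put("doc-2", new StoredDoc(200L, "beta"));

        String[] ids = c.addSeedDocuments(0L, 150L).toArray(new String[0]);
        String[] versions = c.getDocumentVersions(ids, "metadata=on");
        for (String line : c.processDocuments(ids, versions)) {
            System.out.println(line);
        }
    }
}
```

Note that the document identifier in this sketch carries only what is needed to look the document up again, in line with the recommendation above; everything else is derived from the repository at the time of the call.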