incubator-connectors-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Lucene Connector Framework > How to Write an Output Connector
Date Tue, 09 Mar 2010 18:47:00 GMT
Space: Lucene Connector Framework (http://cwiki.apache.org/confluence/display/CONNECTORS)
Page: How to Write an Output Connector (http://cwiki.apache.org/confluence/display/CONNECTORS/How+to+Write+an+Output+Connector)


Edited by Karl Wright:
---------------------------------------------------------------------
h1. Writing an Output Connector

An output connector furnishes the mechanism by which content that has been fetched from a
repository gets handed to a back-end repository for processing.  It also furnishes a mechanism
for removing previously-processed content from that back-end repository.

As is the case with all connectors under the LCF umbrella, an output connector consists of
two parts:

* A class implementing an interface (in this case, _org.apache.lcf.agents.interfaces.IOutputConnector_)
* A set of JSPs that implement the crawler UI for the connector

h3. Key concepts

The output connector abstraction makes use of, or introduces, the following concepts:

|| Concept || What it is ||
| Configuration parameters | A hierarchical structure, internally represented as an XML document,
which describes a specific configuration of a specific output connector, i.e. *how* the connector
should do its job; see _org.apache.lcf.core.interfaces.ConfigParams_ |
| Output connection | An output connector instance that has been furnished with configuration
data |
| Document URI | The unique URI (or, in some cases, file IRI) of a document, which is meant
to be displayed in search engine results as the link to the document |
| Repository document | An object that describes a document's contents, including raw document
data (as a stream), metadata (as either strings or streams), and access tokens; see _org.apache.lcf.agents.interfaces.RepositoryDocument_
|
| Connection management/threading/pooling model | How an individual output connector class
instance is managed and used |
| Activity infrastructure | The framework API provided to specific methods allowing those
methods to perform specific actions within the framework, e.g. recording activities; see _org.apache.lcf.agents.interfaces.IOutputAddActivity_
and _org.apache.lcf.agents.interfaces.IOutputRemoveActivity_ |
| Output specification | A hierarchical structure, internally represented as an XML document,
which describes *what* a specific output connector should do in the context of a specific
job; see _org.apache.lcf.agents.interfaces.OutputSpecification_ |
| Output version string | A simple string, used for comparison purposes, that allows LCF to
figure out if an ingestion operation needs to be repeated as a result of changes to the output
specification in effect for a document |
| Service interruption | A specific kind of exception that signals LCF that the output repository
is unavailable, and gives a best estimate of when it might become available again; see _org.apache.lcf.agents.interfaces.ServiceInterruption_
|


h3. Implementing the Output Connector class

A good place to start is the javadoc for the output connector interface, which describes
the usage and pooling model for a connector class in detail.  You must understand that model
well in order to write reliable connectors!  Static variables in particular must be used very
carefully, because they are shared by every instance of the connector class, and the resulting
issues would be hard to detect with a cursory test.
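
To make the point about static variables concrete, here is a minimal sketch of the distinction
between per-instance and static state.  The class and its _connect()_, _disconnect()_ and _send()_
methods are hypothetical stand-ins invented for illustration, not LCF classes or signatures; the
javadoc remains the authority on the real pooling model.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a pooled connector class; not an LCF class.
final class PooledConnectorSketch {
    // Static state is shared by every instance in the JVM, so it must be thread-safe.
    private static final AtomicLong TOTAL_DOCUMENTS = new AtomicLong();

    // Per-instance state belongs to one configured connection at a time.
    private String serverUrl;
    private long documentsThisConnection;

    // Assumed lifecycle hook: the instance is handed its configuration.
    void connect(String serverUrl) {
        this.serverUrl = serverUrl;
        this.documentsThisConnection = 0;
    }

    // Assumed lifecycle hook: the instance is returned to the pool.
    void disconnect() {
        this.serverUrl = null;
    }

    String send(String documentUri) {
        if (serverUrl == null) {
            throw new IllegalStateException("not connected");
        }
        documentsThisConnection++;
        TOTAL_DOCUMENTS.incrementAndGet();
        return serverUrl + " <- " + documentUri;
    }

    long documentsThisConnection() {
        return documentsThisConnection;
    }

    static long totalDocuments() {
        return TOTAL_DOCUMENTS.get();
    }
}
```

Had _serverUrl_ been declared static, two instances configured for different servers would
silently overwrite each other, which is exactly the kind of problem a single-connection test
never shows.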

The second thing to do is to examine some of the provided output connector implementations.
The GTS connector, the SOLR connector, and the Null Output connector are all output connectors
which demonstrate (to some degree) the sorts of techniques you will need for an effective
implementation.  You will also note that all of these connectors extend a framework-provided
output connector base class, found at _org.apache.lcf.agents.output.BaseOutputConnector_.
This base class furnishes some basic bookkeeping logic for managing the connector pool, as
well as default implementations of some of the less typical functionality a connector may
have.  For example, connectors are allowed to have database tables of their own, which are
instantiated when the connector is registered, and are torn down when the connector is removed.
This is, however, not very typical, and the base implementation reflects that.

h5. Principal methods

The principal methods an implementer should be concerned with for creating an output connector
are the following:

|| Method || What it should do ||
| *getOutputDescription()* | Use the supplied output specification to come up with an output
version string |
| *addOrReplaceDocument()* | Add or replace the specified document within the target repository,
or signal if the document cannot be handled |
| *removeDocument()* | Remove the specified document from the target repository |
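
As a rough illustration of how the three methods in the table divide the work, here is a minimal
sketch built around an in-memory map.  Everything in it is a stand-in: the class name, the
ACCEPTED and REJECTED constants, the plain map used in place of an output specification, and
the simplified method signatures are assumptions made for this example, not the real
_IOutputConnector_ interface.  Consult the javadoc for the actual arguments.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified, hypothetical stand-in for an output connector; signatures are NOT the LCF ones.
final class InMemoryOutputSketch {
    static final int ACCEPTED = 0;  // assumed status code for "document was indexed"
    static final int REJECTED = 1;  // assumed status code for "document cannot be handled"

    private final Map<String, String> index = new LinkedHashMap<>();
    private final int maxLength;

    InMemoryOutputSketch(int maxLength) {
        this.maxLength = maxLength;
    }

    // Turn the (simplified) output specification into an output version string.
    String getOutputDescription(Map<String, String> specification) {
        return "collection=" + specification.getOrDefault("collection", "");
    }

    // Add or replace the document, or signal that it cannot be handled.
    int addOrReplaceDocument(String documentUri, String outputDescription, String content) {
        if (content.length() > maxLength) {
            return REJECTED;
        }
        index.put(documentUri, outputDescription + "|" + content);
        return ACCEPTED;
    }

    // Remove a previously indexed document; removing an unknown URI is a no-op.
    void removeDocument(String documentUri) {
        index.remove(documentUri);
    }

    // Helper for inspection only: what the "back end" currently holds for a URI, or null.
    String stored(String documentUri) {
        return index.get(documentUri);
    }
}
```

Note that the add method receives the output version string rather than the specification
itself, so whatever it needs at ingestion time has to be recoverable from that string.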

These methods will do the heavy lifting of your connector.  But before you can write any code
at all, you need to plan things out a bit.

h5.  Choosing the form of the output version string

The output version string is used by LCF to determine whether or not the output specification
or configuration changed in such a way as to require that the document be reprocessed.  LCF
therefore requests the output version string for any document that is ready for processing,
and usually does not process the document again if the returned output version string agrees
with the output version string it has stored.

An output connector writer therefore needs to include in the output version string everything
that could potentially affect how the document gets ingested, except for anything that is
specific to the repository connector.  That may include bits of output connector configuration
information, as well as data from the output specification.  When it's time to ingest, it is
usually best to obtain the necessary data for ingestion out of the output version string,
rather than calculating it or fetching it anew, because that guarantees that the document is
processed in a manner that agrees with its recorded output version string, thus eliminating
any chance of LCF getting confused.
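
One way to follow this advice is to pack every relevant value into the version string in a form
that can be unpacked again at ingestion time.  The sketch below uses a length-prefixed encoding
so that values containing the separator character still round-trip.  The class, the method names
and the encoding itself are assumptions chosen for illustration; LCF does not mandate any
particular format, and the shipped connectors may do this differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the encoding is an illustrative choice, not an LCF requirement.
final class OutputVersionSketch {
    // Encode each value as "<length>:<value>" and concatenate the results.
    static String pack(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (String value : values) {
            sb.append(value.length()).append(':').append(value);
        }
        return sb.toString();
    }

    // Recover the original values, in order, from a string produced by pack().
    static List<String> unpack(String versionString) {
        List<String> values = new ArrayList<>();
        int pos = 0;
        while (pos < versionString.length()) {
            int colon = versionString.indexOf(':', pos);
            int length = Integer.parseInt(versionString.substring(pos, colon));
            int start = colon + 1;
            values.add(versionString.substring(start, start + length));
            pos = start + length;
        }
        return values;
    }

    // A document needs re-ingestion when the stored string differs from the current one.
    static boolean needsReingest(String storedVersion, String currentVersion) {
        return !currentVersion.equals(storedVersion);
    }
}
```

For example, packing the values "news" and "a:b" yields "4:news3:a:b", and unpacking that string
returns the same two values even though the second one contains a colon.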

h3. Implementing a set of Output Connector JSPs

The output connector class you write provides, through one of its methods, a symbolic name
where the crawler UI will look for output connector UI components.  Your components will therefore
have the following path, relative to the crawler UI web application:

_output/<connector_symbolic_name>_

For an output connector, you need to furnish the following JSPs:

|| JSP name || Where it fits ||
| *headerconfig.jsp* | Called during the header section of the output connector configuration
editing page |
| *editconfig.jsp* | Called during the body section of the output connector configuration
editing page |
| *postconfig.jsp* | Called when the configuration editing page is posted, either on a repost
or on a save |
| *viewconfig.jsp* | Called when the connection configuration is being viewed |
| *headerspec.jsp* | Called during the header section of a job definition editing page, for
which this output connector has been selected |
| *editspec.jsp* | Called during the body section of a job definition editing page, for which
this output connector has been selected |
| *postspec.jsp* | Called whenever a job definition that uses this output connector is posted,
either for a repost or a save |
| *viewspec.jsp* | Called when a job definition that uses this output connector is viewed
|

As you might be able to tell, the "config" elements are responsible for editing and viewing
a _ConfigParams_ object, while the "spec" elements are responsible for editing and viewing
an _OutputSpecification_ object.

The crawler UI uses a tabbed layout structure, and thus each of these elements must properly
implement the tabbed model.  This means that the "header" elements above must add the desired
tab names to a specified array, and the "edit" elements must provide JSP code that handles
both the case where a tab is displayed, and where it is not displayed.  Also, it makes sense
to use the appropriate CSS definitions, so that the connector JSPs have a similar look-and-feel
to the rest of LCF's crawler UI.  We strongly suggest starting with the UI code of one of the
supplied connectors, both for a description of the arguments to each page, and for some decent
ideas of ways to organize your connector's UI code.
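
The tabbed model described above can be sketched in plain Java as follows.  This is only an
illustration of the idea: the class, the method names and the tab name are hypothetical, and
carrying the value in a hidden field when the tab is not displayed is an assumption about how
to keep it across a repost.  The arguments your JSPs actually receive are best learned from the
supplied connectors.

```java
import java.util.List;

// Hypothetical helper showing the "header" and "edit" roles; not part of LCF.
final class TabbedFormSketch {
    static final String TAB_NAME = "Server";

    // "header" role: register this connector's tab with the caller-supplied list of tabs.
    static void addTabs(List<String> tabs) {
        tabs.add(TAB_NAME);
    }

    // "edit" role: emit a visible input when our tab is selected, otherwise a hidden
    // input so the current value is still submitted when the form is posted.
    static String renderServerField(String selectedTab, String fieldName, String value) {
        String type = TAB_NAME.equals(selectedTab) ? "text" : "hidden";
        return "<input type=\"" + type + "\" name=\"" + fieldName
            + "\" value=\"" + escape(value) + "\"/>";
    }

    // Minimal HTML attribute escaping for the characters that matter here.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }
}
```

In a real connector the escaping would normally be delegated to the framework's own HTML
encoding support rather than hand-written as it is here.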

Please also note that it is good practice to name the form fields in your elements in such
a way that they cannot collide with form fields that may come from the framework's elements
or any specific repository connector's elements.  The "spec" elements especially may be prone
to collisions, because within any given job, "spec" elements from the chosen output connector
are called in the same page as "spec" elements for the chosen repository connector.
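
A simple way to avoid such collisions is to derive every form field name from a prefix that
only your connector uses.  The sketch below assumes a prefix built from the literal "ocspec_"
and the connector's symbolic name; both the helper class and that naming scheme are
hypothetical, so pick whatever convention fits your connector.

```java
// Hypothetical helper; the "ocspec_" prefix scheme is an illustrative assumption.
final class FieldNameSketch {
    private final String prefix;

    FieldNameSketch(String connectorSymbolicName) {
        this.prefix = "ocspec_" + connectorSymbolicName + "_";
    }

    // Build the full form field name for one of this connector's own fields.
    String field(String localName) {
        return prefix + localName;
    }

    // True when a posted field name was produced by this connector's elements.
    boolean owns(String fieldName) {
        return fieldName.startsWith(prefix);
    }
}
```

With this in place, a field the connector calls "collection" is posted as
"ocspec_solr_collection" for a connector whose symbolic name is "solr", and cannot be confused
with a repository connector's own "collection" field.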


h3. Implementation support provided by the framework

LCF's framework provides a number of helpful services designed to make the creation of a connector
easier.  These services are summarized below.  (This is not an exhaustive list, by any means.)

* Lock management and synchronization (see _org.apache.lcf.core.interfaces.LockManagerFactory_)
* Cache management (see _org.apache.lcf.core.interfaces.CacheManagerFactory_)
* Local keystore management (see _org.apache.lcf.core.KeystoreManagerFactory_)
* Database management (see _org.apache.lcf.core.DBInterfaceFactory_)

For JSP UI component support, these too are very useful:

* Multipart form processing (see _org.apache.lcf.ui.multipart.MultipartWrapper_)
* HTML encoding (see _org.apache.lcf.ui.util.Encoder_)
* HTML formatting (see _org.apache.lcf.ui.util.Formatter_)

h3. DOs and DON'Ts

It's always a good idea to make use of an existing infrastructure component, if it's meant
for that purpose, rather than inventing your own.  There are, however, some limitations we
recommend you adhere to.

* DO make use of infrastructure components described in the section above
* DON'T make use of infrastructure components that aren't mentioned, without checking first
* NEVER write connector code that directly uses framework database tables, other than the
ones installed and managed by your connector

If you are tempted to violate these rules, it may well mean you don't understand something
important.  At the very least, we'd like to know why.  Send email to connectors-dev@incubator.apache.org
with a description of your problem and how you are tempted to solve it.

