incubator-connectors-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Lucene Connector Framework > Lucene Connector Framework concepts
Date Sun, 21 Feb 2010 18:33:00 GMT
Space: Lucene Connector Framework (http://cwiki.apache.org/confluence/display/CONNECTORS)
Page: Lucene Connector Framework concepts (http://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connector+Framework+concepts)

Added by Karl Wright:
---------------------------------------------------------------------
Lucene Connector Framework is a crawler framework which is designed to meet several key goals.

 * It's reliable, and resilient against being shut down or restarted
 * It's incremental, meaning that jobs describe a set of documents by some criteria, and are
meant to be run again and again to pick up any differences
 * It supports connections to multiple kinds of repositories at the same time
 * It defines and fully supports a model of document security, so that each document listed
in a search result from the back-end search engine is one that the current user is allowed
to see
 * It operates with reasonable efficiency and throughput
 * Its memory usage characteristics are bounded and predictable in advance
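
The "incremental" goal above can be illustrated with a small sketch. It assumes, purely for illustration, that each run of a job records a version string per document URI, and that comparing two runs yields the documents to add, re-index, or delete. The class name {{IncrementalDiff}} and the version-string representation are hypothetical and are not taken from LCF itself.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not an LCF class: compares two crawl runs.
// Assumption: each run is a map from document URI to a non-null version string.
final class IncrementalDiff {
    final Set<String> added = new TreeSet<>();
    final Set<String> changed = new TreeSet<>();
    final Set<String> deleted = new TreeSet<>();

    static IncrementalDiff between(Map<String, String> previous,
                                   Map<String, String> current) {
        IncrementalDiff diff = new IncrementalDiff();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String oldVersion = previous.get(entry.getKey());
            if (oldVersion == null) {
                // Seen for the first time in this run.
                diff.added.add(entry.getKey());
            } else if (!oldVersion.equals(entry.getValue())) {
                // Seen before, but the version string differs.
                diff.changed.add(entry.getKey());
            }
        }
        for (String uri : previous.keySet()) {
            if (!current.containsKey(uri)) {
                // Present in the previous run, missing now.
                diff.deleted.add(uri);
            }
        }
        return diff;
    }
}
```

Running the same job again with unchanged inputs would produce three empty sets under this sketch, which is the property that makes repeated runs cheap.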

LCF meets many of its architectural goals by being implemented on top of a relational database.
The current implementation requires PostgreSQL, which is by far the richest open-source database
available.

h1. Lucene Connector Framework document model

Each document in LCF consists of some opaque binary data, plus some opaque associated metadata
(which is described by name-value pairs), and is uniquely addressed by a URI.  The back-end
search engines that LCF communicates with are all expected to support this model to a greater
or lesser degree.

Documents may also have access tokens associated with them.  These access tokens are described
more fully in the next section.
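
As a minimal sketch of the document model described above, the following Java class holds the four pieces the text names: a URI, opaque binary content, name-value metadata, and access tokens. The class name {{DocumentSketch}} is hypothetical. Representing metadata as one string value per name and access tokens as opaque strings are assumptions made for illustration, not details confirmed by this page.

```java
import java.net.URI;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not an LCF class.
final class DocumentSketch {
    private final URI uri;                      // unique address of the document
    private final byte[] content;               // opaque binary data
    private final Map<String, String> metadata; // assumption: one value per name
    private final Set<String> accessTokens;     // assumption: tokens are opaque strings

    DocumentSketch(URI uri, byte[] content,
                   Map<String, String> metadata, Set<String> accessTokens) {
        this.uri = uri;
        // Defensive copies keep the document immutable after construction.
        this.content = content.clone();
        this.metadata = Collections.unmodifiableMap(new LinkedHashMap<>(metadata));
        this.accessTokens = Collections.unmodifiableSet(new LinkedHashSet<>(accessTokens));
    }

    URI getUri() { return uri; }

    byte[] getContent() { return content.clone(); }

    Map<String, String> getMetadata() { return metadata; }

    Set<String> getAccessTokens() { return accessTokens; }
}
```

The content is kept as raw bytes with no interpretation, which matches the "opaque" wording: the framework moves the data along without needing to understand it.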

h1. Lucene Connector Framework security model

h2. Access tokens

to be continued

h1. Lucene Connector Framework conceptual entities

h2. Connectors

LCF defines three different kinds of connectors.  These are:

 * Authority connectors
 * Repository connectors
 * Output connectors

All connectors share certain characteristics.  First, they are pooled.  This means that LCF
keeps configured and connected instances of a connector around for a while, and can cap the
total number of such instances at some upper limit.  Connector implementations provide specific
methods for managing their lifecycle within the pools that LCF keeps them in.  Second, they are
configurable.  The configuration description for a connector is an XML document, whose precise
format is determined by the connector implementation.  A configured connector instance is
called a _connection_, by common LCF convention.
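
The pooling behaviour described above might be sketched as follows. Every name here ({{PoolableConnector}}, {{BoundedConnectorPool}}, {{DemoConnector}}, and the method names) is hypothetical and is not the LCF API. The sketch assumes that the configuration XML is handed to each instance as a string when it is created, and it returns an empty result at the limit rather than blocking, which a real pool might well do instead.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical lifecycle methods a pooled connector could expose.
interface PoolableConnector {
    void connect(String configurationXml);
    void disconnect();
}

// Hypothetical sketch of a pool that caps the number of live instances.
final class BoundedConnectorPool<T extends PoolableConnector> {
    private final int maxInstances;
    private final String configurationXml;
    private final Supplier<T> factory;
    private final Deque<T> idle = new ArrayDeque<>();
    private int liveCount = 0;

    BoundedConnectorPool(int maxInstances, String configurationXml, Supplier<T> factory) {
        if (maxInstances < 1) {
            throw new IllegalArgumentException("maxInstances must be at least 1");
        }
        this.maxInstances = maxInstances;
        this.configurationXml = configurationXml;
        this.factory = factory;
    }

    // Reuses an idle instance if there is one, creates a new one if the
    // limit allows, and otherwise returns an empty Optional.
    synchronized Optional<T> tryAcquire() {
        if (!idle.isEmpty()) {
            return Optional.of(idle.pop());
        }
        if (liveCount >= maxInstances) {
            return Optional.empty();
        }
        T connector = factory.get();
        connector.connect(configurationXml);
        liveCount++;
        return Optional.of(connector);
    }

    // Returns an instance to the pool, still configured and connected.
    synchronized void release(T connector) {
        idle.push(connector);
    }

    // Disconnects and discards every idle instance.
    synchronized void closeIdle() {
        while (!idle.isEmpty()) {
            idle.pop().disconnect();
            liveCount--;
        }
    }

    synchronized int liveCount() {
        return liveCount;
    }
}

// Trivial connector used only to exercise the pool.
final class DemoConnector implements PoolableConnector {
    boolean connected;
    String configurationXml;

    public void connect(String configurationXml) {
        this.configurationXml = configurationXml;
        this.connected = true;
    }

    public void disconnect() {
        this.connected = false;
    }
}
```

In this sketch, a released instance stays connected and is handed back on the next {{tryAcquire()}} call, so the cost of configuring and connecting is paid once per instance rather than once per use.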

The function of each type of connector is described below.

|| Connector type || Function ||
| Authority connector | Furnishes a standard way of mapping a user name to access tokens that are meaningful for a given type of repository |
| Repository connector | Fetches documents from a specific kind of repository, such as SharePoint or the web |
| Output connector | Pushes document ingestion requests and deletion requests to a specific kind of back-end search engine or other entity, such as Lucene |
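
The three roles in the table can be sketched as Java interfaces with toy in-memory implementations. All names below are hypothetical and are not the LCF connector interfaces. The rule in {{CrawlStep.sync}} (ingest if the repository still has the document, delete otherwise) is an assumption made to show how a repository connector and an output connector could fit together.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical: maps a user name to access tokens.
interface AuthoritySketch {
    Set<String> accessTokensFor(String userName);
}

// Hypothetical: fetches a document, or returns null if it no longer exists.
interface RepositorySketch {
    byte[] fetch(String documentUri);
}

// Hypothetical: receives ingestion and deletion requests.
interface OutputSketch {
    void ingest(String documentUri, byte[] content);
    void delete(String documentUri);
}

final class StaticAuthority implements AuthoritySketch {
    private final Map<String, Set<String>> tokensByUser;

    StaticAuthority(Map<String, Set<String>> tokensByUser) {
        this.tokensByUser = new HashMap<>(tokensByUser);
    }

    public Set<String> accessTokensFor(String userName) {
        // Unknown users get no tokens.
        return tokensByUser.getOrDefault(userName, Collections.<String>emptySet());
    }
}

final class InMemoryRepository implements RepositorySketch {
    private final Map<String, byte[]> documents = new HashMap<>();

    void put(String documentUri, String text) {
        documents.put(documentUri, text.getBytes(StandardCharsets.UTF_8));
    }

    void remove(String documentUri) {
        documents.remove(documentUri);
    }

    public byte[] fetch(String documentUri) {
        return documents.get(documentUri);
    }
}

final class InMemoryIndex implements OutputSketch {
    final Map<String, byte[]> indexed = new HashMap<>();

    public void ingest(String documentUri, byte[] content) {
        indexed.put(documentUri, content);
    }

    public void delete(String documentUri) {
        indexed.remove(documentUri);
    }
}

final class CrawlStep {
    // Assumed rule: push an ingestion request if the document exists,
    // and a deletion request if it has disappeared from the repository.
    static void sync(String documentUri, RepositorySketch repository, OutputSketch output) {
        byte[] content = repository.fetch(documentUri);
        if (content == null) {
            output.delete(documentUri);
        } else {
            output.ingest(documentUri, content);
        }
    }
}
```

Keeping the three roles behind separate interfaces is what lets one crawl combine, for example, a SharePoint repository with a Lucene output without either side knowing about the other.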

h2. Connections

to be continued

h2. Jobs

to be continued





