incubator-connectors-commits mailing list archives

Subject [CONF] Apache Connectors Framework > ManifoldCF concepts
Date Mon, 04 Oct 2010 18:36:00 GMT
Space: Apache Connectors Framework
Page: ManifoldCF concepts

Edited by Karl Wright:
ManifoldCF is a crawler framework which is designed to meet several key goals.

 * It's reliable, and resilient against being shut down or restarted
 * It's incremental, meaning that jobs describe a set of documents by some criteria, and are
meant to be run again and again to pick up any differences
 * It supports connections to multiple kinds of repositories at the same time
 * It defines and fully supports a model of document security, so that each document listed
in a search result from the back-end search engine is one that the current user is allowed
to see
 * It operates with reasonable efficiency and throughput
 * Its memory usage characteristics are bounded and predictable in advance

ManifoldCF meets many of its architectural goals by being implemented on top of a relational
database.  The current implementation runs on either PostgreSQL or the included Derby database.
Longer term, we may support other database bindings.

h1. ManifoldCF document model

Each document in ManifoldCF consists of some opaque binary data, plus some opaque associated
metadata (which is described by name-value pairs), and is uniquely addressed by a URI.  The
back-end search engines which ManifoldCF communicates with are all expected to support, to
a greater or lesser degree, this model.

Documents may also have access tokens associated with them.  These access tokens are described
more fully in the next section.
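
The document model above can be pictured as a small record type.  The following is only an
illustrative sketch in Python, not ManifoldCF code: the class name {{Document}} and all of its
field names are assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a crawled document; not a ManifoldCF class."""

    uri: str                                   # unique address of the document
    content: bytes                             # opaque binary data, never interpreted here
    metadata: dict[str, str] = field(default_factory=dict)   # opaque name-value pairs
    access_tokens: set[str] = field(default_factory=set)     # optional; see the security model


# Example values are invented for illustration.
doc = Document(
    uri="file://server/share/report.pdf",
    content=b"%PDF-1.4 ...",
    metadata={"author": "kwright"},
)
```

Nothing in the record interprets the content or the metadata, which is what lets very
different back-end search engines accept the same shape of document.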

h1. ManifoldCF security model

The ManifoldCF security model is based loosely on the standard authorization concepts and
hierarchies found in Microsoft's Active Directory.  Active Directory is quite common in the
kinds of environments where data repositories exist that are ripe for indexing.  Active Directory's
authorization model is also easily used in a general way to represent authorization for a
huge variety of third-party content repositories.

ManifoldCF defines a concept of an _access token_.  An access token, to ManifoldCF, is a string
which is meaningful only to a specific connector or connectors.  This string describes the
ability of a user to view (or not view) some set of documents.  For documents protected by
Active Directory itself, an access token would be an Active Directory SID (e.g. "S-1-23-4-1-45").
But for documents protected by Livelink, for example, a wholly different string would be
used.

In the ManifoldCF security model, it is the job of an _authority_ to provide a list of access
tokens for a given searching user.  Multiple authorities cooperate in that each one can add
to the list of access tokens describing a given user's security.  The resulting access tokens
are handed to the search engine as part of every search request, so that the search engine
may properly exclude documents that the user is not allowed to see.
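
The cooperation between authorities can be sketched as follows.  This is a minimal Python
illustration, assuming an authority can be modelled as a function from a user name to a list
of tokens; the function names, the lookup tables, and the {{livelink:...}} token format are
all invented for the example and do not come from ManifoldCF.

```python
from typing import Callable, Iterable

# Assumption: an authority is modelled as a function from user name to tokens.
Authority = Callable[[str], Iterable[str]]


def active_directory_authority(user: str) -> list[str]:
    # Hypothetical lookup table; a real authority would query Active Directory.
    sids = {"alice": ["S-1-23-4-1-45"]}
    return sids.get(user, [])


def livelink_authority(user: str) -> list[str]:
    # Made-up token format, used only to show that each authority has its own.
    tokens = {"alice": ["livelink:1001"], "bob": ["livelink:1002"]}
    return tokens.get(user, [])


def tokens_for_user(user: str, authorities: Iterable[Authority]) -> set[str]:
    """Collect the tokens that every authority contributes for this user."""
    result: set[str] = set()
    for authority in authorities:
        result.update(authority(user))
    return result


all_authorities = [active_directory_authority, livelink_authority]
alice_tokens = tokens_for_user("alice", all_authorities)
```

The combined set is what would accompany the user's search request.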

When document indexing is done, therefore, it is the job of the crawler to hand access tokens
to the search engine, so that it may categorize the documents properly according to their
accessibility.  Note that the access tokens so provided are meaningful only within the space
of the governing authority.  Access tokens can be provided as "grant" tokens, or as "deny"
tokens.  Finally, there are multiple levels of tokens, which correspond to Active Directory's
concepts of "share" security, "directory" security, or "file" security.  (The latter concepts
are rarely used except for documents that come from Windows or Samba systems.)
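
One plausible way to read "grant" and "deny" tokens is sketched below in Python.  The rule
used here (any matching deny token hides the document, otherwise at least one matching grant
token is required) is an assumption for illustration, and the sketch deliberately ignores the
separate share, directory, and file levels.  The function name {{is_visible}} is hypothetical.

```python
def is_visible(user_tokens: set[str], grant: set[str], deny: set[str]) -> bool:
    """Decide visibility at a single security level (assumed semantics)."""
    if user_tokens & deny:
        # Assumption: a matching deny token always wins over any grant.
        return False
    # Assumption: without a matching grant token the document stays hidden.
    return bool(user_tokens & grant)


user_tokens = {"S-1-23-4-1-45"}
allowed = is_visible(user_tokens, grant={"S-1-23-4-1-45"}, deny=set())
blocked = is_visible(user_tokens, grant={"S-1-23-4-1-45"}, deny={"S-1-23-4-1-45"})
```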

Once all these documents and their access tokens are handed to the search engine, it is the
search engine's job to enforce security by excluding inappropriate documents from the search
results.  For Solr 1.5, this infrastructure has been submitted in JIRA ticket SOLR-1895,
where you can download a SearchComponent
plug-in and simple instructions for setting up your copy of Solr to enforce ManifoldCF's model
of document security.  Bear in mind that this plug-in is still not a complete solution, as
it requires an authenticated user name to be passed to it from some upstream source, possibly
a JAAS authenticator within an application server framework.

h1. ManifoldCF conceptual entities

h2. Connectors

ManifoldCF defines three different kinds of connectors.  These are:

 * Authority connectors
 * Repository connectors
 * Output connectors

All connectors share certain characteristics.  First, they are pooled.  This means that ManifoldCF
keeps configured and connected instances of a connector around for a while, and has the ability
to limit the total number of such instances to within some upper limit.  Connector implementations
have specific methods in them for managing their existence in the pools that ManifoldCF keeps
them in.  Second, they are configurable.  The configuration description for a connector is
an XML document, whose precise format is determined by the connector implementation.  A configured
connector instance is called a _connection_, by common ManifoldCF convention.
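
The pooling idea can be sketched in a few lines of Python.  This is not how ManifoldCF
implements its pools; the class {{ConnectorPool}}, its {{borrow}} method, and the {{created}}
counter are hypothetical names, and the sketch leaves out expiry of idle instances.

```python
import queue
import threading
from contextlib import contextmanager


class ConnectorPool:
    """Hypothetical bounded pool that reuses idle connector instances."""

    def __init__(self, factory, max_instances: int):
        self._factory = factory                  # creates a configured instance
        self._idle = queue.LifoQueue()           # instances waiting to be reused
        self._slots = threading.BoundedSemaphore(max_instances)
        self._lock = threading.Lock()
        self.created = 0                         # how many instances were built

    @contextmanager
    def borrow(self):
        self._slots.acquire()                    # blocks while the limit is reached
        try:
            try:
                instance = self._idle.get_nowait()   # prefer an existing instance
            except queue.Empty:
                instance = self._factory()
                with self._lock:
                    self.created += 1
            yield instance
            self._idle.put(instance)             # only reached if the caller did not raise
        finally:
            self._slots.release()


pool = ConnectorPool(factory=lambda: object(), max_instances=2)
with pool.borrow() as first:
    pass
with pool.borrow() as second:
    pass  # the same instance is handed out again
```

An instance whose caller raised an exception is dropped rather than returned to the pool,
which is one simple way to avoid reusing a connection in an unknown state.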

The function of each type of connector is described below.

|| Connector type || Function ||
| Authority connector | Furnishes a standard way of mapping a user name to access tokens that
are meaningful for a given type of repository |
| Repository connector | Fetches documents from a specific kind of repository, such as SharePoint
or the web |
| Output connector | Pushes document ingestion requests and deletion requests to a specific
kind of back end search engine or other entity, such as Lucene |

h2. Connections

As described above, a _connection_ is a connector implementation plus connector-specific configuration
information.  A user can define connections of all three types in the crawler UI.

The kind of information included in the configuration data for a connector typically describes
the "how", as opposed to the "what".  For example, you'd configure a Livelink connection by
specifying how to talk to the Livelink server.  You would *not* include information about
which documents to select in such a configuration.

There is one difference between how you define a _repository connection_, vs. how you would
define an _authority connection_ or _output connection_.  The difference is that you must
specify a governing authority connection for your repository connection.  This is because
*all* documents ingested by ManifoldCF need to include appropriate access tokens, and those
access tokens are specific to the governing authority.
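
The relationship between the three kinds of connection can be sketched as plain records.
This Python sketch is illustrative only: the class names, field names, connector names, and
the XML element names are all assumptions, since the real configuration format is determined
by each connector implementation.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class AuthorityConnection:
    name: str
    connector: str        # which connector implementation to use
    config_xml: str       # connector-specific configuration (the "how")


@dataclass
class OutputConnection:
    name: str
    connector: str
    config_xml: str


@dataclass
class RepositoryConnection:
    name: str
    connector: str
    config_xml: str
    authority: AuthorityConnection   # required: the governing authority


# Hypothetical configuration values, invented for this example.
ad = AuthorityConnection("ad", "ActiveDirectoryAuthority", "<config/>")
repo = RepositoryConnection(
    name="livelink-main",
    connector="LivelinkConnector",
    config_xml="<config><server>livelink.example.com</server></config>",
    authority=ad,
)

# The configuration says how to reach the server, not which documents to fetch.
server = ET.fromstring(repo.config_xml).findtext("server")
```

Because {{authority}} has no default value, a repository connection cannot be constructed
without naming its governing authority, which mirrors the rule described above.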

h2. Jobs

A _job_ in ManifoldCF parlance is a description of some kind of synchronization that needs
to occur between a specified repository connection and a specified output connection.  A job
includes the following:

 * A verbal description
 * A repository connection (and thus implicitly an authority connection as well)
 * An output connection
 * A repository-connection-specific description of "what" documents and metadata the job applies to
 * A model for crawling: either "run to completion", or "run continuously"
 * A schedule for when the job will run: either within specified time windows, or on demand

Jobs are allowed to share the same repository connection, and thus they can overlap in the
set of documents they describe.  ManifoldCF permits this situation, although when it occurs
it is probably an accident.
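
The parts of a job, and a check for jobs that share a repository connection, can be sketched
as follows.  This is an illustrative Python sketch, not ManifoldCF's data model: {{Job}},
{{CrawlModel}}, {{jobs_sharing_repository}}, and all field names and sample values are
hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class CrawlModel(Enum):
    RUN_TO_COMPLETION = "run to completion"
    RUN_CONTINUOUSLY = "run continuously"


@dataclass
class Job:
    description: str
    repository_connection: str      # name of the repository connection
    output_connection: str          # name of the output connection
    document_spec: str              # repository-specific "what" to crawl
    model: CrawlModel = CrawlModel.RUN_TO_COMPLETION
    schedule: list[str] = field(default_factory=list)   # empty means on demand


def jobs_sharing_repository(jobs: list[Job]) -> dict[str, list[str]]:
    """Return repository connections used by more than one job."""
    by_repo: dict[str, list[str]] = defaultdict(list)
    for job in jobs:
        by_repo[job.repository_connection].append(job.description)
    return {repo: names for repo, names in by_repo.items() if len(names) > 1}


jobs = [
    Job("HR documents", "livelink-main", "solr", "folder=/HR"),
    Job("All documents", "livelink-main", "solr", "folder=/"),
    Job("Intranet crawl", "web", "solr", "seed=http://intranet.example.com/"),
]
shared = jobs_sharing_repository(jobs)
```

Sharing a repository connection does not prove that two jobs overlap, but it is the
precondition for overlap, so a report like this is a cheap way to spot likely accidents.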
