h1. Building LCF

To build Lucene Connector Framework, and the particular connectors you are interested in,
you currently need to do the following:

# Check out the LCF source tree.
# cd to "modules".
# Install desired dependent LGPL and proprietary libraries, wsdls, and xsds.  See the connector-specific sections below for details.
# Run ant.  (A command-line sketch of these steps appears below.)
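
The sketch referred to above (the repository URL is not given on this page, so a placeholder is used):

{code}
# Command-line sketch of the build steps above.
svn checkout <LCF_SVN_URL> lcf    # substitute the LCF subversion URL
cd lcf/modules
# Copy any desired LGPL/proprietary jars, wsdls, and xsds into place (see the
# connector-specific sections below), then build:
ant
{code}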

If you supply *no* LGPL or proprietary libraries, the framework itself and only the following
repository connectors will be built:

* Filesystem connector
* JDBC connector, with just the postgresql jdbc driver
* RSS connector
* Webcrawler connector

In addition, the following output connectors will be built:

* MetaCarta GTS output connector
* Lucene SOLR output connector
* Null output connector

The LGPL and proprietary connector dependencies are described in separate sections below.

The output of the ant build is produced in the _modules/dist_ directory, which is further
broken down by process.  The number of produced process directories may vary, because optional
individual connectors do sometimes supply processes that must be run to support the connector.
 See the table below for a description of the _modules/dist_ folder.

|| _modules/dist_ directory || Meaning ||
| _tomcat_ | Web applications that should be deployed on tomcat, plus recommended tomcat -D switch names and values |
| _processes_ | classpath jars that should be included in the class path for all non-connector-specific processes, along with -D switches, using the same convention as described for tomcat, above |
| _wsdd_ | wsdd files that are needed by the included connectors in order to function |
| _xxx-process_ | classpath jars and -D switches needed for a required connector-specific process |

In all of the _dist_ directories above, required -D switches are represented by files that are
named for the switches, with the desired value of each switch stored as the file's contents.
For example, a file named for a given switch, containing the text "hello", corresponds during
deployment to a java switch of the form "-D<switch_name>=hello".
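
To illustrate the convention (the switch name and value here are examples only, not required LCF switches):

{code}
# Suppose the modules/dist/processes directory contains a switch file like this:
cat modules/dist/processes/org.apache.lcf.configfile
#   /home/lcf/properties.ini
# The corresponding java switch during deployment would then be:
#   -Dorg.apache.lcf.configfile=/home/lcf/properties.ini
{code}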

When you are constructing the appropriate classpath for your LCF processes, it is important
to remember that "more" is not necessarily "better".  The process deployment strategy implied
by the build structure has been carefully thought out to avoid jar conflicts.  Indeed, several
connectors are structured using multiple processes precisely for that reason.

h2. Building the Documentum connector

The Documentum connector requires EMC's DFC product in order to be built.  Install DFC on
the build system, and locate the jars it installs.  You will need to copy at least dfc.jar,
dfcbase.jar, and dctm.jar into the directory "modules/connectors/documentum/dfc".
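
As an illustration (the DFC installation path is an assumption, and the same copy-the-jars pattern applies to the other proprietary-library connectors described below):

{code}
# Copy the DFC jars from a local DFC installation into the connector's library directory.
# The source path is an example only.
cp /opt/documentum/shared/dfc.jar \
   /opt/documentum/shared/dfcbase.jar \
   /opt/documentum/shared/dctm.jar \
   modules/connectors/documentum/dfc/
{code}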

h2. Building the FileNet connector

The FileNet connector requires IBM's FileNet P8 API jar in order to be built.  Install the
FileNet P8 API on the build system, and copy at least "Jace.jar" from that installation into
the FileNet connector's library directory under "modules/connectors/filenet".
h2. Building the JDBC connector, including Oracle, SQLServer, or Sybase JDBC drivers

The JDBC connector also knows how to work with Oracle, SQLServer, and Sybase JDBC drivers.
 For Oracle, download the appropriate Oracle JDBC jar from the Oracle site, and copy it into
the directory "modules/connectors/jdbc/jdbc-drivers".  For SQLServer and Sybase, download
jtds.jar, and copy it into the same directory.

h2. Building the jCIFS connector

To build this connector, you need to download jcifs.jar from the jCIFS project site, and copy
it into the "modules/connectors/jcifs/jcifs" directory.

h2. Building the LiveLink connector

This connector needs LAPI, which is a proprietary java library that allows access to OpenText's
LiveLink server.  Copy the lapi.jar into the "modules/connectors/livelink/lapi" directory.

h2. Building the Memex connector

This connector needs the Memex API jar, usually called JavaMXIELIB.jar.  Copy this jar into
the "modules/connectors/memex/mxie-java" directory.

h2. Building the Meridio connector

The Meridio connector needs wsdls and xsds downloaded from an installed Meridio instance using
*disco.exe*, which is installed as part of Microsoft Visual Studio, typically under "c:\Program
Files\Microsoft SDKs\Windows\V6.x\bin".  Obtain the preliminary wsdls and xsds by interrogating
the following Meridio web services:

 * http\[s\]://<meridio_server>/DMWS/MeridioDMWS.asmx
 * http\[s\]://<meridio_server>/RMWS/MeridioRMWS.asmx

You should have obtained the following files in this step:

 * MeridioDMWS.wsdl
 * MeridioRMWS.wsdl
 * DMDataSet.xsd
 * RMDataSet.xsd
 * RMClassificationDataSet.xsd

Next, patch these using Microsoft's *xmldiffpatch* utility suite, which can be downloaded for
Windows from Microsoft.  The appropriate diff files to apply as patches can be found in
"modules/connectors/meridio/upstream-diffs".  After patching, rename the files so that you have:

 * MeridioDMWS_axis.wsdl
 * MeridioRMWS_axis.wsdl
 * DMDataSet_castor.xsd
 * RMDataSet_castor.xsd
 * RMClassificationDataSet_castor.xsd

Finally, copy all of these to: "modules/connectors/meridio/wsdls".

h2. Building the SharePoint connector

In order to build this connector, you need to download wsdls from an installed SharePoint
instance.  The wsdls in question are:

 * Permissions.wsdl
 * Lists.wsdl
 * Dspsts.wsdl
 * usergroup.wsdl
 * versions.wsdl
 * webs.wsdl

To download a wsdl, use Microsoft's *disco.exe* tool, which is part of Visual Studio, typically
under "c:\Program Files\Microsoft SDKs\Windows\V6.x\bin".  You'd want to interrogate the following

 * http\[s\]://<server_name>/_vti_bin/Permissions.asmx
 * http\[s\]://<server_name>/_vti_bin/Lists.asmx
 * http\[s\]://<server_name>/_vti_bin/Dspsts.asmx
 * http\[s\]://<server_name>/_vti_bin/usergroup.asmx
 * http\[s\]://<server_name>/_vti_bin/versions.asmx
 * http\[s\]://<server_name>/_vti_bin/webs.asmx

When the wsdl files have been downloaded, copy them to: "modules/connectors/sharepoint/wsdls".

h1. Running Lucene Connector Framework

The core part of Lucene Connector Framework consists of several pieces.  These basic pieces
are enumerated below:

 * A Postgresql database, which is where LCF keeps all of its configuration and state information
 * A synchronization directory, which is how LCF coordinates activity among its various processes
 * An *agents* process, which is the process that actually crawls documents and ingests them
 * A *crawler-ui* web application, which presents the UI users interact with to configure
and control the crawler
 * An *authority-service* web application, which responds to requests for authorization tokens,
given a user name

In addition, there are a number of java classes in Lucene Connector Framework that are intended
to be called directly, to perform specific actions in the environment or in the database.
 These classes are usually invoked from the command line, with appropriate arguments supplied,
and are thus considered to be LCF *commands*.  The basic functionality supplied by these command
classes is as follows:

 * Create/Destroy the LCF database instance
 * Start/Stop the *agents* process
 * Register/Unregister an agent class (there's currently only one included)
 * Register/Unregister an output connector
 * Register/Unregister a repository connector
 * Register/Unregister an authority connector
 * Clean up synchronization directory garbage resulting from an ungraceful interruption of
an LCF process
 * Query for certain kinds of job-related information

Individual connectors may contribute additional command classes and processes to this picture.
 A properly built connector typically consists of:

 * One or more jar files meant to be included in the *agents* process and command invocation
 * One or more "iar" incremental war files, which are meant to be unpacked on top of the *lcf-crawler-ui*
or *lcf-authority-service* web applications
 * Possibly some java commands, which are meant to support or configure the connector in some way
 * Possibly a connector-specific process or two, each requiring a distinct classpath, which
usually serves to isolate the *crawler-ui* web application, *authority service* web application,
*agents* process, and any commands from problematic aspects of the client environment
 * A recommended set of java "define" variables, which should be used consistently with all
involved processes, e.g. the *agents* process, the application server running the *authority-service*
and *crawler-ui*, and any commands.

An individual connector package will typically supply an output connector, or a repository
connector, or both a repository connector and an authority connector.  The ant build script
under _modules_ automatically assembles each individual connector's contribution into the
overall package.

h2. Configuring the Postgresql database

Despite having an internal architecture that cleanly abstracts from specific database details,
Lucene Connector Framework is currently fairly specific to Postgresql.  There are a number of
reasons for this.

 # Lucene Connector Framework uses the database for its document queue, which places a significant
load on it.  The back-end database is thus a significant factor in LCF's performance.  But,
in exchange, LCF benefits enormously from the underlying ACID properties of the database.
 # The syntax abstraction is not perfect.  Some details, such as how regular expressions are
handled, have not been abstracted sufficiently at the time of this writing.
 # The strategy for getting optimal query plans from the database is not abstracted.  For
example, Postgresql 8.3+ is very sensitive to certain statistics about a database table, and
will not generate a performant plan if the statistics are inaccurate by even a little, in
some cases.  So, for Postgresql, the database table must be analyzed very frequently, to avoid
catastrophically bad plans.  But luckily, Postgresql is pretty good at doing analysis quickly.
 Oracle, on the other hand, takes a very long time to perform analysis, but its plans are
much less sensitive.
 # Postgresql always does a sequential scan in order to count the number of rows in a table,
while other databases return this efficiently.  This has affected the design of the LCF UI.
 # The choice of query form influences the query plan.  Ideally, this is not true, but for
both Postgresql and for (say) Oracle, it is.
 # Postgresql offers a high degree of parallelism, with little internal single-threading.

Lucene Connector Framework has been tested against Postgresql 8.3.7.  We recommend the following
configuration parameter settings to work optimally with LCF (a command-line setup sketch follows
the table below):

 * A default database encoding of UTF-8
 * _postgresql.conf_ settings as described in the table below
 * _pg_hba.conf_ settings to allow password access for TCP/IP connections from Lucene Connector Framework
 * A maintenance strategy involving cronjob-style vacuuming, rather than Postgresql autovacuum

|| _postgresql.conf_ parameter || Tested value ||
| shared_buffers | 1024MB |
| checkpoint_segments | 300 |
| maintenance_work_mem | 2MB |
| tcpip_socket | true |
| max_connections | 400 |
| checkpoint_timeout | 900 |
| datestyle | ISO,European |
| autovacuum | off |
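
For reference, here is a hedged command-line sketch of initializing an instance along these lines; the data directory, user name, and pg_hba.conf record are illustrative assumptions rather than LCF requirements:

{code}
# Initialize a cluster whose default encoding is UTF-8 (data directory is an example).
initdb -D /var/lib/pgsql/data --encoding=UTF8
# Apply the postgresql.conf settings from the table above by editing
# /var/lib/pgsql/data/postgresql.conf, then add a pg_hba.conf record that permits
# password (md5) access over TCP/IP for the LCF database user, e.g.:
#   host    all    lcf    md5
{code}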

h3. A note about maintenance

Postgresql's architecture causes it to accumulate dead tuples in its data files, which do
not interfere with its performance but do bloat the database over time.  The usage pattern
of LCF is such that it can cause significant bloat to occur to the underlying Postgresql database
in only a few days, under sufficient load.  Postgresql has a feature to address this bloat,
called *vacuuming*.  This comes in three varieties: autovacuum, manual vacuum, and manual
full vacuum.

We have found that Postgresql's autovacuum feature is inadequate under such conditions, because
it not only fights for database resources pretty much all the time, but it falls further and
further behind as well.  Postgresql's in-place manual vacuum functionality is a bit better,
but is still much, much slower than actually making a new copy of the database files, which
is what happens when a manual full vacuum is performed.

Dead-tuple bloat also occurs in indexes in Postgresql, so tables that have had a lot of activity
may benefit from being reindexed at the time of maintenance.   
We therefore recommend periodic, scheduled maintenance operations instead, consisting of the
following commands:

 * VACUUM FULL;
 * REINDEX DATABASE <the_db_name>;

During maintenance, Postgresql locks tables one at a time.  Nevertheless, the crawler UI may
become unresponsive for some operations, such as when counting outstanding documents on the
job status page.  LCF thus has the ability to check for the existence of a file prior to such
sensitive operations, and will display a useful "maintenance in progress" message if that
file is found.  This allows a user to set up a maintenance system that provides adequate feedback
for an LCF user of the overall status of the system.
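
A minimal sketch of such a maintenance script, assuming psql access as a suitably privileged user; the database name, flag-file path, and schedule are placeholders (the actual flag file LCF looks for is not described on this page):

{code}
#!/bin/sh
# Hypothetical cron-style maintenance script.
touch /path/to/maintenance-flag-file        # signals "maintenance in progress" to the crawler UI
psql -U postgres -d <the_db_name> -c "VACUUM FULL;"
psql -U postgres -d <the_db_name> -c "REINDEX DATABASE <the_db_name>;"
rm /path/to/maintenance-flag-file
{code}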

h2. The LCF configuration file

Currently, LCF requires two configuration files: the property file, and the logging configuration file.

The property file path can be specified by the system property "org.apache.lcf.configfile".
 If not specified through a -D option, its name is presumed to be _<user_home>/lcf/properties.ini_.

The configuration file allows several properties to be specified.  One of the optional properties
is the name of the logging configuration file.  This property's name is "org.apache.lcf.logconfigfile".
 If not present, the logging configuration file will be assumed to be _<user_home>/lcf/logging.ini_.
 The logging configuration file is a standard commons-logging property file, and should be
formatted accordingly.

The following table describes the configuration file properties and what they do (an example file follows the table):

|| Property || Required? || Function ||
| org.apache.lcf.synchdirectory | Yes | Specifies the path of a synchronization directory.  All LCF process owners *must* have read/write privileges to this directory. |
| org.apache.lcf.database.maxhandles | No | Specifies the maximum number of database connection handles that will be pooled.  Recommended value is 200. |
| org.apache.lcf.database.handletimeout | No | Specifies the maximum time a handle is to live before it is presumed dead.  Recommend a value of 604800, which is the maximum allowable. |
| org.apache.lcf.logconfigfile | No | Specifies location of the logging configuration file. |
| org.apache.lcf.database.name | No | Describes the database name for LCF; defaults to "dbname" if not specified. |
| org.apache.lcf.database.username | No | Describes the database user name for LCF; defaults to "lcf" if not specified. |
| org.apache.lcf.database.password | No | Describes the database user's password for LCF; defaults to "local_pg_password" if not specified. |
| com.metacarta.crawler.threads | No | Number of crawler worker threads created.  Suggest a value of 30. |
| com.metacarta.crawler.deletethreads | No | Number of crawler delete threads created.  Suggest a value of 10. |
| com.metacarta.misc | No | Miscellaneous debugging output.  Legal values are INFO, WARN, or DEBUG. |
| com.metacarta.db | No | Database debugging output.  Legal values are INFO, WARN, or DEBUG. |
| com.metacarta.lock | No | Lock management debugging output.  Legal values are INFO, WARN, or DEBUG. |
| com.metacarta.cache | No | Cache management debugging output.  Legal values are INFO, WARN, or DEBUG. |
| com.metacarta.agents | No | Agent management debugging output.  Legal values are INFO, WARN, or DEBUG. |
| com.metacarta.perf | No | Performance logging debugging output.  Legal values are INFO, WARN, or DEBUG. |
| com.metacarta.crawlerthreads | No | Log crawler thread activity.  Legal values are INFO, WARN, or DEBUG. |
| com.metacarta.hopcount | No | Log hopcount tracking activity.  Legal values are INFO, WARN, or DEBUG. |
| | No | Log job activity.  Legal values are INFO, WARN, or DEBUG. |
| com.metacarta.connectors | No | Log connector activity.  Legal values are INFO, WARN, or DEBUG. |
| com.metacarta.scheduling | No | Log document scheduling activity.  Legal values are INFO, WARN, or DEBUG. |
| com.metacarta.authorityconnectors | No | Log authority connector activity.  Legal values are INFO, WARN, or DEBUG. |
| com.metacarta.authorityservice | No | Log authority service activity.  Legal values are INFO, WARN, or DEBUG. |
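
By way of illustration, here is a hypothetical minimal property file, written from the shell; the simple name=value syntax is an assumption, and every value is either illustrative or taken from the suggestions in the table above:

{code}
# Write a minimal properties.ini (sketch only; values illustrative).
mkdir -p $HOME/lcf
cat > $HOME/lcf/properties.ini <<'EOF'
org.apache.lcf.synchdirectory=/home/lcf/synch
org.apache.lcf.database.maxhandles=200
org.apache.lcf.database.username=lcf
org.apache.lcf.database.password=local_pg_password
org.apache.lcf.logconfigfile=/home/lcf/logging.ini
com.metacarta.crawler.threads=30
EOF
{code}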

h2. Commands

After you have created the necessary configuration files, you will need to initialize the
database, register the "pull-agent" agent, and then register your individual connectors. 
LCF provides a set of commands for performing these actions, and others as well.  The classes
implementing these commands are specified below.

|| Core Command Class || Function ||
| org.apache.lcf.core.DBCreate | Create LCF database instance |
| org.apache.lcf.core.DBDrop | Drop LCF database instance |
| org.apache.lcf.core.LockClean | Clean out synchronization directory |

|| Agents Command Class || Function ||
| org.apache.lcf.agents.Install | Create LCF agents tables |
| org.apache.lcf.agents.Uninstall | Remove LCF agents tables |
| org.apache.lcf.agents.Register | Register an agent class |
| org.apache.lcf.agents.UnRegister | Un-register an agent class |
| org.apache.lcf.agents.UnRegisterAll | Un-register all current agent classes |
| org.apache.lcf.agents.SynchronizeAll | Un-register all registered agent classes that can't
be found |
| org.apache.lcf.agents.RegisterOutput | Register an output connector class |
| org.apache.lcf.agents.UnRegisterOutput | Un-register an output connector class |
| org.apache.lcf.agents.UnRegisterAllOutputs | Un-register all current output connector classes |
| org.apache.lcf.agents.SynchronizeOutputs | Un-register all registered output connector classes
that can't be found |
| org.apache.lcf.agents.AgentRun | Main *agents* process class |
| org.apache.lcf.agents.AgentStop | Stops the running *agents* process |

|| Crawler Command Class || Function ||
| org.apache.lcf.crawler.Register | Register a repository connector class |
| org.apache.lcf.crawler.UnRegister | Un-register a repository connector class |
| org.apache.lcf.crawler.UnRegisterAll | Un-register all repository connector classes |
| org.apache.lcf.crawler.SynchronizeConnectors | Un-register all registered repository connector
classes that can't be found |
| org.apache.lcf.crawler.ExportConfiguration | Export crawler configuration to a file |
| org.apache.lcf.crawler.ImportConfiguration | Import crawler configuration from a file |

|| Authority Command Class || Function ||
| org.apache.lcf.authorities.RegisterAuthority | Register an authority connector class |
| org.apache.lcf.authorities.UnRegisterAuthority | Un-register an authority connector class |
| org.apache.lcf.authorities.UnRegisterAllAuthorities | Un-register all authority connector
classes |
| org.apache.lcf.authorities.SynchronizeAuthorities | Un-register all registered authority
connector classes that can't be found |

Remember that you need to include all the jars under _modules/dist/processes_ in the classpath
whenever you run one of these commands!  You also must include the corresponding -D switches,
as described earlier.
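
For example, here is a hypothetical invocation of one of these command classes from a Unix shell.  The classpath construction and the single -D switch shown follow the conventions described above; any arguments a particular command class expects are omitted:

{code}
# Sketch only: build a classpath from the jars under modules/dist/processes.
CP=$(echo modules/dist/processes/*.jar | tr ' ' ':')
# Add the recommended -D switches (one illustrative switch shown) and run a command class:
java -cp "$CP" -Dorg.apache.lcf.configfile=$HOME/lcf/properties.ini org.apache.lcf.core.DBCreate
{code}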

h2. Deploying the *lcf-crawler-ui* and *lcf-authority-service* web applications

If you built LCF using ant under the _modules_ directory, then the ant build will have constructed
two war files for you under _modules/dist/tomcat_.  Take these war files and deploy them as
web applications under one or more instances of tomcat.  There is no requirement that the
*lcf-crawler-ui* web application and the *lcf-authority-service* web application be deployed
on the same instance of tomcat.  With the current architecture of LCF, they must be deployed
on the same server, however.

Under _modules/dist/tomcat_, you may also see files that are not war files.  These files are
meant to be used as command-line -D switches for the tomcat process.  The switches may or
may not be identical for the two web applications, but they will never conflict.  You may
need to alter environment variables or your tomcat startup scripts in order to provide these
switches.  (More about this in the future...)
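
One possible way to wire this up, assuming the war files are named after the two web applications and that extra defines reach tomcat via CATALINA_OPTS; everything below is illustrative rather than prescribed:

{code}
# Sketch only: deploy the wars and turn the non-war switch files into -D options.
for f in modules/dist/tomcat/*; do
  case "$f" in
    *.war) cp "$f" "$CATALINA_HOME/webapps/" ;;
    *)     CATALINA_OPTS="$CATALINA_OPTS -D$(basename "$f")=$(cat "$f")" ;;
  esac
done
export CATALINA_OPTS
{code}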

h2. Running the *agents* process

to be continued

h2. Running connector-specific processes

to be continued
