Subject: svn commit: r1034577 [2/2] - in /incubator/lcf/site: publish/ src/documentation/content/xdocs/
Date: Fri, 12 Nov 2010 21:33:21 -0000
To: connectors-commits@incubator.apache.org
From: kwright@apache.org

Added: incubator/lcf/site/src/documentation/content/xdocs/how-to-build-and-deploy.xml
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/content/xdocs/how-to-build-and-deploy.xml?rev=1034577&view=auto
==============================================================================
--- incubator/lcf/site/src/documentation/content/xdocs/how-to-build-and-deploy.xml (added)
+++ incubator/lcf/site/src/documentation/content/xdocs/how-to-build-and-deploy.xml Fri Nov 12 21:33:20 2010
@@ -0,0 +1,613 @@

Building ManifoldCF

ManifoldCF consists of the framework itself, a set of connectors, and an Apache2 plug-in module. These can be built as follows.

Building the framework and the connectors

To build the ManifoldCF framework code, and the particular connectors you are interested in, you currently need to do the following:

1. Check out https://svn.apache.org/repos/asf/incubator/lcf/trunk.
2. cd to "modules".
3. Install desired dependent LGPL and proprietary libraries, wsdls, and xsds. See below for details.
4. Run ant.
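On a Unix-like system, the steps above can be sketched as the small shell function below. This is an illustration of the sequence only, not a shipped script: the checkout directory name "lcf-trunk" is an assumption, and svn and ant must already be installed.

```shell
# Sketch of the checkout-and-build sequence described above.
# The target directory name "lcf-trunk" is hypothetical.
checkout_and_build() {
  svn checkout https://svn.apache.org/repos/asf/incubator/lcf/trunk lcf-trunk &&
  cd lcf-trunk/modules &&
  # (copy any desired LGPL/proprietary jars, wsdls, and xsds into place here,
  #  as described in the connector-specific sections below)
  ant
}
```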

If you supply no LGPL or proprietary libraries, the framework itself and only the following repository connectors will be built:

• Active Directory authority
• Filesystem connector
• JDBC connector, with just the postgresql jdbc driver
• RSS connector
• Webcrawler connector
In addition, the following output connectors will be built:

• MetaCarta GTS output connector
• Apache Solr output connector
• Null output connector
• Null authority connector
The LGPL and proprietary connector dependencies are described in separate sections below.
The output of the ant build is produced in the modules/dist directory, which is further broken down by process. The number of process directories produced may vary, because some optional connectors supply processes that must be run to support the connector. See the table below for a description of the modules/dist folder.
Distribution directories

modules/dist directory | Meaning
web | Web applications that should be deployed on tomcat or the equivalent, plus recommended application server -D switch names and values
processes | Classpath jars that should be included in the class path for all non-connector-specific processes, along with -D switches, using the same convention as described for tomcat, above
lib | Jars for all the connector plugins, which should be referenced by the appropriate clause in the ManifoldCF configuration file
wsdd | Wsdd files that are needed by the included connectors in order to function
xxx-process | Classpath jars and -D switches needed for a required connector-specific process
example | A jetty-based example that runs in a single process (except for any connector-specific processes)
doc | Javadocs for framework and all included connectors
For all of the dist subdirectories above (except for wsdd, which does not correspond to a process), any scripts resulting from the build that pertain to that process are placed in a script subdirectory. Thus, the Windows command for executing a command for the processes subdirectory can be found in dist/processes/script/executecommand.bat. (This script requires two environment variables to be set before execution: JAVA_HOME, and MCF_HOME, which should point to ManifoldCF's home execution directory, described below.) Indeed, everything you need to run a ManifoldCF process can be found under dist/processes when the ant build completes: a define subdirectory containing -D switch description files, a jar subdirectory where jars are placed, and a war subdirectory where war files are output.
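On Unix-like systems, the work the supplied scripts do can be approximated by a short shell function like the one below. This is a sketch only: the helper name build_java_invocation and its echoing of the final command line are assumptions, but the jar and define subdirectory layout is the one described above.

```shell
# Sketch of what a process launcher does with the dist layout described above.
# build_java_invocation is a hypothetical helper, not a shipped script.
build_java_invocation() {
  process_dir="$1"    # e.g. dist/processes
  class="$2"          # main class to run
  shift 2

  # Assemble the classpath from every jar in the jar subdirectory.
  cp=""
  for jar in "$process_dir"/jar/*.jar; do
    [ -f "$jar" ] && cp="$cp:$jar"
  done
  cp="${cp#:}"

  # Each file in the define subdirectory becomes -D<file name>=<file contents>.
  defines=""
  for f in "$process_dir"/define/*; do
    [ -f "$f" ] && defines="$defines -D$(basename "$f")=$(cat "$f")"
  done

  echo "java$defines -cp $cp $class $*"
}
```

Calling, say, `build_java_invocation dist/processes org.apache.manifoldcf.agents.AgentRun` would print the java command line that the real scripts construct for you.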

The supplied scripts in the script directory for a process generally take care of building an appropriate classpath and set of -D switches. If you need to construct a classpath by hand, it is important to remember that "more" is not necessarily "better". The process deployment strategy implied by the build structure has been carefully thought out to avoid jar conflicts. Indeed, several connectors are structured using multiple processes precisely for that reason.
Building the Documentum connector

The Documentum connector requires EMC's DFC product in order to be built. Install DFC on the build system, and locate the jars it installs. You will need to copy at least dfc.jar, dfcbase.jar, and dctm.jar into the directory "modules/connectors/documentum/dfc".
Building the FileNet connector

The FileNet connector requires IBM's FileNet P8 API jar in order to be built. Install the FileNet P8 API on the build system, and copy at least "Jace.jar" from that installation into "modules/connectors/filenet/filenet-api".
Building the JDBC connector, including Oracle, SQLServer, or Sybase JDBC drivers

The JDBC connector also knows how to work with Oracle, SQLServer, and Sybase JDBC drivers. For Oracle, download the appropriate Oracle JDBC jar from the Oracle site, and copy it into the directory "modules/connectors/jdbc/jdbc-drivers". For SQLServer and Sybase, download jtds.jar, and copy it into the same directory.
Building the jCIFS connector

To build this connector, you need to download jcifs.jar from http://samba.jcifs.org, and copy it into the "modules/connectors/jcifs/jcifs" directory.
Building the LiveLink connector

This connector needs LAPI, which is a proprietary java library that allows access to OpenText's LiveLink server. Copy lapi.jar into the "modules/connectors/livelink/lapi" directory.
Building the Memex connector

This connector needs the Memex API jar, usually called JavaMXIELIB.jar. Copy this jar into the "modules/connectors/memex/mxie-java" directory.
Building the Meridio connector

The Meridio connector needs wsdls and xsds downloaded from an installed Meridio instance using disco.exe, which is installed as part of Microsoft Visual Studio, typically under "c:\Program Files\Microsoft SDKs\Windows\V6.x\bin". Obtain the preliminary wsdls and xsds by interrogating the following Meridio web services:

• http[s]://<meridio_server>/DMWS/MeridioDMWS.asmx
• http[s]://<meridio_server>/RMWS/MeridioRMWS.asmx
You should have obtained the following files in this step:

• MeridioDMWS.wsdl
• MeridioRMWS.wsdl
• DMDataSet.xsd
• RMDataSet.xsd
• RMClassificationDataSet.xsd
Next, patch these using Microsoft's xmldiffpatch utility suite, downloadable for Windows from http://msdn.microsoft.com/en-us/library/aa302294.aspx. The appropriate diff files to apply as patches can be found in "modules/connectors/meridio/upstream-diffs". After the patching, rename the files so that you have:

• MeridioDMWS_axis.wsdl
• MeridioRMWS_axis.wsdl
• DMDataSet_castor.xsd
• RMDataSet_castor.xsd
• RMClassificationDataSet_castor.xsd

Finally, copy all of these to "modules/connectors/meridio/wsdls".
Building the SharePoint connector

In order to build this connector, you need to download wsdls from an installed SharePoint instance. The wsdls in question are:

• Permissions.wsdl
• Lists.wsdl
• Dspsts.wsdl
• usergroup.wsdl
• versions.wsdl
• webs.wsdl
To download a wsdl, use Microsoft's disco.exe tool, which is part of Visual Studio, typically under "c:\Program Files\Microsoft SDKs\Windows\V6.x\bin". You will want to interrogate the following urls:

• http[s]://<server_name>/_vti_bin/Permissions.asmx
• http[s]://<server_name>/_vti_bin/Lists.asmx
• http[s]://<server_name>/_vti_bin/Dspsts.asmx
• http[s]://<server_name>/_vti_bin/usergroup.asmx
• http[s]://<server_name>/_vti_bin/versions.asmx
• http[s]://<server_name>/_vti_bin/webs.asmx
When the wsdl files have been downloaded, copy them to: "modules/connectors/sharepoint/wsdls".
Note well: For SharePoint instances version 3.0 or higher, in order to support file and folder level security, you must also deploy a custom SharePoint web service on the SharePoint instance you intend to connect to. This is because Microsoft apparently overlooked support for web-service-based access to such security information when SharePoint 3.0 was released.
In order to build the service, you need to have access to a Windows machine that has a reasonably current version of Microsoft Visual Studio available, with .NET installed and (at least) SharePoint 2.0 installed as well. The fastest way to build the service is to do the following after building everything else:

    cd connectors/sharepoint
    ant build-webservice
    cd webservice/Package
Then, follow the directions in the file "Installation Readme.txt", found in that directory.
Building ManifoldCF's Apache2 plugin

To build the mod-authz-annotate plugin, you need to start with a Unix system that has the apache2 development tools installed, plus the curl development package (from http://curl.haxx.se or elsewhere). Then, cd to modules/mod-authz-annotate, and type "make". The build will produce a file called mod-authz-annotate.so, which should be copied to the appropriate Apache2 directory so it can be used as a plugin.
Running ManifoldCF

Quick start

You can run most of ManifoldCF in a single process, for evaluation and convenience. This single-process version uses Jetty to handle its web applications, and Derby as an embedded database. All you need to do to run this version of ManifoldCF is to follow the build instructions above, and then:
    cd dist/example
    <java> -jar start.jar
In this jetty setup, all database initialization and connector registration takes place automatically (at the cost of some startup delay). The crawler UI can be found at http://<host>:8345/mcf-crawler-ui. The authority service can be found at http://<host>:8345/mcf-authority-service. The programmatic API is at http://<host>:8345/mcf-api.
You can stop ManifoldCF at any time using ^C.
Bear in mind that Derby is not as full-featured a database as Postgresql. This means that any performance testing you do against the quick start example may not be applicable to a full installation. Furthermore, Derby permits only one process at a time to connect to its databases, so you cannot use any of the ManifoldCF commands (described below) while the quick-start ManifoldCF is running.
Another caveat: the quick-start version of ManifoldCF in no way removes the need to run any separate processes that individual connectors require. Specifically, the Documentum and FileNet connectors require independently started processes in order to function. You will need to read about these connector-specific processes below in order to use the corresponding connectors.
The quick-start connectors.xml configuration file

The quick-start version of ManifoldCF reads its own configuration file, called connectors.xml, in order to register the available connectors in the database. The file has this basic format:
    <?xml version="1.0" encoding="UTF-8" ?>
    <connectors>
      (clauses)
    </connectors>
The following tags are available to specify your connectors:

    <repositoryconnector name="pretty_name" class="connector_class"/>
    <authorityconnector name="pretty_name" class="connector_class"/>
    <outputconnector name="pretty_name" class="connector_class"/>
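Putting these tags together, a minimal connectors.xml for the quick start might look like the following. The class names here are the ones used in the registration table later in this document; the pretty names are illustrative.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<connectors>
  <repositoryconnector name="Filesystem Connector"
      class="org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector"/>
  <authorityconnector name="Active Directory Authority"
      class="org.apache.manifoldcf.authorities.authorities.activedirectory.ActiveDirectoryAuthority"/>
  <outputconnector name="Null Connector"
      class="org.apache.manifoldcf.agents.output.nullconnector.NullConnector"/>
</connectors>
```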
Framework and connectors

The core part of ManifoldCF consists of several pieces. These basic pieces are enumerated below:
• A database, which is where ManifoldCF keeps all of its configuration and state information, usually Postgresql
• A synchronization directory, which is how ManifoldCF coordinates activity among its various processes
• An agents process, which is the process that actually crawls documents and ingests them
• A crawler-ui web application, which presents the UI users interact with to configure and control the crawler
• An authority-service web application, which responds to requests for authorization tokens, given a user name
• An api-service web application, which responds to REST API requests
In addition, there are a number of java classes in ManifoldCF that are intended to be called directly, to perform specific actions in the environment or in the database. These classes are usually invoked from the command line, with appropriate arguments supplied, and are thus considered to be ManifoldCF commands. The basic functionality supplied by these command classes is as follows:
• Create/Destroy the ManifoldCF database instance
• Start/Stop the agents process
• Register/Unregister an agent class (there's currently only one included)
• Register/Unregister an output connector
• Register/Unregister a repository connector
• Register/Unregister an authority connector
• Clean up synchronization directory garbage resulting from an ungraceful interruption of a ManifoldCF process
• Query for certain kinds of job-related information
Individual connectors may contribute additional command classes and processes to this picture. A properly built connector typically consists of:
• One or more jar files, meant to be included in the library area for connector jars and their dependencies.
• Possibly some java commands, which are meant to support or configure the connector in some way.
• Possibly a connector-specific process or two, each requiring a distinct classpath, which usually serves to isolate the crawler-ui web application, authority-service web application, agents process, and any commands from problematic aspects of the client environment.
• A recommended set of java "define" variables, which should be used consistently with all involved processes, e.g. the agents process, the application server running the authority-service and crawler-ui, and any commands. (This is historical; as of this writing, no connectors have any of these any longer.)
An individual connector package will typically supply an output connector, or a repository connector, or both a repository connector and an authority connector. The ant build script under modules automatically forms each individual connector's contribution to the overall system into the overall package.
The basic steps required to set up and run ManifoldCF are as follows:
• Check out and build, using "ant". The default target builds everything.
• Install postgresql. The postgresql JDBC driver included with ManifoldCF is known to work with version 8.3.x, so that version is the currently recommended one. Configure postgresql for your environment; the default configuration is acceptable for testing and experimentation.
• Install a Java application server, such as Tomcat.
• Create a home directory for ManifoldCF. To do this, make a copy of the contents of modules/dist from the build. In this directory, create properties.xml and logging.ini, as described below. Note that you will also need to create a synchronization directory, also detailed below, and refer to this directory within your properties.xml.
• Deploy the war files in <MCF_HOME>/web/war to your application server.
• Set the starting environment variables for your app server to include the -D commands found in <MCF_HOME>/web/define. The -D commands should be of the form "-D<file name>=<file contents>".
• Use the <MCF_HOME>/processes/script/executecommand.bat command to execute the appropriate commands from the next section below, being sure to first set the JAVA_HOME and MCF_HOME environment variables properly.
• Start any supporting processes that result from your build. (Some connectors, such as Documentum and FileNet, have auxiliary processes you need to run to make those connectors functional.)
• Start your application server.
• Start the ManifoldCF agents process.
• At this point, you should be able to interact with the ManifoldCF UI, which can be accessed via the mcf-crawler-ui web application.

For each of the described steps, details are furnished in the sections below.
Configuring the Postgresql database

Despite having an internal architecture that cleanly abstracts from specific database details, ManifoldCF is currently fairly specific to Postgresql. There are a number of reasons for this.
• ManifoldCF uses the database for its document queue, which places a significant load on it. The back-end database is thus a significant factor in ManifoldCF's performance. But, in exchange, ManifoldCF benefits enormously from the underlying ACID properties of the database.
• The strategy for getting optimal query plans from the database is not abstracted. For example, Postgresql 8.3+ is very sensitive to certain statistics about a database table, and in some cases will not generate a performant plan if the statistics are inaccurate by even a little. So, for Postgresql, database tables must be analyzed very frequently to avoid catastrophically bad plans. Luckily, Postgresql is pretty good at doing analysis quickly. Oracle, on the other hand, takes a very long time to perform analysis, but its plans are much less sensitive.
• Postgresql always does a sequential scan in order to count the number of rows in a table, while other databases return this value efficiently. This has affected the design of the ManifoldCF UI.
• The choice of query form influences the query plan. Ideally, this would not be true, but for both Postgresql and (say) Oracle, it is.
• Postgresql has a high degree of parallelism and lacks internal single-threadedness.
ManifoldCF has been tested against Postgresql 8.3.7. We recommend the following configuration parameter settings to work optimally with ManifoldCF:
• A default database encoding of UTF-8
• postgresql.conf settings as described in the table below
• pg_hba.conf settings to allow password access for TCP/IP connections from ManifoldCF
• A maintenance strategy involving cronjob-style vacuuming, rather than Postgresql autovacuum
Postgresql.conf parameters

postgresql.conf parameter | Tested value
shared_buffers | 1024MB
checkpoint_segments | 300
maintenance_work_mem | 2MB
tcpip_socket | true
max_connections | 400
checkpoint_timeout | 900
datestyle | ISO,European
autovacuum | off
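Expressed as a postgresql.conf fragment, the tested values above would be roughly as follows. Parameter spellings follow standard postgresql.conf conventions; consult your Postgresql version's documentation, since some parameter names have changed across releases.

```
shared_buffers = 1024MB
checkpoint_segments = 300
maintenance_work_mem = 2MB
tcpip_socket = true
max_connections = 400
checkpoint_timeout = 900
datestyle = ISO,European
autovacuum = off
```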
A note about maintenance
Postgresql's architecture causes it to accumulate dead tuples in its data files, which do not interfere with its performance but do bloat the database over time. The usage pattern of ManifoldCF is such that it can cause significant bloat to occur to the underlying Postgresql database in only a few days, under sufficient load. Postgresql has a feature to address this bloat, called vacuuming. This comes in three varieties: autovacuum, manual vacuum, and manual full vacuum.

We have found that Postgresql's autovacuum feature is inadequate under such conditions, because it not only fights for database resources pretty much all the time, but it falls further and further behind as well. Postgresql's in-place manual vacuum functionality is a bit better, but is still much, much slower than actually making a new copy of the database files, which is what happens when a manual full vacuum is performed.

+

+

Dead-tuple bloat also occurs in indexes in Postgresql, so tables that have had a lot of activity may benefit from being reindexed at the time of maintenance.

+

We therefore recommend periodic, scheduled maintenance operations instead, consisting of the following:

+

• VACUUM FULL VERBOSE;
• REINDEX DATABASE <the_db_name>;
During maintenance, Postgresql locks tables one at a time. Nevertheless, the crawler UI may become unresponsive for some operations, such as when counting outstanding documents on the job status page. ManifoldCF thus has the ability to check for the existence of a file prior to such sensitive operations, and will display a useful "maintenance in progress" message if that file is found. This allows a user to set up a maintenance system that provides adequate feedback to a ManifoldCF user of the overall status of the system.
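A minimal cron-run maintenance job along these lines might look like the sketch below. The database name, flag file path, and PSQL override variable are all placeholders; the actual file ManifoldCF checks is whatever your installation configures.

```shell
# Sketch of a scheduled maintenance job (names and paths hypothetical).
run_maintenance() {
  db_name="$1"
  flag_file="$2"               # file whose presence triggers the
                               # "maintenance in progress" message
  psql_cmd="${PSQL:-psql}"     # overridable so psql can be substituted

  touch "$flag_file"           # signal that maintenance has started
  "$psql_cmd" -d "$db_name" -c "VACUUM FULL VERBOSE;"
  "$psql_cmd" -d "$db_name" -c "REINDEX DATABASE $db_name;"
  rm -f "$flag_file"           # maintenance done; UI behaves normally again
}
```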
The ManifoldCF configuration file

Currently, ManifoldCF requires two configuration files: the main configuration property file, and the logging configuration file.
The property file path can be specified by the system property "org.apache.manifoldcf.configfile". If not specified through a -D switch, its name is presumed to be <user_home>/lcf/properties.xml. The property file is XML, of the following basic form:
    <?xml version="1.0" encoding="UTF-8" ?>
    <configuration>
      (clauses)
    </configuration>
Properties

The configuration file allows properties to be specified. A property clause has the form:
    <property name="property_name" value="property_value"/>
One of the optional properties is the name of the logging configuration file. This property's name is "org.apache.manifoldcf.logconfigfile". If not present, the logging configuration file will be assumed to be <user_home>/manifoldcf/logging.ini. The logging configuration file is a standard commons-logging property file, and should be formatted accordingly.
Note that all properties described below can also be specified on the command line, via a -D switch. If both methods of setting the property are used, the -D switch value will override the property file value.
The following table describes the configuration property file properties, and what they do:
Property.xml properties

Property | Required? | Function
org.apache.manifoldcf.lockmanagerclass | No | Specifies the class to use to implement synchronization. Default is a built-in file-based synchronization class.
org.apache.manifoldcf.databaseimplementationclass | No | Specifies the class to use to implement database access. Default is a built-in Postgresql implementation.
org.apache.manifoldcf.synchdirectory | Yes, if file-based synchronization class is used | Specifies the path of a synchronization directory. All ManifoldCF process owners must have read/write privileges to this directory.
org.apache.manifoldcf.database.maxhandles | No | Specifies the maximum number of database connection handles that will be pooled. Recommended value is 200.
org.apache.manifoldcf.database.handletimeout | No | Specifies the maximum time a handle is to live before it is presumed dead. Recommend a value of 604800, which is the maximum allowable.
org.apache.manifoldcf.logconfigfile | No | Specifies the location of the logging configuration file.
org.apache.manifoldcf.database.name | No | Describes the database name for ManifoldCF; defaults to "dbname" if not specified.
org.apache.manifoldcf.database.username | No | Describes the database user name for ManifoldCF; defaults to "manifoldcf" if not specified.
org.apache.manifoldcf.database.password | No | Describes the database user's password for ManifoldCF; defaults to "local_pg_password" if not specified.
org.apache.manifoldcf.crawler.threads | No | Number of crawler worker threads created. Suggest a value of 30.
org.apache.manifoldcf.crawler.deletethreads | No | Number of crawler delete threads created. Suggest a value of 10.
org.apache.manifoldcf.misc | No | Miscellaneous debugging output. Legal values are INFO, WARN, or DEBUG.
org.apache.manifoldcf.db | No | Database debugging output. Legal values are INFO, WARN, or DEBUG.
org.apache.manifoldcf.lock | No | Lock management debugging output. Legal values are INFO, WARN, or DEBUG.
org.apache.manifoldcf.cache | No | Cache management debugging output. Legal values are INFO, WARN, or DEBUG.
org.apache.manifoldcf.agents | No | Agent management debugging output. Legal values are INFO, WARN, or DEBUG.
org.apache.manifoldcf.perf | No | Performance logging debugging output. Legal values are INFO, WARN, or DEBUG.
org.apache.manifoldcf.crawlerthreads | No | Log crawler thread activity. Legal values are INFO, WARN, or DEBUG.
org.apache.manifoldcf.hopcount | No | Log hopcount tracking activity. Legal values are INFO, WARN, or DEBUG.
org.apache.manifoldcf.jobs | No | Log job activity. Legal values are INFO, WARN, or DEBUG.
org.apache.manifoldcf.connectors | No | Log connector activity. Legal values are INFO, WARN, or DEBUG.
org.apache.manifoldcf.scheduling | No | Log document scheduling activity. Legal values are INFO, WARN, or DEBUG.
org.apache.manifoldcf.authorityconnectors | No | Log authority connector activity. Legal values are INFO, WARN, or DEBUG.
org.apache.manifoldcf.authorityservice | No | Log authority service activity. Legal values are INFO, WARN, or DEBUG.
org.apache.manifoldcf.sharepoint.wsddpath | Yes, for SharePoint Connector | Path to the SharePoint Connector wsdd file.
org.apache.manifoldcf.meridio.wsddpath | Yes, for Meridio Connector | Path to the Meridio Connector wsdd file.
Class path libraries

The configuration file can also specify a set of directories which will be searched for connector jars. The directive that adds to the class path is:
    <libdir path="path"/>
Note that the path can be relative. For the purposes of path resolution, "." means the directory in which the properties.xml file is located.
Examples

An example properties file might be:
    <?xml version="1.0" encoding="UTF-8" ?>
    <configuration>
      <property name="org.apache.manifoldcf.synchdirectory" value="c:/mysynchdir"/>
      <property name="org.apache.manifoldcf.logconfigfile" value="c:/conf/logging.ini"/>
      <libdir path="./lib"/>
    </configuration>
An example simple logging configuration file might be:
    # Set the default log level and parameters
    # This gets inherited by all child loggers
    log4j.rootLogger=WARN, MAIN

    log4j.additivity.org.apache=false

    log4j.appender.MAIN=org.apache.log4j.RollingFileAppender
    log4j.appender.MAIN.File=c:/dataarea/manifoldcf.log
    log4j.appender.MAIN.MaxFileSize=50MB
    log4j.appender.MAIN.MaxBackupIndex=10
    log4j.appender.MAIN.layout=org.apache.log4j.PatternLayout
    log4j.appender.MAIN.layout.ConversionPattern=[%d]%-5p %m%n
Commands

After you have created the necessary configuration files, you will need to initialize the database, register the "pull-agent" agent, and then register your individual connectors. ManifoldCF provides a set of commands for performing these actions, and others as well. The classes implementing these commands are specified below.
Core Command Class | Arguments | Function
org.apache.manifoldcf.core.DBCreate | dbuser [dbpassword] | Create ManifoldCF database instance
org.apache.manifoldcf.core.DBDrop | dbuser [dbpassword] | Drop ManifoldCF database instance
org.apache.manifoldcf.core.LockClean | None | Clean out synchronization directory
Agents Command Class | Arguments | Function
org.apache.manifoldcf.agents.Install | None | Create ManifoldCF agents tables
org.apache.manifoldcf.agents.Uninstall | None | Remove ManifoldCF agents tables
org.apache.manifoldcf.agents.Register | classname | Register an agent class
org.apache.manifoldcf.agents.UnRegister | classname | Un-register an agent class
org.apache.manifoldcf.agents.UnRegisterAll | None | Un-register all current agent classes
org.apache.manifoldcf.agents.SynchronizeAll | None | Un-register all registered agent classes that can't be found
org.apache.manifoldcf.agents.RegisterOutput | classname description | Register an output connector class
org.apache.manifoldcf.agents.UnRegisterOutput | classname | Un-register an output connector class
org.apache.manifoldcf.agents.UnRegisterAllOutputs | None | Un-register all current output connector classes
org.apache.manifoldcf.agents.SynchronizeOutputs | None | Un-register all registered output connector classes that can't be found
org.apache.manifoldcf.agents.AgentRun | None | Main agents process class
org.apache.manifoldcf.agents.AgentStop | None | Stops the running agents process
Crawler Command Class | Arguments | Function
org.apache.manifoldcf.crawler.Register | classname description | Register a repository connector class
org.apache.manifoldcf.crawler.UnRegister | classname | Un-register a repository connector class
org.apache.manifoldcf.crawler.UnRegisterAll | None | Un-register all repository connector classes
org.apache.manifoldcf.crawler.SynchronizeConnectors | None | Un-register all registered repository connector classes that can't be found
org.apache.manifoldcf.crawler.ExportConfiguration | filename | Export crawler configuration to a file
org.apache.manifoldcf.crawler.ImportConfiguration | filename | Import crawler configuration from a file
Authority Command Class | Arguments | Function
org.apache.manifoldcf.authorities.RegisterAuthority | classname description | Register an authority connector class
org.apache.manifoldcf.authorities.UnRegisterAuthority | classname | Un-register an authority connector class
org.apache.manifoldcf.authorities.UnRegisterAllAuthorities | None | Un-register all authority connector classes
org.apache.manifoldcf.authorities.SynchronizeAuthorities | None | Un-register all registered authority connector classes that can't be found
Remember that you need to include all the jars under modules/dist/processes in the classpath whenever you run one of these commands! You also must include the corresponding -D switches, as described earlier.
Initializing the database

These are some of the commands you will need to use to create the database instance, initialize the schema, and register all of the appropriate components:
Command | Arguments
org.apache.manifoldcf.core.DBCreate | postgres postgres
org.apache.manifoldcf.agents.Install |
org.apache.manifoldcf.agents.Register | org.apache.manifoldcf.crawler.system.CrawlerAgent
org.apache.manifoldcf.agents.RegisterOutput | org.apache.manifoldcf.agents.output.gts.GTSConnector "GTS Connector"
org.apache.manifoldcf.agents.RegisterOutput | org.apache.manifoldcf.agents.output.solr.SolrConnector "SOLR Connector"
org.apache.manifoldcf.agents.RegisterOutput | org.apache.manifoldcf.agents.output.nullconnector.NullConnector "Null Connector"
org.apache.manifoldcf.authorities.RegisterAuthority | org.apache.manifoldcf.authorities.authorities.activedirectory.ActiveDirectoryAuthority "Active Directory Authority"
org.apache.manifoldcf.crawler.Register | org.apache.manifoldcf.crawler.connectors.DCTM.DCTM "Documentum Connector"
org.apache.manifoldcf.authorities.RegisterAuthority | org.apache.manifoldcf.crawler.authorities.DCTM.AuthorityConnector "Documentum Authority"
org.apache.manifoldcf.crawler.Register | org.apache.manifoldcf.crawler.connectors.filenet.FilenetConnector "FileNet Connector"
org.apache.manifoldcf.crawler.Register | org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector "Filesystem Connector"
org.apache.manifoldcf.crawler.Register | org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector "Database Connector"
org.apache.manifoldcf.crawler.Register | org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector "Windows Share Connector"
org.apache.manifoldcf.crawler.Register | org.apache.manifoldcf.crawler.connectors.livelink.LivelinkConnector "LiveLink Connector"
org.apache.manifoldcf.authorities.RegisterAuthority | org.apache.manifoldcf.crawler.connectors.livelink.LivelinkAuthority "LiveLink Authority"
org.apache.manifoldcf.crawler.Register | org.apache.manifoldcf.crawler.connectors.memex.MemexConnector "Memex Connector"
org.apache.manifoldcf.authorities.RegisterAuthority | org.apache.manifoldcf.crawler.connectors.memex.MemexAuthority "Memex Authority"
org.apache.manifoldcf.crawler.Register | org.apache.manifoldcf.crawler.connectors.meridio.MeridioConnector "Meridio Connector"
org.apache.manifoldcf.authorities.RegisterAuthority | org.apache.manifoldcf.crawler.connectors.meridio.MeridioAuthority "Meridio Authority"
org.apache.manifoldcf.crawler.Registerorg.apache.manifoldcf.crawler.connectors.rss.RSSConnector "RSS Connector"
org.apache.manifoldcf.crawler.Registerorg.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository "SharePoint Connector"
org.apache.manifoldcf.crawler.Registerorg.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector "Web Connector"
+
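Strung together as a script, the first few steps might look like the following sketch (the commands are echoed rather than executed, since the classpath and -D switches are installation-specific; all paths are illustrative):

```shell
# Echo the database-initialization command sequence. $MCF stands for the
# java invocation with the classpath and -D switches described earlier
# (the jar path and properties-file location shown are assumptions).
MCF='java -cp "modules/dist/processes/jar/*" -Dorg.apache.manifoldcf.configfile=./properties.xml'
for args in \
  'org.apache.manifoldcf.core.DBCreate postgres postgres' \
  'org.apache.manifoldcf.agents.Install' \
  'org.apache.manifoldcf.agents.Register org.apache.manifoldcf.crawler.system.CrawlerAgent'
do
  echo "$MCF $args"
done
```

The remaining Register/RegisterOutput/RegisterAuthority rows from the table above follow the same pattern; register only the connectors you actually intend to use.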

Deploying the mcf-crawler-ui, mcf-authority-service, and mcf-api-service web applications

If you built ManifoldCF using ant under the modules directory, the ant build will have constructed three war files for you under modules/dist/web. Deploy these war files as web applications under one or more instances of your application server. There is no requirement that the mcf-crawler-ui, mcf-authority-service, and mcf-api-service web applications be deployed on the same instance of the application server. However, with the current architecture of ManifoldCF, they must all be deployed on the same physical server.


Under modules/dist/web, you may also see files that are not war files. These files are meant to be used as command-line -D switches for the application server process. The switches may or may not be identical across the web applications, but they will never conflict. You may need to alter environment variables or your application server's startup scripts in order to provide these switches. (More about this in the future...)
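For example, under Tomcat the deployment might be sketched as follows (Tomcat and the $CATALINA_HOME layout are assumptions; any servlet container works, and the script only echoes what would be copied):

```shell
# Sketch: deploy the three war files built under modules/dist/web.
# Tomcat is an assumption here, not a ManifoldCF requirement.
CATALINA_HOME=${CATALINA_HOME:-/usr/local/tomcat}
for war in mcf-crawler-ui mcf-authority-service mcf-api-service; do
  # A real deployment would run:
  #   cp "modules/dist/web/$war.war" "$CATALINA_HOME/webapps/"
  echo "deploy modules/dist/web/$war.war -> $CATALINA_HOME/webapps/"
done
```

Remember to also pass the non-war -D switch files to the application server's JVM, e.g. via its startup-options environment variable.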

Running the agents process

The agents process is the process that actually performs the crawling for ManifoldCF. Start this process by running the command "org.apache.manifoldcf.agents.AgentRun". This class will run until stopped by invoking the command "org.apache.manifoldcf.agents.AgentStop". It is highly recommended that you stop the process in this way. You may also stop the process using a SIGTERM signal, but "kill -9" or the equivalent is NOT recommended, because that may result in dangling locks in the ManifoldCF synchronization directory. (If you have to, clean up these locks by shutting down all ManifoldCF processes, including the application server instances that are running the web applications, and invoking the command "org.apache.manifoldcf.core.LockClean".)
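A pair of start/stop wrappers might be sketched as follows ($MCF abbreviates the java invocation with the classpath and -D switches described earlier; the paths and the backgrounding with & are illustrative):

```shell
# Sketch: AgentRun keeps running until AgentStop is invoked, so it is
# normally launched in the background. Paths/switches are assumptions.
MCF='java -cp "modules/dist/processes/jar/*" -Dorg.apache.manifoldcf.configfile=./properties.xml'
echo "start: $MCF org.apache.manifoldcf.agents.AgentRun &"
echo "stop:  $MCF org.apache.manifoldcf.agents.AgentStop"
```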

Running connector-specific processes

Connector-specific processes must be started with a classpath that includes all the jars in the corresponding modules/dist/<process_name>-process directory. The Documentum and FileNet connectors are currently the only two connectors that require additional processes. Start these processes using the commands listed below, and stop them with SIGTERM.

| Connector | Process | Start class |
| Documentum | documentum-server-process | org.apache.manifoldcf.crawler.server.DCTM.DCTM |
| Documentum | documentum-registry-process | org.apache.manifoldcf.crawler.registry.DCTM.DCTM |
| FileNet | filenet-server-process | org.apache.manifoldcf.crawler.server.filenet.Filenet |
| FileNet | filenet-registry-process | org.apache.manifoldcf.crawler.registry.filenet.Filenet |
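For instance, the Documentum server process might be started like this (the path is illustrative; the classpath must include every jar in the corresponding process directory):

```shell
# Sketch: start the Documentum server process; stop it later with SIGTERM
# (kill <pid>), never kill -9. Directory layout is an assumption.
DCTM_CP='modules/dist/documentum-server-process/*'
START="java -cp $DCTM_CP org.apache.manifoldcf.crawler.server.DCTM.DCTM"
echo "$START"
```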

Running the ManifoldCF Apache2 plug-in

The ManifoldCF Apache2 plug-in, mod-authz-annotate, is designed to take an authenticated principal (e.g. from mod-auth-kerb) and query a set of authority services for the corresponding access tokens via an HTTP request. These access tokens are then passed to a (not included) search engine UI, which can use them to compose a search that properly excludes content the user is not permitted to see.


The list of authority services queried in this way is configured in Apache's httpd.conf file. This project includes only one such service: the java authority service, which uses the authority connections defined in the crawler UI to obtain the appropriate access tokens.


To use mod-authz-annotate, place it in Apache2's extensions directory and configure it appropriately in the httpd.conf file.


Note: The ManifoldCF project now contains support for converting a Kerberos principal to a list of Active Directory SIDs. This functionality is contained in the Active Directory Authority. The following connectors are expected to make use of this authority:

  • FileNet
  • Meridio
  • SharePoint

Configuring the ManifoldCF Apache2 plug-in

mod-authz-annotate understands the following httpd.conf commands:

| Command | Meaning | Values |
| AuthzAnnotateEnable | Turn the plug-in on or off | "On", "Off" |
| AuthzAnnotateAuthority | Point to an authority service that supports ACL queries, but not ID queries | The authority URL |
| AuthzAnnotateACLAuthority | Point to an authority service that supports ACL queries, but not ID queries | The authority URL |
| AuthzAnnotateIDAuthority | Point to an authority service that supports ID queries, but not ACL queries | The authority URL |
| AuthzAnnotateIDACLAuthority | Point to an authority service that supports both ACL queries and ID queries | The authority URL |
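Put together, a minimal httpd.conf fragment might look like this sketch (the module name, host, port, and servlet path are assumptions for illustration; point the URL at wherever the mcf-authority-service webapp is actually deployed):

```apache
# Illustrative only: module name, host, port, and path are assumptions.
LoadModule authz_annotate_module modules/mod_authz_annotate.so
AuthzAnnotateEnable On
# Query the java authority service for both ACLs and IDs:
AuthzAnnotateIDACLAuthority http://localhost:8080/mcf-authority-service/UserACLs
```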