Date: Sat, 20 Feb 2010 08:18:00 +0000 (UTC)
From: confluence@apache.org
To: connectors-commits@incubator.apache.org
Reply-To: connectors-dev@incubator.apache.org
Message-ID: <1927676392.1167.1266653880112.JavaMail.www-data@brutus.apache.org>
Subject: [CONF] Lucene Connector Framework > How to Build and Deploy Lucene Connector Framework

Space: Lucene Connector Framework (http://cwiki.apache.org/confluence/display/CONNECTORS)
Page: How to Build and Deploy Lucene Connector Framework
(http://cwiki.apache.org/confluence/display/CONNECTORS/How+to+Build+and+Deploy+Lucene+Connector+Framework)

Edited by Karl Wright:
---------------------------------------------------------------------
h1. Building LCF

To build Lucene Connector Framework, and the particular connectors you are interested in, you currently need to do the following:

# Check out [https://svn.apache.org/repos/asf/incubator/lcf/trunk].
# cd to "modules".
# Install desired dependent LGPL and proprietary libraries, wsdls, and xsds. See below for details.
# Run ant.

If you supply *no* LGPL or proprietary libraries, the framework itself and only the following repository connectors will be built:

* Filesystem connector
* JDBC connector, with just the postgresql jdbc driver
* RSS connector
* Webcrawler connector

In addition, the following output connectors will be built:

* MetaCarta GTS output connector
* Lucene SOLR output connector
* Null output connector

The LGPL and proprietary connector dependencies are described below.

h2. Building the Documentum connector

The Documentum connector requires EMC's DFC product in order to be built. Install DFC on the build system, and locate the jars it installs. You will need to copy at least dfc.jar, dfcbase.jar, and dctm.jar into the directory "modules/connectors/documentum/dfc".

h2. Building the FileNet connector

The FileNet connector requires IBM's FileNet P8 API jar in order to be built. Install the FileNet P8 API on the build system, and copy at least "Jace.jar" from that installation into "modules/connectors/filenet/filenet-api".

h2. Building the JDBC connector, including Oracle, SQLServer, or Sybase JDBC drivers

The JDBC connector also knows how to work with Oracle, SQLServer, and Sybase JDBC drivers. For Oracle, download the appropriate Oracle JDBC jar from the Oracle site, and copy it into the directory "modules/connectors/jdbc/jdbc-drivers". For SQLServer and Sybase, download jtds.jar, and copy it into the same directory.

h2. Building the jCIFS connector

To build this connector, you need to download jcifs.jar from http://jcifs.samba.org, and copy it into the "modules/connectors/jcifs/jcifs" directory.

h2. Building the LiveLink connector

This connector needs LAPI, which is a proprietary java library that allows access to OpenText's LiveLink server. Copy the lapi.jar into the "modules/connectors/livelink/lapi" directory.

h2. Building the Memex connector

This connector needs the Memex API jar, usually called JavaMXIELIB.jar. Copy this jar into the "modules/connectors/memex/mxie-java" directory.

h2. Building the Meridio connector

The Meridio connector needs wsdls and xsds downloaded from an installed Meridio instance using *disco.exe*, which is installed as part of Microsoft Visual Studio, typically under "c:\Program Files\Microsoft SDKs\Windows\V6.x\bin". Obtain the preliminary wsdls and xsds by interrogating the following Meridio web services:

* http\[s\]:///DMWS/MeridioDMWS.asmx
* http\[s\]:///RMWS/MeridioRMWS.asmx

This step should yield the following files:

* MeridioDMWS.wsdl
* MeridioRMWS.wsdl
* DMDataSet.xsd
* RMDataSet.xsd
* RMClassificationDataSet.xsd

Next, patch these using Microsoft's *xmldiffpatch* utility suite, downloadable for Windows from [http://msdn.microsoft.com/en-us/library/aa302294.aspx]. The appropriate diff files to apply as patches can be found in "modules/connectors/meridio/upstream-diffs". After patching, rename the results so that you have the files:

* MeridioDMWS_axis.wsdl
* MeridioRMWS_axis.wsdl
* DMDataSet.xsd
* RMDataSet.xsd
* RMClassificationDataSet.xsd

Finally, copy all of these to: "modules/connectors/meridio/wsdls".

h2. Building the SharePoint connector

In order to build this connector, you need to download wsdls from an installed SharePoint instance.
The wsdls in question are:

* Permissions.wsdl
* Lists.wsdl
* Dspsts.wsdl
* usergroup.wsdl
* versions.wsdl
* webs.wsdl

To download a wsdl, use Microsoft's *disco.exe* tool, which is part of Visual Studio, typically under "c:\Program Files\Microsoft SDKs\Windows\V6.x\bin". Interrogate the following URLs:

* http\[s\]:///_vti_bin/Permissions.asmx
* http\[s\]:///_vti_bin/Lists.asmx
* http\[s\]:///_vti_bin/Dspsts.asmx
* http\[s\]:///_vti_bin/usergroup.asmx
* http\[s\]:///_vti_bin/versions.asmx
* http\[s\]:///_vti_bin/webs.asmx

When the wsdl files have been downloaded, copy them to: "modules/connectors/sharepoint/wsdls".

h1. Running Lucene Connector Framework

The core part of Lucene Connector Framework consists of several pieces. These basic pieces are enumerated below:

* A Postgresql database, which is where LCF keeps all of its configuration and state information
* A synchronization directory, which is how LCF coordinates activity among its various processes
* An *agents* process, which is the process that actually crawls documents and ingests them
* A *crawler-ui* web application, which presents the UI users interact with to configure and control the crawler
* An *authority service* web application, which responds to requests for authorization tokens, given a user name

In addition, there are a number of java classes in Lucene Connector Framework that are intended to be called directly, to perform specific actions in the environment or in the database. These classes are usually invoked from the command line, with appropriate arguments supplied.
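
As a rough illustration of what such a command-line invocation looks like, the sketch below assembles a standard "java" command for one command class. It is not part of LCF: the helper class LcfCommandLine, the classpath "lib/*", and the define name "example.define" are hypothetical placeholders, and the assumption that org.apache.lcf.core.LockClean takes no arguments is made only to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not an LCF class: it only assembles a plain "java" invocation.
class LcfCommandLine {

    /** Builds the argument vector for running one command class in its own JVM. */
    static List<String> build(String classpath,
                              Map<String, String> defines,
                              String commandClass,
                              List<String> args) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-cp");
        command.add(classpath);
        // Each "define" variable becomes a -Dname=value JVM option.
        for (Map.Entry<String, String> define : defines.entrySet()) {
            command.add("-D" + define.getKey() + "=" + define.getValue());
        }
        command.add(commandClass);
        command.addAll(args);
        return command;
    }

    public static void main(String[] argv) {
        Map<String, String> defines = new LinkedHashMap<>();
        defines.put("example.define", "value"); // hypothetical define name and value

        // Assumption: the command needs no arguments; real commands may require some.
        List<String> command = build("lib/*", defines,
                "org.apache.lcf.core.LockClean", new ArrayList<>());

        // Print the command instead of running it, so the sketch has no side effects.
        System.out.println(String.join(" ", command));
    }
}
```

The sketch only prints the command. Passing the same list to java.lang.ProcessBuilder would launch it, provided the classpath and defines match those used by the other LCF processes.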
The basic functionality supplied by these classes is as follows:

* Create/Destroy the LCF database instance
* Start/Stop the *agents* process
* Register/Unregister an agent class (there's currently only one included)
* Register/Unregister an output connector
* Register/Unregister a repository connector
* Register/Unregister an authority connector
* Clean up synchronization directory garbage resulting from an ungraceful interruption of an LCF process
* Query for certain kinds of job-related information

Individual connectors may contribute additional command classes and processes to this picture. A properly built connector typically consists of:

* One or more jar files meant to be included in the *agents* process and command invocation classpaths
* An "iar" incremental war file, which is meant to be unpacked on top of the *crawler-ui* web application
* Possibly some java commands, which are meant to support or configure the connector in some way
* Possibly a connector-specific process or two, each requiring a distinct classpath, which usually serves to isolate the *crawler-ui* web application, *authority service* web application, *agents* process, and any commands from problematic aspects of the client environment
* A recommended set of java "define" variables, which should be used consistently with all involved processes, e.g. the *agents* process, the application server running the *authority service* and *crawler-ui*, and any commands

A connector package will typically supply an output connector, or a repository connector, or both a repository connector and an authority connector.

h2. Configuring the Postgresql database

Despite having an internal architecture that cleanly abstracts from specific database details, Lucene Connector Framework is currently fairly specific to Postgresql. There are a number of reasons for this.

# Lucene Connector Framework uses the database for its document queue, which places a significant load on it. The back-end database is thus a significant factor in LCF's performance. In exchange, LCF benefits enormously from the underlying ACID properties of the database.
# The syntax abstraction is not perfect. Some details, such as how regular expressions are handled, have not been abstracted sufficiently at the time of this writing.
# The strategy for getting optimal query plans from the database is not abstracted. For example, Postgresql 8.3+ is very sensitive to certain statistics about a database table, and in some cases will not generate a performant plan if the statistics are even slightly inaccurate. So, for Postgresql, the database table must be analyzed very frequently, to avoid catastrophically bad plans. Luckily, Postgresql is pretty good at doing analysis quickly. Oracle, on the other hand, takes a very long time to perform analysis, but its plans are much less sensitive.
# Postgresql always does a sequential scan in order to count the number of rows in a table, while other databases return this efficiently. This has affected the design of the LCF UI.
# The choice of query form influences the query plan. Ideally this would not be the case, but for both Postgresql and (say) Oracle, it is.
# Postgresql offers a high degree of parallelism and has no internal single-threaded bottlenecks.

Lucene Connector Framework has been tested against Postgresql 8.3.7.
We recommend the following configuration parameter settings to work optimally with LCF:

* A default database encoding of UTF-8
* postgresql.conf settings as described in the table below
* pg_hba.conf settings to allow password access for TCP/IP connections from Lucene Connector Framework
* A maintenance strategy involving cronjob-style vacuuming, rather than Postgresql autovacuum

|| postgresql.conf parameter || Tested value ||
| shared_buffers | 1024MB |
| checkpoint_segments | 300 |
| maintenance_work_mem | 2MB |
| tcpip_socket | true |
| max_connections | 400 |
| checkpoint_timeout | 900 |
| datestyle | ISO,European |
| autovacuum | off |

h3. A note about maintenance

Postgresql's architecture causes it to accumulate dead tuples in its data files, which do not interfere with its performance but do bloat the database over time. The usage pattern of LCF is such that it can cause significant bloat to occur to the underlying Postgresql database in only a few days, under sufficient load. Postgresql has a feature to address this bloat, called *vacuuming*. This comes in three varieties: autovacuum, manual vacuum, and manual full vacuum.

We have found that Postgresql's autovacuum feature is inadequate under such conditions, because it not only competes for database resources pretty much all the time, but also falls further and further behind. Postgresql's in-place manual vacuum functionality is a bit better, but is still much, much slower than actually making a new copy of the database files, which is what happens when a manual full vacuum is performed. Dead-tuple bloat also occurs in indexes in Postgresql, so tables that have had a lot of activity may benefit from being reindexed at the time of maintenance.

We therefore recommend periodic, scheduled maintenance operations instead, consisting of the following:

* VACUUM FULL VERBOSE;
* REINDEX DATABASE ;

During maintenance, Postgresql locks tables one at a time.
Nevertheless, the crawler ui may become unresponsive for some operations, such as when counting outstanding documents on the job status page. LCF thus has the ability to check for the existence of a file prior to such sensitive operations, and will display a useful "maintenance in progress" message if that file is found. This allows you to set up a maintenance system that gives LCF users adequate feedback about the overall status of the system.

h2. Running the *agents* process

to be continued

h2. Deploying the *crawler-ui* war

to be continued

h2. Deploying the *authorityservice* war

to be continued

h2. Running commands

|| Core Command Class || Function ||
| org.apache.lcf.core.DBCreate | Create LCF database instance |
| org.apache.lcf.core.DBDrop | Drop LCF database instance |
| org.apache.lcf.core.LockClean | Clean out synchronization directory |

|| Agents Command Class || Function ||
| org.apache.lcf.agents.Install | Create LCF agents tables |
| org.apache.lcf.agents.Uninstall | Remove LCF agents tables |
| org.apache.lcf.agents.Register | Register an agent class |
| org.apache.lcf.agents.UnRegister | Un-register an agent class |
| org.apache.lcf.agents.UnRegisterAll | Un-register all current agent classes |
| org.apache.lcf.agents.SynchronizeAll | Un-register all registered agent classes that can't be found |
| org.apache.lcf.agents.RegisterOutput | Register an output connector class |
| org.apache.lcf.agents.UnRegisterOutput | Un-register an output connector class |
| org.apache.lcf.agents.UnRegisterAllOutputs | Un-register all current output connector classes |
| org.apache.lcf.agents.SynchronizeOutputs | Un-register all registered output connector classes that can't be found |
| org.apache.lcf.agents.AgentRun | Main *agents* process class |
| org.apache.lcf.agents.AgentStop | Stops the running *agents* process |

|| Crawler Command Class || Function ||
| org.apache.lcf.crawler.Register | Register a repository connector class |
| org.apache.lcf.crawler.UnRegister | Un-register a repository connector class |
| org.apache.lcf.crawler.UnRegisterAll | Un-register all repository connector classes |
| org.apache.lcf.crawler.SynchronizeConnectors | Un-register all registered repository connector classes that can't be found |
| org.apache.lcf.crawler.ExportConfiguration | Export crawler configuration to a file |
| org.apache.lcf.crawler.ImportConfiguration | Import crawler configuration from a file |

|| Authority Command Class || Function ||
| org.apache.lcf.authorities.RegisterAuthority | Register an authority connector class |
| org.apache.lcf.authorities.UnRegisterAuthority | Un-register an authority connector class |
| org.apache.lcf.authorities.UnRegisterAllAuthorities | Un-register all authority connector classes |
| org.apache.lcf.authorities.SynchronizeAuthorities | Un-register all registered authority connector classes that can't be found |

to be continued