Date: Mon, 4 Oct 2010 15:02:00 -0400 (EDT)
From: confluence@apache.org
To: connectors-commits@incubator.apache.org
Subject: [CONF] Apache Connectors Framework > Programmatic Operation of ManifoldCF

Space: Apache Connectors Framework (https://cwiki.apache.org/confluence/display/CONNECTORS)
Page: Programmatic Operation of ManifoldCF (https://cwiki.apache.org/confluence/display/CONNECTORS/Programmatic+Operation+of+ManifoldCF)

Edited
by Karl Wright:
---------------------------------------------------------------------
h1. Programmatic Operation of ManifoldCF

A certain subset of ManifoldCF users want to think of ManifoldCF as an engine that they can poke from whatever other system they are developing. While ManifoldCF is not precisely a document indexing engine per se, it can certainly be controlled programmatically. Right now, there are three principal ways of achieving this control.

h3. Control by Servlet API

ManifoldCF provides a servlet-based JSON API that gives you the complete ability to define connections and jobs, and to control job execution. You can read about JSON [here|http://www.json.org]. The API is designed to be RESTful in character. Thus, it makes full use of the HTTP verbs GET, PUT, POST, and DELETE, and represents objects as URLs. The basic format of the JSON servlet resource URLs is as follows:

http\[s\]://__/mcf-api-service/json/__

The servlet ignores request data, except when the PUT or POST verb is used. In that case, the request data is presumed to be a JSON object. The servlet responds either with an error response code (400 or 500) and an appropriate explanatory message, or with a 200 (OK), 201 (CREATED), or 404 (NOT FOUND) response code along with a response JSON object.
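
As a rough illustration of the URL format and response conventions described above, here is a minimal Python sketch. The function names {{resource_url}} and {{classify_response}}, their parameter names, and the host {{localhost:8345}} are assumptions made for this example and are not part of ManifoldCF. The sketch makes no network request; it only shows how a client might assemble a URL and sort responses by status code.

```python
import json


def resource_url(server_and_port, resource, secure=False):
    """Build a JSON servlet resource URL in the documented shape."""
    scheme = "https" if secure else "http"
    return f"{scheme}://{server_and_port}/mcf-api-service/json/{resource}"


def classify_response(status, body):
    """Interpret an HTTP status code and body as described above."""
    if status in (400, 500):
        # Error codes carry an explanatory message rather than a result object.
        return ("error", body)
    if status in (200, 201, 404):
        # These codes carry a JSON object; an empty body is treated as {}
        # (an assumption made for this sketch).
        return ("json", json.loads(body) if body.strip() else {})
    return ("unexpected", body)


# Hypothetical host and resource, used only to show the call shape.
print(resource_url("localhost:8345", "jobs"))
print(classify_response(201, '{"job_id": "123"}'))
```

A real client would pass the status code and body returned by its HTTP library of choice into {{classify_response}}.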
The actual available resources and commands are as follows:

|| Resource || Verb || What it does || Input format || Output format ||
| outputconnectors | GET | List all registered output connectors | N/A | \{"outputconnector":\[__\]\} *OR* \{"error":__\} |
| authorityconnectors | GET | List all registered authority connectors | N/A | \{"authorityconnector":\[__\]\} *OR* \{"error":__\} |
| repositoryconnectors | GET | List all registered repository connectors | N/A | \{"repositoryconnector":\[__\]\} *OR* \{"error":__\} |
| outputconnections | GET | List all output connections | N/A | \{"outputconnection":\[__\]\} *OR* \{"error":__\} |
| outputconnections/__ | GET | Get a specific output connection | N/A | \{"outputconnection":__\} *OR* \{ \} *OR* \{"error":__\} |
| outputconnections/__ | PUT | Save or create an output connection | \{"outputconnection":__\} | \{"connection_name":__\} *OR* \{"error":__\} |
| outputconnections/__ | DELETE | Delete an output connection | N/A | \{ \} *OR* \{"error":__\} |
| status/outputconnections/__ | GET | Check the status of an output connection | N/A | \{"check_result":__\} *OR* \{"error":__\} |
| info/outputconnections/__/__ | GET | Retrieve arbitrary connector-specific resource | N/A | _<response_data>_ *OR* \{"error":__\} *OR* \{"service_interruption":__\} |
| authorityconnections | GET | List all authority connections | N/A | \{"authorityconnection":\[__\]\} *OR* \{"error":__\} |
| authorityconnections/__ | GET | Get a specific authority connection | N/A | \{"authorityconnection":__\} *OR* \{ \} *OR* \{"error":__\} |
| authorityconnections/__ | PUT | Save or create an authority connection | \{"authorityconnection":__\} | \{"connection_name":__\} *OR* \{"error":__\} |
| authorityconnections/__ | DELETE | Delete an authority connection | N/A | \{ \} *OR* \{"error":__\} |
| status/authorityconnections/__ | GET | Check the status of an authority connection | N/A | \{"check_result":__\} *OR* \{"error":__\} |
| repositoryconnections | GET | List all repository connections | N/A | \{"repositoryconnection":\[__\]\} *OR* \{"error":__\} |
| repositoryconnections/__ | GET | Get a specific repository connection | N/A | \{"repositoryconnection":__\} *OR* \{ \} *OR* \{"error":__\} |
| repositoryconnections/__ | PUT | Save or create a repository connection | \{"repositoryconnection":__\} | \{"connection_name":__\} *OR* \{"error":__\} |
| repositoryconnections/__ | DELETE | Delete a repository connection | N/A | \{ \} *OR* \{"error":__\} |
| status/repositoryconnections/__ | GET | Check the status of a repository connection | N/A | \{"check_result":__\} *OR* \{"error":__\} |
| info/repositoryconnections/__/__ | GET | Retrieve arbitrary connector-specific resource | N/A | __ *OR* \{"error":__\} *OR* \{"service_interruption":__\} |
| jobs | GET | List all job definitions | N/A | \{"job":\[__\]\} *OR* \{"error":__\} |
| jobs | POST | Create a job | \{"job":__\} | \{"job_id":__\} *OR* \{"error":__\} |
| jobs/__ | GET | Get a specific job definition | N/A | \{"job":_\} *OR* \{ \} *OR* \{"error":__\} |
| jobs/__ | PUT | Save a job definition | \{"job":__\} | \{"job_id":__\} *OR* \{"error":__\} |
| jobs/__ | DELETE | Delete a job definition | N/A | \{ \} *OR* \{"error":__\} |
| jobstatuses | GET | List all jobs and their status | N/A | \{"job":\[__\]\} *OR* \{"error":__\} |
| jobstatuses/__ | GET | Get a specific job's status | N/A | \{"jobstatus":_\} *OR* \{ \} *OR* \{"error":__\} |
| start/__ | PUT | Start a specified job manually | N/A | \{ \} *OR* \{"error":__\} |
| abort/__ | PUT | Abort a specified job | N/A | \{ \} *OR* \{"error":__\} |
| restart/__ | PUT | Stop and start a specified job | N/A | \{ \} *OR* \{"error":__\} |
| pause/__ | PUT | Pause a specified job | N/A | \{ \} *OR* \{"error":__\} |
| resume/__ | PUT | Resume a specified job | N/A | \{ \} *OR* \{"error":__\} |

Other resources having to do with reports have been planned but not yet implemented.

h5. Output connector objects

The JSON fields of an output connector object are as follows:

|| Field || Meaning ||
| "description" | The optional description of the connector |
| "class_name" | The class name of the class implementing the connector |

h5. Authority connector objects

The JSON fields of an authority connector object are as follows:

|| Field || Meaning ||
| "description" | The optional description of the connector |
| "class_name" | The class name of the class implementing the connector |

h5. Repository connector objects

The JSON fields of a repository connector object are as follows:

|| Field || Meaning ||
| "description" | The optional description of the connector |
| "class_name" | The class name of the class implementing the connector |

h5. Output connection objects

Output connection names, when they are part of a URL, should be encoded as follows:

# All instances of '.' should be replaced by '..'.
# All instances of '/' should be replaced by '.+'.
# The URL should be encoded using standard URL UTF-8-based %-encoding.

The JSON fields of an output connection object are as follows:

|| Field || Meaning ||
| "name" | The unique name of the connection |
| "description" | The description of the connection |
| "class_name" | The Java class name of the class implementing the connection |
| "max_connections" | The total number of outstanding connections allowed to exist at a time |
| "configuration" | The configuration object for the connection, which is specific to the connection class |

h5. Authority connection objects

Authority connection names, when they are part of a URL, should be encoded as follows:

# All instances of '.' should be replaced by '..'.
# All instances of '/' should be replaced by '.+'.
# The URL should be encoded using standard URL UTF-8-based %-encoding.
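
The three-step connection-name encoding listed above can be sketched in Python as follows. The function name {{encode_connection_name}} is hypothetical, and the choice to percent-encode every reserved character (including the '+' introduced by step 2) is an assumption about how strictly to apply the third rule.

```python
from urllib.parse import quote


def encode_connection_name(name):
    """Encode a connection name for use as a URL path segment."""
    # Steps 1 and 2: double each '.', then map '/' to '.+'. The order matters,
    # because step 2 introduces a '.' that must not be doubled again.
    escaped = name.replace(".", "..").replace("/", ".+")
    # Step 3: UTF-8 percent-encoding. Passing safe="" also encodes '+' as %2B,
    # which is an assumption made for this sketch.
    return quote(escaped, safe="")


# Hypothetical connection name, used only to show the transformation.
print(encode_connection_name("My Repo/v1.0"))
```

The encoded name would then be appended to a resource path from the table of resources, for example after {{authorityconnections/}}.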
The JSON fields for an authority connection object are as follows:

|| Field || Meaning ||
| "name" | The unique name of the connection |
| "description" | The description of the connection |
| "class_name" | The Java class name of the class implementing the connection |
| "max_connections" | The total number of outstanding connections allowed to exist at a time |
| "configuration" | The configuration object for the connection, which is specific to the connection class |

h5. Repository connection objects

Repository connection names, when they are part of a URL, should be encoded as follows:

# All instances of '.' should be replaced by '..'.
# All instances of '/' should be replaced by '.+'.
# The URL should be encoded using standard URL UTF-8-based %-encoding.

The JSON fields for a repository connection object are as follows:

|| Field || Meaning ||
| "name" | The unique name of the connection |
| "description" | The description of the connection |
| "class_name" | The Java class name of the class implementing the connection |
| "max_connections" | The total number of outstanding connections allowed to exist at a time |
| "configuration" | The configuration object for the connection, which is specific to the connection class |
| "acl_authority" | The (optional) name of the authority that will enforce security for this connection |
| "throttle" | An array of throttle objects, which control how quickly documents can be requested from this connection |

Each throttle object has the following fields:

|| Field || Meaning ||
| "match" | The regular expression which is used to match a document's bins to determine if the throttle should be applied |
| "match_description" | Optional text describing the meaning of the throttle |
| "rate" | The maximum fetch rate to use if the throttle applies, in fetches per minute |

h5. Job objects

The JSON fields for a job are as follows:

|| Field || Meaning ||
| "id" | The job's identifier, if present. If not present, ManifoldCF will create one (and will also create the job when saved). |
| "description" | Text describing the job |
| "repository_connection" | The name of the repository connection to use with the job |
| "output_connection" | The name of the output connection to use with the job |
| "document_specification" | The document specification object for the job, whose format is repository-connection specific |
| "output_specification" | The output specification object for the job, whose format is output-connection specific |
| "start_mode" | The start mode for the job, which can be one of "schedule window start", "schedule window anytime", or "manual" |
| "run_mode" | The run mode for the job, which can be either "continuous" or "scan once" |
| "hopcount_mode" | The hopcount mode for the job, which can be one of "accurate", "no delete", or "never delete" |
| "priority" | The job's priority, typically "5" |
| "recrawl_interval" | The default time between recrawl of documents (if the job is "continuous"), in milliseconds, or "infinite" for infinity |
| "expiration_interval" | The time until a document expires (if the job is "continuous"), in milliseconds, or "infinite" for infinity |
| "reseed_interval" | The time between reseeding operations (if the job is "continuous"), in milliseconds, or "infinite" for infinity |
| "hopcount" | An array of hopcount objects, describing the link types and associated maximum hops permitted for the job |
| "schedule" | An array of schedule objects, describing when the job should be started and run |

Each hopcount object has the following fields:

|| Field || Meaning ||
| "link_type" | The connection-type-dependent type of a link for which a hop count restriction is specified |
| "count" | The maximum number of hops allowed for the associated link type, starting at a seed |

Each schedule object has the following fields:

|| Field || Meaning ||
| "timezone" | The optional time zone for the schedule object; if not present, the default server time zone is used |
| "duration" | The optional length of the described time window, in milliseconds; if not present, duration is considered infinite |
| "dayofweek" | The optional day-of-the-week enumeration object |
| "monthofyear" | The optional month-of-the-year enumeration object |
| "dayofmonth" | The optional day-of-the-month enumeration object |
| "year" | The optional year enumeration object |
| "hourofday" | The optional hour-of-the-day enumeration object |
| "minutesofhour" | The optional minutes-of-the-hour enumeration object |

Each enumeration object describes an array of integers using the form:

\{"value":\[__\]\}

Each integer is a zero-based index describing which entity is being specified. For example, for "dayofweek", 0 corresponds to Sunday, etc., and thus "dayofweek":\{"value":\[0,6\]\} would describe Saturdays and Sundays.

h5. Job status objects

The JSON fields of a job status object are as follows:

|| Field || Meaning ||
| "job_id" | The job identifier |
| "status" | The job status, having the possible values: "not yet run", "running", "paused", "done", "waiting", "starting up", "cleaning up", "error", "aborting", "restarting", "running no connector", and "terminating" |
| "error_text" | The error text, if the status is "error" |
| "start_time" | The job start time, in milliseconds since Jan 1, 1970 |
| "end_time" | The job end time, in milliseconds since Jan 1, 1970 |
| "documents_in_queue" | The total number of documents in the queue for the job |
| "documents_outstanding" | The number of documents for the job that are currently considered 'active' |
| "documents_processed" | The number of documents in the queue for the job that have been processed at least once |

h5. Connection-type-specific objects

As you may note when trying to use the above JSON API methods, you cannot get very far in defining connections or jobs without knowing the JSON format of a connection's configuration information, or a job's connection-specific document specification and output specification information. The form of these objects is controlled by the Java implementation of the underlying connector, and is translated directly into JSON, so if you write your own connector you should be able to figure out what it will be in the API. For connectors already part of ManifoldCF, documenting these connector-specific objects remains an outstanding task, and that work is not yet underway.

Luckily, it is pretty easy to learn a lot about the objects in question by simply creating connections and jobs in the ManifoldCF crawler UI, and then inspecting the resulting JSON objects through the API. In this way, it should be possible to do a decent job of coding most API-based integrations. The one place where difficulties will certainly occur is if you try to completely replace the ManifoldCF crawler UI with one of your own. This is because most connectors have methods that communicate with their respective back-ends in order to allow the user to select appropriate values. For example, the path drill-down that is presented by the LiveLink connector requires that the connector interrogate the appropriate LiveLink repository in order to populate its path selection pull-downs. There is, at this time, only one sanctioned way to accomplish the same job using the API, which is to use the appropriate "_connection_type_/execute/_type-specific_command_" command to perform the necessary functions. Some set of useful functions has been coded for every appropriate connector, but the exact commands for every connector, and their JSON syntax, remain undocumented for now.

h5. File system connector

The file system connector has no configuration information, and no connector-specific commands. However, it does have document specification information. The information looks something like this:

\{"startpoint":\[\{"_attribute_path":"c:\path_to_files","include":\[\{"_attribute_type":"file","_attribute_match":"\*.txt"\},\{"_attribute_type":"file","_attribute_match":"\*.doc"\},\{"_attribute_type":"directory","_attribute_match":"\*"\}\],"exclude":\["\*.mov"\]\}\]\}

As you can see, multiple starting paths are possible, and there can be one or more inclusion and exclusion rules.

h3. Control via Commands

For script writers, there currently exist a number of ManifoldCF execution commands. These commands mainly cover defining connections and jobs, controlling jobs, and running reports. The following table lists the current suite.

|| Command || What it does ||
| org.apache.manifoldcf.agents.DefineOutputConnection | Create a new output connection |
| org.apache.manifoldcf.agents.DeleteOutputConnection | Delete an existing output connection |
| org.apache.manifoldcf.authorities.ChangeAuthSpec | Modify an authority's configuration information |
| org.apache.manifoldcf.authorities.CheckAll | Check all authorities to be sure they are functioning |
| org.apache.manifoldcf.authorities.DefineAuthorityConnection | Create a new authority connection |
| org.apache.manifoldcf.authorities.DeleteAuthorityConnection | Delete an existing authority connection |
| org.apache.manifoldcf.crawler.AbortJob | Abort a running job |
| org.apache.manifoldcf.crawler.AddScheduledTime | Add a schedule record to a job |
| org.apache.manifoldcf.crawler.ChangeJobDocSpec | Modify a job's specification information |
| org.apache.manifoldcf.crawler.DefineJob | Create a new job |
| org.apache.manifoldcf.crawler.DefineRepositoryConnection | Create a new repository connection |
| org.apache.manifoldcf.crawler.DeleteJob | Delete an existing job |
| org.apache.manifoldcf.crawler.DeleteRepositoryConnection | Delete an existing repository connection |
| org.apache.manifoldcf.crawler.ExportConfiguration | Write the complete list of all connection definitions and job specifications to a file |
| org.apache.manifoldcf.crawler.FindJob | Locate a job identifier given a job's name |
| org.apache.manifoldcf.crawler.GetJobSchedule | Find a job's schedule given a job's identifier |
| org.apache.manifoldcf.crawler.ImportConfiguration | Import configuration as written by a previous ExportConfiguration command |
| org.apache.manifoldcf.crawler.ListJobStatuses | List the status of all jobs |
| org.apache.manifoldcf.crawler.ListJobs | List the identifiers for all jobs |
| org.apache.manifoldcf.crawler.PauseJob | Given a job identifier, pause the specified job |
| org.apache.manifoldcf.crawler.RestartJob | Given a job identifier, restart the specified job |
| org.apache.manifoldcf.crawler.RunDocumentStatus | Run a document status report |
| org.apache.manifoldcf.crawler.RunMaxActivityHistory | Run a maximum activity report |
| org.apache.manifoldcf.crawler.RunMaxBandwidthHistory | Run a maximum bandwidth report |
| org.apache.manifoldcf.crawler.RunQueueStatus | Run a queue status report |
| org.apache.manifoldcf.crawler.RunResultHistory | Run a result history report |
| org.apache.manifoldcf.crawler.RunSimpleHistory | Run a simple history report |
| org.apache.manifoldcf.crawler.StartJob | Start a job |
| org.apache.manifoldcf.crawler.WaitForJobDeleted | After a job has been deleted, wait until the delete has completed |
| org.apache.manifoldcf.crawler.WaitForJobInactive | After a job has been started or aborted, wait until the job ceases all activity |
| org.apache.manifoldcf.crawler.WaitJobPaused | After a job has been paused, wait for the pause to take effect |

h3. Control by direct code

Control by direct Java code is quite a reasonable thing to do.
The sources of the above commands should give a pretty clear idea of how to proceed, if that's what you want to do.

h3. Caveats

The above commands know nothing about the differences between connection types. Instead, they deal with configuration and specification information in the form of XML documents. Normally, these XML documents are hidden from a system integrator, unless they happen to look into the database with a tool such as psql. But the API commands above will often require such XML documents to be included as part of the command execution.

This has one major consequence. Any application that manipulates connections and jobs directly cannot be connection-type independent; such applications must know the proper form of XML to submit to the command. So it is not possible to use these command APIs to write one's own UI wrapper without sacrificing some of the repository independence that ManifoldCF by itself maintains.