Mailing-List: contact connectors-commits-help@incubator.apache.org;
 run by ezmlm
Precedence: bulk
Reply-To: connectors-dev@incubator.apache.org
Date: Mon, 4 Oct 2010 15:02:00 -0400 (EDT)
From: confluence@apache.org
To: connectors-commits@incubator.apache.org
Message-ID: <11377506.21133.1286218920014.JavaMail.confluence@thor>
Subject: [CONF] Apache Connectors Framework > Programmatic Operation of
 ManifoldCF
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Auto-Submitted: auto-generated

Space: Apache Connectors Framework (https://cwiki.apache.org/confluence/dis=
play/CONNECTORS)
Page: Programmatic Operation of ManifoldCF (https://cwiki.apache.org/conflu=
ence/display/CONNECTORS/Programmatic+Operation+of+ManifoldCF)


Edited by Karl Wright:
---------------------------------------------------------------------
h1. Programmatic Operation of ManifoldCF

A certain subset of ManifoldCF users want to think of ManifoldCF as an engi=
ne that they can poke from whatever other system they are developing.  Whil=
e ManifoldCF is not precisely a document indexing engine per se, it can cer=
tainly be controlled programmatically.  Right now, there are three principa=
l ways of achieving this control.

h3. Control by Servlet API

ManifoldCF provides a servlet-based JSON API that gives you the complete ab=
ility to define connections and jobs, and control job execution.  You can r=
ead about JSON [here|http://www.json.org].  The API is designed to be RESTf=
ul in character.  Thus, it makes full use of the HTTP verbs GET, PUT, POST,=
 and DELETE, and represents objects as URLs.  The basic format of the JSON =
servlet resource URLs is as follows:

http\[s\]://_<server_and_port>_/mcf-api-service/json/_<resource>_

The servlet ignores request data, except when the PUT or POST verb is used.=
  In that case, the request data is presumed to be a JSON object.  The serv=
let responds either with an error response code (400 or 500) and an=
 appropriate explanatory message, or with a 200 (OK), 201 (CREATED), or 404=
 (NOT FOUND) response code along with a response JSON object.
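
As a rough illustration of the URL format and verb handling described=
 above, the following Python sketch builds (but does not send) a request=
 for a given resource.  The host and port in the base URL, the=
 Content-Type header, and the helper name build_request are assumptions=
 made for this example, not part of the documented API.

```python
import json
import urllib.request

# Assumed server location; substitute your own host and port.
BASE_URL = "http://localhost:8345/mcf-api-service/json"


def build_request(verb, resource, body=None):
    """Build an HTTP request for a JSON API resource (hypothetical)."""
    data = None
    headers = {}
    # The servlet ignores request data except for PUT and POST.
    if body is not None and verb in ("PUT", "POST"):
        data = json.dumps(body).encode("utf-8")
        # Assumption: the servlet accepts this content type.
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + "/" + resource,
        data=data,
        headers=headers,
        method=verb,
    )


# Example: a request that would list all job definitions.
request = build_request("GET", "jobs")
print(request.get_method(), request.full_url)
```

The resulting object can be passed to urllib.request.urlopen to perform=
 the call.  Note that urlopen raises HTTPError for the 400, 404, and 500=
 response codes, so a caller has to catch that exception in order to read=
 the response body.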

The actual available resources and commands are as follows:


|| Resource || Verb || What it does || Input format || Output format ||
| outputconnectors | GET | List all registered output connectors | N/A | \{=
"outputconnector":\[_<list_of_output_connector_objects>_\]\} *OR* \{"error"=
:_<error_text>_\} |
| authorityconnectors | GET | List all registered authority connectors | N/=
A | \{"authorityconnector":\[_<list_of_authority_connector_objects>_\]\} *O=
R* \{"error":_<error_text>_\} |
| repositoryconnectors | GET | List all registered repository connectors | =
N/A | \{"repositoryconnector":\[_<list_of_repository_connector_objects>_\]\=
} *OR* \{"error":_<error_text>_\} |
| outputconnections | GET | List all output connections | N/A | \{"outputco=
nnection":\[_<list_of_output_connection_objects>_\]\} *OR* \{"error":_<erro=
r_text>_\} |
| outputconnections/_<encoded_connection_name>_ | GET | Get a specific outp=
ut connection | N/A | \{"outputconnection":_<output_connection_object>_\} *=
OR* \{ \} *OR* \{"error":_<error_text>_\} |
| outputconnections/_<encoded_connection_name>_ | PUT | Save or create an o=
utput connection | \{"outputconnection":_<output_connection_object>_\} | \{=
"connection_name":_<connection_name>_\} *OR* \{"error":_<error_text>_\} |
| outputconnections/_<encoded_connection_name>_ | DELETE | Delete an output=
 connection | N/A | \{ \} *OR* \{"error":_<error_text>_\} |
| status/outputconnections/_<encoded_connection_name>_ | GET | Check the st=
atus of an output connection | N/A | \{"check_result":_<message>_\} *OR* \{=
"error":_<error_text>_\} |
| info/outputconnections/_<encoded_connection_name>_/_<connector_specific_r=
esource>_ | GET | Retrieve arbitrary connector-specific resource | N/A | _<=
response_data>_ *OR* \{"error":_<error_text>_\} *OR* \{"service_interruptio=
n":_<error_text>_\} |
| authorityconnections | GET | List all authority connections | N/A | \{"au=
thorityconnection":\[_<list_of_authority_connection_objects>_\]\} *OR* \{"e=
rror":_<error_text>_\} |
| authorityconnections/_<encoded_connection_name>_ | GET | Get a specific a=
uthority connection | N/A | \{"authorityconnection":_<authority_connection_=
object>_\} *OR* \{ \} *OR* \{"error":_<error_text>_\} |
| authorityconnections/_<encoded_connection_name>_ | PUT | Save or create a=
n authority connection | \{"authorityconnection":_<authority_connection_obj=
ect>_\} | \{"connection_name":_<connection_name>_\} *OR* \{"error":_<error_=
text>_\} |
| authorityconnections/_<encoded_connection_name>_ | DELETE | Delete an aut=
hority connection | N/A | \{ \} *OR* \{"error":_<error_text>_\} |
| status/authorityconnections/_<encoded_connection_name>_ | GET | Check the=
 status of an authority connection | N/A | \{"check_result":_<message>_\} *=
OR* \{"error":_<error_text>_\} |
| repositoryconnections | GET | List all repository connections | N/A | \{"=
repositoryconnection":\[_<list_of_repository_connection_objects>_\]\} *OR* =
\{"error":_<error_text>_\} |
| repositoryconnections/_<encoded_connection_name>_ | GET | Get a specific =
repository connection | N/A | \{"repositoryconnection":_<repository_connect=
ion_object>_\} *OR* \{ \} *OR* \{"error":_<error_text>_\} |
| repositoryconnections/_<encoded_connection_name>_ | PUT | Save or create =
a repository connection | \{"repositoryconnection":_<repository_connection_=
object>_\} | \{"connection_name":_<connection_name>_\} *OR* \{"error":_<err=
or_text>_\} |
| repositoryconnections/_<encoded_connection_name>_ | DELETE | Delete a rep=
ository connection | N/A | \{ \} *OR* \{"error":_<error_text>_\} |
| status/repositoryconnections/_<encoded_connection_name>_ | GET | Check th=
e status of a repository connection | N/A | \{"check_result":_<message>_\} =
*OR* \{"error":_<error_text>_\} |
| info/repositoryconnections/_<encoded_connection_name>_/_<connector_specif=
ic_resource>_ | GET | Retrieve arbitrary connector-specific resource | N/A =
| _<response_data>_ *OR* \{"error":_<error_text>_\} *OR* \{"service_interru=
ption":_<error_text>_\} |
| jobs | GET | List all job definitions | N/A | \{"job":\[_<list_of_job_obj=
ects>_\]\} *OR* \{"error":_<error_text>_\} |
| jobs | POST | Create a job | \{"job":_<job_object>_\} | \{"job_id":_<job_=
identifier>_\} *OR* \{"error":_<error_text>_\} |
| jobs/_<job_id>_ | GET | Get a specific job definition | N/A | \{"job":_<j=
ob_object>_\} *OR* \{ \} *OR* \{"error":_<error_text>_\} |
| jobs/_<job_id>_ | PUT | Save a job definition | \{"job":_<job_object>_\} =
| \{"job_id":_<job_identifier>_\} *OR* \{"error":_<error_text>_\} |
| jobs/_<job_id>_ | DELETE | Delete a job definition | N/A | \{ \} *OR* \{"=
error":_<error_text>_\} |
| jobstatuses | GET | List all jobs and their status | N/A | \{"job":\[_<li=
st_of_job_status_objects>_\]\} *OR* \{"error":_<error_text>_\} |
| jobstatuses/_<job_id>_ | GET | Get a specific job's status | N/A | \{"job=
status":_<job_status_object>_\} *OR* \{ \} *OR* \{"error":_<error_text>_\} =
|
| start/_<job_id>_ | PUT | Start a specified job manually | N/A | \{ \} *OR=
* \{"error":_<error_text>_\} |
| abort/_<job_id>_ | PUT | Abort a specified job | N/A | \{ \} *OR* \{"erro=
r":_<error_text>_\} |
| restart/_<job_id>_ | PUT | Stop and start a specified job | N/A | \{ \} *=
OR* \{"error":_<error_text>_\} |
| pause/_<job_id>_ | PUT | Pause a specified job | N/A | \{ \} *OR* \{"erro=
r":_<error_text>_\} |
| resume/_<job_id>_ | PUT | Resume a specified job | N/A | \{ \} *OR* \{"er=
ror":_<error_text>_\} |
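
Every output format in the table above is either a JSON object carrying=
 the requested data, an empty object, or an object with an error field.  A=
 minimal Python sketch of handling these three cases follows.  The ApiError=
 class and the unwrap helper are hypothetical names, and the sample=
 response strings are made up for illustration.  The service_interruption=
 case returned by the info resources is not handled here.

```python
import json


class ApiError(Exception):
    """Raised when a response object carries an "error" field."""


def unwrap(response_text, key):
    """Return the value under key, or None for an empty object."""
    obj = json.loads(response_text)
    if "error" in obj:
        raise ApiError(obj["error"])
    # An empty object means the requested item was not found.
    return obj.get(key)


# Made-up sample responses, for illustration only.
print(unwrap('{"job_id": "12345"}', "job_id"))
print(unwrap("{}", "job"))
```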

Other resources having to do with reports have been planned but have not=
 yet been implemented.

h5. Output connector objects

The JSON fields an output connector object has are as follows:

|| Field || Meaning ||
| "description" | The optional description of the connector |
| "class_name" | The class name of the class implementing the connector |

h5. Authority connector objects

The JSON fields an authority connector object has are as follows:

|| Field || Meaning ||
| "description" | The optional description of the connector |
| "class_name" | The class name of the class implementing the connector |

h5. Repository connector objects

The JSON fields a repository connector object has are as follows:

|| Field || Meaning ||
| "description" | The optional description of the connector |
| "class_name" | The class name of the class implementing the connector |

h5. Output connection objects

Output connection names, when they are part of a URL, should be encoded as =
follows:

# All instances of '.' should be replaced by '..'.
# All instances of '/' should be replaced by '.+'.
# The URL should be encoded using standard URL utf-8-based %-encoding.
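
The three encoding steps above can be sketched in Python as follows.  The=
 function name encode_connection_name is hypothetical, and the sketch=
 assumes that percent-encoding every reserved character (including the plus=
 sign introduced by the second step) is what the servlet expects.

```python
from urllib.parse import quote


def encode_connection_name(name):
    """Encode a connection name for use inside a resource URL."""
    # Order matters: double the dots first, so that the '.' introduced
    # by the '/' replacement is not doubled again.
    escaped = name.replace(".", "..").replace("/", ".+")
    # safe="" makes quote() percent-encode every reserved character,
    # using UTF-8 for any non-ASCII text.
    return quote(escaped, safe="")


print(encode_connection_name("My Files/v1.0"))
```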

The JSON fields an output connection object has are as follows:

|| Field || Meaning ||
| "name" | The unique name of the connection |
| "description" | The description of the connection |
| "class_name" | The Java class name of the class implementing the connecti=
on |
| "max_connections" | The total number of outstanding connections allowed t=
o exist at a time |
| "configuration" | The configuration object for the connection, which is s=
pecific to the connection class |

h5. Authority connection objects

Authority connection names, when they are part of a URL, should be encoded =
as follows:

# All instances of '.' should be replaced by '..'.
# All instances of '/' should be replaced by '.+'.
# The URL should be encoded using standard URL utf-8-based %-encoding.

The JSON fields for an authority connection object are as follows:

|| Field || Meaning ||
| "name" | The unique name of the connection |
| "description" | The description of the connection |
| "class_name" | The Java class name of the class implementing the connecti=
on |
| "max_connections" | The total number of outstanding connections allowed t=
o exist at a time |
| "configuration" | The configuration object for the connection, which is s=
pecific to the connection class |

h5. Repository connection objects

Repository connection names, when they are part of a URL, should be encoded=
 as follows:

# All instances of '.' should be replaced by '..'.
# All instances of '/' should be replaced by '.+'.
# The URL should be encoded using standard URL utf-8-based %-encoding.

The JSON fields for a repository connection object are as follows:

|| Field || Meaning ||
| "name" | The unique name of the connection |
| "description" | The description of the connection |
| "class_name" | The Java class name of the class implementing the connecti=
on |
| "max_connections" | The total number of outstanding connections allowed t=
o exist at a time |
| "configuration" | The configuration object for the connection, which is s=
pecific to the connection class |
| "acl_authority" | The (optional) name of the authority that will enforce =
security for this connection |
| "throttle" | An array of throttle objects, which control how quickly docu=
ments can be requested from this connection |

Each throttle object has the following fields:

|| Field || Meaning ||
| "match" | The regular expression which is used to match a document's bins=
 to determine if the throttle should be applied |
| "match_description" | Optional text describing the meaning of the throttl=
e |
| "rate" | The maximum fetch rate to use if the throttle applies, in fetche=
s per minute |
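
As an illustration, a throttle array could be assembled in Python along=
 these lines.  The helper name make_throttle, the example pattern, and the=
 choice to send the rate as a number rather than a string are all=
 assumptions.

```python
import json


def make_throttle(match, rate, description=None):
    """Build one throttle object (hypothetical helper)."""
    throttle = {"match": match, "rate": rate}
    # The description is optional, so only include it when given.
    if description is not None:
        throttle["match_description"] = description
    return throttle


# Example: at most 60 fetches per minute for bins matching the pattern.
throttles = [make_throttle(r".*\.example\.com", 60, "Example hosts")]
print(json.dumps({"throttle": throttles}))
```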

h5. Job objects

The JSON fields for a job are as follows:

|| Field || Meaning ||
| "id" | The job's identifier, if present.  If not present, ManifoldCF will=
 create one (and will also create the job when saved). |
| "description" | Text describing the job |
| "repository_connection" | The name of the repository connection to use wi=
th the job |
| "output_connection" | The name of the output connection to use with the j=
ob |
| "document_specification" | The document specification object for the job,=
 whose format is repository-connection specific |
| "output_specification" | The output specification object for the job, who=
se format is output-connection specific |
| "start_mode" | The start mode for the job, which can be one of "schedule =
window start", "schedule window anytime", or "manual" |
| "run_mode" | The run mode for the job, which can be either "continuous" o=
r "scan once" |
| "hopcount_mode" | The hopcount mode for the job, which can be one of "acc=
urate", "no delete", or "never delete" |
| "priority" | The job's priority, typically "5" |
| "recrawl_interval" | The default time between recrawl of documents (if th=
e job is "continuous"), in milliseconds, or "infinite" for infinity |
| "expiration_interval" | The time until a document expires (if the job is =
"continuous"), in milliseconds, or "infinite" for infinity |
| "reseed_interval" | The time between reseeding operations (if the job is =
"continuous"), in milliseconds, or "infinite" for infinity |
| "hopcount" | An array of hopcount objects, describing the link types and =
associated maximum hops permitted for the job |
| "schedule" | An array of schedule objects, describing when the job should=
 be started and run |

Each hopcount object has the following fields:

|| Field || Meaning ||
| "link_type" | The connection-type-dependent type of a link for which a ho=
p count restriction is specified |
| "count" | The maximum number of hops allowed for the associated link type=
, starting at a seed |

Each schedule object has the following fields:

|| Field || Meaning ||
| "timezone" | The optional time zone for the schedule object; if not prese=
nt the default server time zone is used |
| "duration" | The optional length of the described time window, in millise=
conds; if not present, duration is considered infinite |
| "dayofweek" | The optional day-of-the-week enumeration object |
| "monthofyear" | The optional month-of-the-year enumeration object |
| "dayofmonth" | The optional day-of-the-month enumeration object |
| "year" | The optional year enumeration object |
| "hourofday" | The optional hour-of-the-day enumeration object |
| "minutesofhour" | The optional minutes-of-the-hour enumeration object |

Each enumeration object describes an array of integers using the form:

\{"value":\[_<integer_list>_\]\}

Each integer is a zero-based index describing which entity is being specifi=
ed.  For example, for "dayofweek", 0 corresponds to Sunday, etc., and thus =
"dayofweek":\{"value":\[0,6\]\} would describe Saturdays and Sundays.

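A schedule object built from enumeration objects might be assembled as in=
 the following Python sketch.  The enumeration helper is a hypothetical=
 name, and the particular schedule (weekends, hour index 2, a four-hour=
 window) is only an example.  Sending the duration as a number rather than=
 a string is an assumption.

```python
import json


def enumeration(*indexes):
    """Wrap zero-based integer indexes in an enumeration object."""
    return {"value": list(indexes)}


# A schedule object for weekends (0 is Sunday, 6 is Saturday),
# at hour-of-day index 2, with a four-hour window.
weekend_schedule = {
    "dayofweek": enumeration(0, 6),
    "hourofday": enumeration(2),
    "minutesofhour": enumeration(0),
    "duration": 4 * 60 * 60 * 1000,  # milliseconds
}

print(json.dumps(weekend_schedule))
```
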
h5. Job status objects

The JSON fields of a job status object are as follows:

|| Field || Meaning ||
| "job_id" | The job identifier |
| "status" | The job status, having the possible values: "not yet run", "ru=
nning", "paused", "done", "waiting", "starting up", "cleaning up", "error",=
 "aborting", "restarting", "running no connector", and "terminating" |
| "error_text" | The error text, if the status is "error" |
| "start_time" | The job start time, in milliseconds since Jan 1, 1970 |
| "end_time" | The job end time, in milliseconds since Jan 1, 1970 |
| "documents_in_queue" | The total number of documents in the queue for the=
 job |
| "documents_outstanding" | The number of documents for the job that are cu=
rrently considered 'active' |
| "documents_processed" | The number of documents in the queue for the job=
 that have been processed at least once |
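
A client will usually want to interpret a job status object after fetching=
 it.  The Python sketch below shows one way to do that.  The summarize=
 helper, the set of statuses treated as inactive, and the sample values are=
 assumptions, as is reading the timestamps as UTC.

```python
from datetime import datetime, timezone

# Assumption: these statuses mean the job is not currently active.
INACTIVE_STATUSES = {"not yet run", "done", "error"}


def summarize(status_obj):
    """Return (job_id, is_inactive, start time or None)."""
    start = status_obj.get("start_time")
    started_at = None
    if start is not None:
        # Times are given in milliseconds since Jan 1, 1970.
        started_at = datetime.fromtimestamp(
            int(start) / 1000, tz=timezone.utc
        )
    inactive = status_obj["status"] in INACTIVE_STATUSES
    return status_obj["job_id"], inactive, started_at


# Made-up sample status object, for illustration only.
sample = {"job_id": "12345", "status": "running", "start_time": "0"}
print(summarize(sample))
```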

h5. Connection-type-specific objects

As you may note when trying to use the above JSON API methods, you cannot=
 get very far in defining connections or jobs without knowing the JSON=
 format of a connection's configuration information, or a job's=
 connection-specific document specification and output specification=
 information.  The form of these objects is controlled by the Java=
 implementation of the underlying connector, and is translated directly=
 into JSON, so if you write your own connector you should be able to=
 figure out what it will be in the API.  For connectors already part of=
 ManifoldCF, documenting these connector-specific objects remains an open=
 task that has not yet been started.

Luckily, it is pretty easy to learn a lot about the objects in question by =
simply creating connections and jobs in the ManifoldCF crawler UI, and then=
 inspecting the resulting JSON objects through the API.  In this way, it sh=
ould be possible to do a decent job of coding most API-based integrations. =
 The one place where difficulties will certainly occur will be if you try t=
o completely replace the ManifoldCF crawler UI with one of your own.  This =
is because most connectors have methods that communicate with their respect=
ive back-ends in order to allow the user to select appropriate values.  For=
 example, the path drill-down that is presented by the LiveLink connector r=
equires that the connector interrogate the appropriate LiveLink repository =
in order to populate its path selection pull-downs.  There is, at this time=
, only one sanctioned way to accomplish the same job using the API, which i=
s to use the appropriate "_connection_type_/execute/_type-specific_command_=
" command to perform the necessary functions.  Some set of useful functions=
 has been coded for every appropriate connector, but the exact commands for=
 every connector, and their JSON syntax, remain undocumented for now.

h5. File system connector

The file system connector has no configuration information, and no connecto=
r-specific commands.  However, it does have document specification informat=
ion.  The information looks something like this:

\{"startpoint":\[\{"_attribute_path":"c:\path_to_files","include":\[\{"_att=
ribute_type":"file","_attribute_match":"\*.txt"\},\{"_attribute_type":"file=
","_attribute_match":"\*.doc"\},\{"_attribute_type":"directory","_attribute=
_match":"\*"\}\],"exclude":\["\*.mov"\]\}\]\}

As you can see, multiple starting paths are possible, and the inclusion and=
 exclusion rules also can be one or multiple.
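
A document specification of the shape shown above could be generated=
 rather than written by hand, as in this Python sketch.  The helper names=
 file_rule and startpoint are hypothetical, and the layout of the exclude=
 entries simply follows the example above as written.

```python
import json


def file_rule(kind, pattern):
    """One inclusion rule; kind is "file" or "directory"."""
    return {"_attribute_type": kind, "_attribute_match": pattern}


def startpoint(path, include=(), exclude=()):
    """One starting path with its inclusion and exclusion rules."""
    return {
        "_attribute_path": path,
        "include": list(include),
        "exclude": list(exclude),
    }


spec = {
    "startpoint": [
        startpoint(
            "c:\\path_to_files",
            include=[
                file_rule("file", "*.txt"),
                file_rule("file", "*.doc"),
                file_rule("directory", "*"),
            ],
            exclude=["*.mov"],
        )
    ]
}
print(json.dumps(spec))
```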


h3. Control via Commands

For script writers, ManifoldCF currently provides a number of execution=
 commands.  These commands mainly cover defining connections and jobs,=
 controlling jobs, and running reports.  The following table lists the=
 current suite.

|| Command || What it does ||
| org.apache.manifoldcf.agents.DefineOutputConnection | Create a new output=
 connection |
| org.apache.manifoldcf.agents.DeleteOutputConnection | Delete an existing =
output connection |
| org.apache.manifoldcf.authorities.ChangeAuthSpec | Modify an authority's =
configuration information |
| org.apache.manifoldcf.authorities.CheckAll | Check all authorities to be =
sure they are functioning |
| org.apache.manifoldcf.authorities.DefineAuthorityConnection | Create a ne=
w authority connection |
| org.apache.manifoldcf.authorities.DeleteAuthorityConnection | Delete an e=
xisting authority connection |
| org.apache.manifoldcf.crawler.AbortJob | Abort a running job |
| org.apache.manifoldcf.crawler.AddScheduledTime | Add a schedule record to=
 a job |
| org.apache.manifoldcf.crawler.ChangeJobDocSpec | Modify a job's specifica=
tion information |
| org.apache.manifoldcf.crawler.DefineJob | Create a new job |
| org.apache.manifoldcf.crawler.DefineRepositoryConnection | Create a new r=
epository connection |
| org.apache.manifoldcf.crawler.DeleteJob | Delete an existing job |
| org.apache.manifoldcf.crawler.DeleteRepositoryConnection | Delete an exis=
ting repository connection |
| org.apache.manifoldcf.crawler.ExportConfiguration | Write the complete li=
st of all connection definitions and job specifications to a file |
| org.apache.manifoldcf.crawler.FindJob | Locate a job identifier given a j=
ob's name |
| org.apache.manifoldcf.crawler.GetJobSchedule | Find a job's schedule give=
n a job's identifier |
| org.apache.manifoldcf.crawler.ImportConfiguration | Import configuration =
as written by a previous ExportConfiguration command |
| org.apache.manifoldcf.crawler.ListJobStatuses | List the status of all jo=
bs |
| org.apache.manifoldcf.crawler.ListJobs | List the identifiers for all job=
s |
| org.apache.manifoldcf.crawler.PauseJob | Given a job identifier, pause th=
e specified job |
| org.apache.manifoldcf.crawler.RestartJob | Given a job identifier, restar=
t the specified job |
| org.apache.manifoldcf.crawler.RunDocumentStatus | Run a document status r=
eport |
| org.apache.manifoldcf.crawler.RunMaxActivityHistory | Run a maximum activ=
ity report |
| org.apache.manifoldcf.crawler.RunMaxBandwidthHistory | Run a maximum band=
width report |
| org.apache.manifoldcf.crawler.RunQueueStatus | Run a queue status report =
|
| org.apache.manifoldcf.crawler.RunResultHistory | Run a result history rep=
ort |
| org.apache.manifoldcf.crawler.RunSimpleHistory | Run a simple history rep=
ort |
| org.apache.manifoldcf.crawler.StartJob | Start a job |
| org.apache.manifoldcf.crawler.WaitForJobDeleted | After a job has been de=
leted, wait until the delete has completed |
| org.apache.manifoldcf.crawler.WaitForJobInactive | After a job has been s=
tarted or aborted, wait until the job ceases all activity |
| org.apache.manifoldcf.crawler.WaitJobPaused | After a job has been paused=
, wait for the pause to take effect |
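
Assuming each command above is a Java class that can be run from the=
 command line, a script can launch it through the java executable.  The=
 Python sketch below only builds the argument list.  The classpath is an=
 assumed location, the command_line helper is a hypothetical name, and any=
 JVM options or arguments that a particular command needs are not covered=
 here.

```python
import shlex

# Assumption: a classpath that contains the ManifoldCF jars.
CLASSPATH = "/opt/manifoldcf/lib/*"


def command_line(class_name, *args, java="java"):
    """Return the argument list that would run one command class."""
    return [java, "-cp", CLASSPATH, class_name, *args]


argv = command_line("org.apache.manifoldcf.crawler.ListJobs")
print(shlex.join(argv))
```

The list can then be handed to subprocess.run, which avoids shell quoting=
 problems with arguments that contain spaces.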

h3. Control by direct code

Control by direct Java code is quite a reasonable thing to do.  The sources=
 of the above commands should give a pretty clear idea how to proceed, if t=
hat's what you want to do.


h3. Caveats

The above commands know nothing about the differences between connection ty=
pes.  Instead, they deal with configuration and specification information i=
n the form of XML documents.  Normally, these XML documents are hidden from=
 a system integrator, unless they happen to look into the database with a t=
ool such as psql.  But the API commands above often will require such XML d=
ocuments to be included as part of the command execution.

This has one major consequence.  Any application that would manipulate conn=
ections and jobs directly cannot be connection-type independent - these app=
lications must know the proper form of XML to submit to the command.  So, i=
t is not possible to use these command APIs to write one's own UI wrapper, =
without sacrificing some of the repository independence that ManifoldCF by =
itself maintains.

