manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-946) Add support for pipeline connector
Date Tue, 27 May 2014 23:45:02 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010505#comment-14010505 ]

Karl Wright commented on CONNECTORS-946:
----------------------------------------

On second thought, to preserve the ability to detect configuration changes, the Pipeline
Connector will need a version string.  This changes the design quite a bit:

- The pipeline connection list for processing is built right in the job
- Each job has an ordered list of pipeline connections it runs on every document (in a new
database table)
- Pipeline connections can have job tabs in the UI (although we have to figure out something
to avoid collisions when the same connection type appears more than once in one job -- maybe
pass in the pipeline connection name as a parameter to the UI methods)
- There's a TranslationSpecification, analogous to an OutputSpecification or DocumentSpecification,
and a pipeline connector method that explicitly maps the TranslationSpecification to a version
string
- The transformDocument() method accepts the version string, and uses that where appropriate
to control the transformation
- The ingeststatus table has a new sidecar table that holds onto pipeline connection version
strings, for comparison
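
A minimal sketch of the version-string idea in the list above, for illustration only. Every
name here (SketchSpecification, SketchPipelineConnector, CaseFoldingConnector, and the "case"
parameter) is hypothetical and is not part of the ManifoldCF API; a plain String stands in
for RepositoryDocument. The point is that the specification maps deterministically to a
version string, and the transformation is driven by that string alone.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a per-job TranslationSpecification.
final class SketchSpecification {
    // A sorted map keeps the derived version string deterministic.
    final TreeMap<String, String> params = new TreeMap<>();
}

// Hypothetical connector shape; not the real ManifoldCF interface.
interface SketchPipelineConnector {
    // Maps the specification to a version string; any parameter change alters it.
    String getVersionString(SketchSpecification spec);

    // Transforms a document using only what is encoded in the version string.
    String transformDocument(String documentText, String versionString);
}

final class CaseFoldingConnector implements SketchPipelineConnector {
    @Override
    public String getVersionString(SketchSpecification spec) {
        // Assumes keys and values contain no '=' or ';'; real code would need escaping.
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : spec.params.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append(';');
        }
        return sb.toString();
    }

    @Override
    public String transformDocument(String documentText, String versionString) {
        // Unpack the version string and look for the illustrative "case=upper" setting.
        for (String token : versionString.split(";")) {
            if (token.equals("case=upper")) {
                return documentText.toUpperCase(Locale.ROOT);
            }
        }
        return documentText;
    }
}
```

Because the version string fully determines the transformation in this sketch, comparing
stored and current version strings is enough to tell whether a document would be
transformed differently.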

It's critical that the performance of the ingeststatus table does not suffer unless there
are configured pipeline steps, but I think that would be relatively straightforward to do,
since pipeline version strings will be directly requested by the worker threads when evaluating
whether a document has changed.
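
The change check described above could look roughly like the following sketch. It is an
assumption-laden illustration, not the framework's implementation: ChangeDetectionSketch
and its in-memory maps are hypothetical stand-ins for the ingeststatus table and its
sidecar table. The sidecar is consulted only when the job has pipeline steps configured,
so jobs without a pipeline pay no extra lookup. The sketch deliberately ignores the case
where all pipeline steps were removed from a job after documents were indexed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ChangeDetectionSketch {
    // Stand-in for the ingeststatus table: document id -> document version string.
    final Map<String, String> ingestStatus = new HashMap<>();
    // Stand-in for the sidecar table: document id -> ordered pipeline version strings.
    final Map<String, List<String>> sidecar = new HashMap<>();
    // Counts sidecar reads so the fast path can be observed.
    int sidecarLookups = 0;

    boolean needsReindex(String docId, String docVersion, List<String> pipelineVersions) {
        // A new or changed document always needs reindexing.
        if (!docVersion.equals(ingestStatus.get(docId))) {
            return true;
        }
        // Fast path: no pipeline steps configured, so the sidecar is never touched.
        if (pipelineVersions.isEmpty()) {
            return false;
        }
        // Compare the ordered version strings against what was stored at ingest time.
        sidecarLookups++;
        return !pipelineVersions.equals(sidecar.get(docId));
    }

    void recordIngest(String docId, String docVersion, List<String> pipelineVersions) {
        ingestStatus.put(docId, docVersion);
        if (!pipelineVersions.isEmpty()) {
            sidecar.put(docId, new ArrayList<>(pipelineVersions));
        }
    }
}
```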


> Add support for pipeline connector
> ----------------------------------
>
>                 Key: CONNECTORS-946
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-946
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>
> In the Amazon Search Connector, we finally found an example of an output connector that
> needed to do full document processing in order to work.  This ticket represents work in the
> framework to create a concept of "pipeline connector".  Pipeline connections would receive
> RepositoryDocument objects, and transform them to new RepositoryDocument objects.  There would
> be a single important method:
> {code}
> public void transformDocument(RepositoryDocument rd, ITransformationActivities activities)
>   throws ServiceInterruption, ManifoldCFException;
> {code}
> ... where ITransformationActivities would include a method that would send a RepositoryDocument
> object onward to either the output connection or to the next pipeline connection.
> Each pipeline connection would have:
> - A name
> - A description
> - Configuration data
> - An optional prerequisite pipeline connection
> Every output connection would have a new field, which is an optional prerequisite pipeline
> connection.
> This design is based loosely on how mapping connections and authority connections interrelate.
> An alternate design would involve having per-job specification information, but I think this
> would wind up being way too complex for very little benefit, since each pipeline connection/stage
> would be expected to do relatively simple/granular things, not usually involving interaction
> with an external system.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
