manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-962) Support multiple output connections for a single job
Date Wed, 11 Jun 2014 23:55:02 GMT


Karl Wright commented on CONNECTORS-962:

Ok, here is a list of what I think needs to be done.
- NO changes needed to the ingeststatus table
- The interface to IIncrementalIngester should change to include a complete description of
the pipeline, including multiple output connections - more about how that should be represented
- The pipelines table should change to include two more columns: (1) isoutput (boolean), and
(2) prerequisiterank (integer)
- The jobs table should remove the dedicated outputname column (and the upgrade should push
it into the pipelines table instead)
- IJobDescription needs to change to remove the dedicated output connection name and instead
treat outputs as pipeline stages.  A pipeline stage gets an additional int which functions
as an ID, a boolean describing whether it is transformation or output, and an additional reference
to a prerequisite (which is a reference to an ID)
- Crawler UI needs to change for display and editing (already have some ideas there)
- WorkerThread needs to change to assemble the pipeline specification for the incremental ingester
- Incremental ingester needs to assemble more complex pipelines than before; pipelines are
still effectively evaluated in strict rank order, but whenever there is a downstream dependency
on a particular result, a copy is made and set aside for later use.
- Actual evaluation in Java probably works best if each pipeline stage is called as soon as
its RepositoryDocument is available; in that way cleanup can be done in a finally block
- Editing of pipeline needs to update back references in some way, but since there's an ID
in place, a delete should rewire references to the deleted item to go instead to that item's
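
The stage model and rank-order evaluation described in the list above could look roughly like the following. This is only a minimal sketch under stated assumptions, not ManifoldCF code: PipelineSketch, Stage, SOURCE, evaluate and connectionName are hypothetical names, a plain String stands in for RepositoryDocument, and a transformation is simulated by appending the stage's name.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PipelineSketch {
    /** Hypothetical marker meaning "the prerequisite is the fetched document itself". */
    static final int SOURCE = -1;

    /** One pipeline stage: an ID, a transformation/output flag, and a prerequisite ID. */
    static final class Stage {
        final int id;
        final boolean isOutput;
        final int prerequisite;      // ID of the stage whose result this stage consumes
        final String connectionName; // hypothetical label for the connection

        Stage(int id, boolean isOutput, int prerequisite, String connectionName) {
            this.id = id;
            this.isOutput = isOutput;
            this.prerequisite = prerequisite;
            this.connectionName = connectionName;
        }
    }

    /**
     * Walks the stages in list order (taken here to be rank order). Each
     * transformation result is kept under its stage ID so that any later
     * stage naming it as a prerequisite can pick it up. Returns what each
     * output connection received, keyed by connection name.
     */
    static Map<String, String> evaluate(List<Stage> stages, String document) {
        Map<Integer, String> results = new HashMap<>();
        results.put(SOURCE, document);
        Map<String, String> delivered = new LinkedHashMap<>();
        for (Stage stage : stages) {
            String input = results.get(stage.prerequisite);
            if (input == null) {
                throw new IllegalStateException(
                    "Stage " + stage.id + " refers to a prerequisite that has not run yet");
            }
            if (stage.isOutput) {
                delivered.put(stage.connectionName, input);
            } else {
                // Simulated transformation: tag the document with the stage name.
                results.put(stage.id, input + "|" + stage.connectionName);
            }
        }
        return delivered;
    }
}
```

With a transformation feeding both an output and a second transformation that has its own output, each output receives the result of its own prerequisite. Real documents carry a binary stream, so the set-aside results would have to be actual copies, released in a finally block, rather than shared references as in this sketch.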

> Support multiple output connections for a single job
> ----------------------------------------------------
>                 Key: CONNECTORS-962
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
> Zaizi has a requirement to support multiple outputs for a single job.  In theory this
> requirement can be met by doing the following:
> - Allow multiple output connections, and multiple pipelines, per job
> - Keep a distinct ingeststatus record for each document/output combination
> - Modify WorkerThread to call IncrementalIndexer multiple times for every document fetched
> Places where different things need to happen are:
> - RepositoryDocument - because one binary stream will not do for multiple outputs
> - UI, obviously, because there will need to be multiple pipelines, not just one, and
> in addition it would probably be important to be able to "split" the pipeline at arbitrary
