manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CONNECTORS-962) Support multiple output connections for a single job
Date Wed, 11 Jun 2014 16:43:11 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027990#comment-14027990 ]

Karl Wright edited comment on CONNECTORS-962 at 6/11/14 4:42 PM:
-----------------------------------------------------------------

Solution to the RepositoryDocument "multiple consumers" problem:
- RepositoryDocument contains an input stream wrapper, which is backed by the original
  InputStream
- The wrapper "mints" a fresh InputStream whenever asked
- As the original stream is read, the data is stored locally in a temporary file, and the
  file name is kept by the wrapper object
- When the wrapper object is closed, the temporary file is deleted (if it was created in the
  first place)
- The tricky part: in order to know whether to create the temporary file, we *must* know at
  the start whether there will be more than one consumer of the stream.
  RepositoryDocument.setMultipleConsumers() would be a possibility, if called before the
  first read.  We can interrogate each connector in a pipeline to find out how many consumers
  there will be, so that we can set this parameter in the framework before any of them are
  called.
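
A minimal sketch of the wrapper idea, for illustration only. The class name
ReplayableStream and the method mint() are hypothetical and are not ManifoldCF APIs;
only the setMultipleConsumers() name comes from the proposal above. One simplification
is assumed: the sketch spools the whole original stream to the temporary file on the
first mint() call, rather than copying incrementally as the original stream is read.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch of a multi-consumer stream wrapper; not ManifoldCF code. */
class ReplayableStream implements Closeable {
  private final InputStream original;
  private boolean multipleConsumers = false;
  private boolean started = false;   // true once any stream has been handed out
  private Path tempFile = null;      // created only when there are multiple consumers

  ReplayableStream(InputStream original) {
    this.original = original;
  }

  /** Must be called before the first mint(), mirroring "before the first read". */
  void setMultipleConsumers(boolean value) {
    if (started) {
      throw new IllegalStateException("must be set before the first read");
    }
    multipleConsumers = value;
  }

  /** Hands out a stream positioned at the start of the document content. */
  InputStream mint() throws IOException {
    if (!multipleConsumers) {
      // Single consumer: no temporary file, the original stream is used directly.
      if (started) {
        throw new IllegalStateException("single-consumer stream already handed out");
      }
      started = true;
      return original;
    }
    started = true;
    if (tempFile == null) {
      // Simplification: spool everything up front instead of teeing during reads.
      tempFile = Files.createTempFile("repodoc", ".tmp");
      Files.copy(original, tempFile, StandardCopyOption.REPLACE_EXISTING);
    }
    return Files.newInputStream(tempFile);
  }

  /** Closes the original stream and deletes the temporary file if one was created. */
  @Override
  public void close() throws IOException {
    try {
      original.close();
    } finally {
      if (tempFile != null) {
        Files.deleteIfExists(tempFile);
        tempFile = null;
      }
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "document body".getBytes(StandardCharsets.UTF_8);
    try (ReplayableStream wrapper = new ReplayableStream(new ByteArrayInputStream(data))) {
      wrapper.setMultipleConsumers(true);
      for (int consumer = 1; consumer <= 2; consumer++) {
        try (InputStream in = wrapper.mint()) {
          String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
          System.out.println("consumer " + consumer + " read: " + text);
        }
      }
    }
  }
}
```

In this sketch the single-consumer path never touches the disk, which is the reason the
consumer count has to be known before the first read.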

This functionality is also very useful for transformation connectors, so even if the multiple
outputs logic is not implemented yet, I believe I will go ahead and write the stream wrapper
logic.



was (Author: kwright@metacarta.com):
Solution to the RepositoryDocument "multiple consumers" problem:
- RepositoryDocument actually contains an input stream wrapper, which is backed by the original
  InputStream
- The wrapper "mints" a new actual fresh InputStream whenever asked
- As the original stream is read, the data is stored locally in a temporary file, and the
file name is kept by the wrapper object
- When the wrapper object is closed, the temporary file is deleted (if it was created in the
first place)
- The tricky part: in order to know whether to create the temporary file, we *must* know at
the start whether there will be more than one consumer of the stream.  RepositoryDocument.setMultipleConsumers()
would be a possibility, if called before the first read.  We can interrogate each connector
in a pipeline to find out how many consumers there will be, so that we can set this parameter
in the framework before any of them are called.


> Support multiple output connections for a single job
> ----------------------------------------------------
>
>                 Key: CONNECTORS-962
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-962
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>
> Zaizi has a requirement to support multiple outputs for a single job.  In theory this
> requirement can be met by doing the following:
> - Allow multiple output connections, and multiple pipelines, per job
> - Keep a distinct ingeststatus record for each document/output combination
> - Modify WorkerThread to call IncrementalIndexer multiple times for every document fetched
> Places where different things need to happen are:
> - RepositoryDocument - because one binary stream will not do for multiple outputs
> - UI, obviously, because there will need to be multiple pipelines, not just one, and
>   in addition it would probably be important to be able to "split" the pipeline at arbitrary
>   points
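
The fan-out described in the issue can be pictured roughly as below. This is a sketch
under stated assumptions, not the framework's design: FanOutSketch, OutputPipeline, and
the in-memory status map are all hypothetical stand-ins, and the real WorkerThread,
IncrementalIndexer, and ingeststatus table are not modeled here. It only shows one
indexing call per output for each fetched document, with one status entry per
document/output combination.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; none of these types are ManifoldCF classes. */
class FanOutSketch {
  /** Stand-in for one output connection together with its pipeline. */
  interface OutputPipeline {
    String name();
    /** Indexes the document and returns a status string to record. */
    String index(String documentId, byte[] content);
  }

  // documentId -> (outputName -> recorded status): one entry per document/output pair.
  private final Map<String, Map<String, String>> ingestStatus = new HashMap<>();

  /** Called once per fetched document; indexes it into every configured output. */
  void process(String documentId, byte[] content, List<OutputPipeline> outputs) {
    for (OutputPipeline output : outputs) {
      String status = output.index(documentId, content);
      ingestStatus
          .computeIfAbsent(documentId, id -> new HashMap<>())
          .put(output.name(), status);
    }
  }

  /** Returns the recorded status for one document/output pair, or null if none. */
  String statusFor(String documentId, String outputName) {
    Map<String, String> perOutput = ingestStatus.get(documentId);
    return perOutput == null ? null : perOutput.get(outputName);
  }
}
```

The sketch passes a byte array to each output for simplicity; with real binary streams,
each output would need its own stream, which is the RepositoryDocument concern raised in
the issue.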



--
This message was sent by Atlassian JIRA
(v6.2#6252)
