manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Configuration Management at Transformation Connectors
Date Tue, 01 Jul 2014 17:34:49 GMT
I should also clarify another reason the current pipelineDescription
string design matters: without it, something that affects indexing or
transformation might *not* get properly included in the version string.
Such an omission would, of course, break incremental indexing, because
there would be no way to detect that a pertinent change had taken place.
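
To illustrate the point, here is a minimal sketch in plain Java. It does
not use the ManifoldCF API: the class name PipelineVersionSketch, the
settings map, and the length-prefixed encoding are all assumptions made for
this example. It shows one way to pack every setting that affects the output
into a single deterministic string and to unpack it again at indexing time,
so that any pertinent change produces a different string.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of ManifoldCF.
final class PipelineVersionSketch {

    private PipelineVersionSketch() {}

    /** Packs all settings into one string; keys are sorted so the result is deterministic. */
    static String pack(Map<String, String> settings) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(settings).entrySet()) {
            appendField(sb, e.getKey());
            appendField(sb, e.getValue());
        }
        return sb.toString();
    }

    /** Rebuilds the settings from a string produced by pack(). */
    static Map<String, String> unpack(String version) {
        Map<String, String> settings = new TreeMap<>();
        int[] pos = {0};
        while (pos[0] < version.length()) {
            String key = readField(version, pos);
            String value = readField(version, pos);
            settings.put(key, value);
        }
        return settings;
    }

    // Each field is written as "<length>:<text>", so the text may contain any character.
    private static void appendField(StringBuilder sb, String text) {
        sb.append(text.length()).append(':').append(text);
    }

    // Reads one length-prefixed field and advances the cursor held in pos[0].
    private static String readField(String s, int[] pos) {
        int colon = s.indexOf(':', pos[0]);
        int length = Integer.parseInt(s.substring(pos[0], colon));
        int start = colon + 1;
        pos[0] = start + length;
        return s.substring(start, start + length);
    }
}
```

If a setting is left out of pack(), two different configurations yield the
same string, and a later crawl has nothing to compare against. That is the
omission described above.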

Thanks,
Karl


On Tue, Jul 1, 2014 at 1:21 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Rafa,
>
> Let me answer one question at a time.
>
> bq. I would like to initialize the configuration object only once per job
> execution. Because the configuration is not supposed to be changed during a
> job execution, I would like to be able to take the configuration parameters
> from ConfigParams and from Specification objects and create a unique
> instance of my configuration object.
>
> Connection instances are all pooled and reused.  You need to read about
> their lifetime.  ManifoldCF in Action chapter 6 (IIRC) is where you will
> find this: https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/
> You should also be aware that there is *no* prohibition on configuration
> or specification changing during a job run; the framework is structured,
> however, so that you don't need to worry about this when writing your
> connector.
>
>
> bq. The getPipelineDescription method is quite confusing for me...
>
> Getting a version string and indexing a document may well be separated in
> time, and since it is possible for things to change in-between, the version
> string should be the basis of decisions your connector is making about how
> to do things.  The version string is what gets actually stored in the DB,
> so any differences will be picked up on later crawls.
>
> FWIW, the IRepositoryConnector interface predates the decision to not
> include a document specification for every method call, and that has
> persisted for backwards compatibility reasons, although in MCF 2.0 that may
> change.  The current design enforces proper connector coding.
>
> bq. In the addOrReplaceDocumentWithException, why is the
> pipelineDescription passed by parameter instead of the connector
> Specification...?
>
> See answer above.
>
>
> bq. Is there a way to reuse a single configuration object per job
> execution? In the Output processor connector, I used to initialize my
> custom stuff in the connect method (I'm not sure if this strategy is valid
> anyway), but for the Transformation connectors I'm not even sure if this
> method is called.
>
> You really aren't supposed to have a *single* object, but rather one per
> connection instance.  Connection instances are long-lived, remember.  That
> object should also expire eventually if there is no use.  There's a
> particular design pattern you should try to adhere to, which is to have a
> getSession() method that sets up your long-lived member object, and have
> the poll() method free it after a certain amount of inactivity.  Pretty
> much all connectors these days use this pattern; for a modern
> implementation, have a look at the Jira connector.
>
>
> FWIW, there's no MCF in Action chapter on transformation connectors yet,
> but they are quite similar to output connectors in many respects, so
> reading Chapter 9 may help a bit.
>
> Thanks,
> Karl
>
>
>
>
> On Tue, Jul 1, 2014 at 1:04 PM, Rafa Haro <rharo@apache.org> wrote:
>
>> Hi guys,
>>
>> I'm trying to develop my first Transformation Connector. Before starting
>> to code, I tried to read enough documentation first, and I also studied
>> the Tika extractor as an example of a transformation connector. Currently,
>> I'm just trying to implement an initial version of my connector, starting
>> with something simple and adding complexity later. The
>> first problem I'm facing is the configuration management, where I'm
>> probably missing something. In my case, I need a fixed configuration while
>> creating an instance of the connector and an extended configuration per job.
>> Let's say that the connector configuration has to set up a service and the
>> job configuration will define how the service should work for each job.
>> With both configurations, I need to create an object which is expensive to
>> instantiate. Here is where the doubts arise:
>>
>> 1. I would like to initialize the configuration object only once per job
>> execution. Because the configuration is not supposed to be changed during a
>> job execution, I would like to be able to take the configuration parameters
>> from ConfigParams and from Specification objects and create a unique
>> instance of my configuration object.
>>
>> 2. The getPipelineDescription method is quite confusing for me. In the
>> Tika Extractor, this method is used to pack in a string the configuration
>> of the Tika processor. Then this string is again unpacked in the
>> addOrReplaceDocumentWithException method to read the configuration. My
>> question is why? As far as I understand, the configuration can't change
>> during the job execution and according to the documentation "the contents
>> of the document cannot be considered by this method, and that a different
>> version string (defined in IRepositoryConnector) is used to describe the
>> version of the actual document". So, if only configuration data can be used
>> to create the output version string, probably this version string can be
>> checked by the system before starting the job and not produced and checked
>> per document because basically all the documents are going to produce the
>> same exact output version string. Probably I'm missing something but, for
>> example, looking at Tika Transformation connector seems to be pretty clear
>> that there would be no difference between output version strings for all
>> the documents because it is using only configuration data to create the
>> string.
>>
>> 3. In the addOrReplaceDocumentWithException, why is the
>> pipelineDescription passed by parameter instead of the connector
>> Specification, which would let the developer access the configuration
>> without marshalling and unmarshalling it?
>>
>> 4. Is there a way to reuse a single configuration object per job
>> execution? In the Output processor connector, I used to initialize my
>> custom stuff in the connect method (I'm not sure if this strategy is valid
>> anyway), but for the Transformation connectors I'm not even sure if this
>> method is called.
>>
>> Thanks a lot in advance for your help. Please note that the questions are
>> of course not intended as criticism. This mail is just a dump of doubts,
>> and the answers will probably help me better understand the workflows in
>> ManifoldCF.
>>
>
>
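
The getSession()/poll() pattern described in the quoted reply can be
sketched roughly as follows. This is plain Java written for illustration,
not the Jira connector's code and not the ManifoldCF interfaces: the class
SessionPatternSketch, the ExpensiveService stand-in, the injected clock, and
the buildCount field are all hypothetical. The idea is that each connection
instance lazily builds its own expensive object, refreshes a timestamp on
every use, and lets a periodic poll() call release the object after a period
of inactivity.

```java
import java.util.function.LongSupplier;

// Hypothetical connector-like class, not a ManifoldCF type.
class SessionPatternSketch {

    /** Stand-in for whatever object is expensive to instantiate. */
    static final class ExpensiveService {
        final String endpoint;
        ExpensiveService(String endpoint) { this.endpoint = endpoint; }
    }

    private final String endpoint;      // taken from the connection configuration
    private final long idleTimeoutMs;   // how long an unused session may live
    private final LongSupplier clock;   // injected so the timing can be tested

    private ExpensiveService session = null;
    private long lastUsed = -1L;
    int buildCount = 0;                 // counts how often the session was built

    SessionPatternSketch(String endpoint, long idleTimeoutMs, LongSupplier clock) {
        this.endpoint = endpoint;
        this.idleTimeoutMs = idleTimeoutMs;
        this.clock = clock;
    }

    /** Builds the session on first use, then reuses it and refreshes the timestamp. */
    ExpensiveService getSession() {
        if (session == null) {
            session = new ExpensiveService(endpoint);
            buildCount++;
        }
        lastUsed = clock.getAsLong();
        return session;
    }

    /** Called periodically; frees the session once it has been idle long enough. */
    void poll() {
        if (session != null && clock.getAsLong() - lastUsed >= idleTimeoutMs) {
            session = null;
            lastUsed = -1L;
        }
    }

    boolean isConnected() {
        return session != null;
    }
}
```

A caller would construct it with something like
new SessionPatternSketch("http://example.invalid", 300000L, System::currentTimeMillis)
and call getSession() at the start of every method that needs the service.
Because the object belongs to one connection instance rather than to a job,
it is rebuilt on demand after expiring.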
