manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Configuration Management at Transformation Connectors
Date Tue, 01 Jul 2014 17:21:05 GMT
Hi Rafa,

Let me answer one question at a time.

bq. I would like to initialize the configuration object only once per job
execution. Because the configuration is not supposed to be changed during a
job execution, I would like to be able to take the configuration parameters
from ConfigParams and from Specification objects and create a unique
instance of my configuration object.

Connection instances are all pooled and reused.  You need to read about
their lifetime.  ManifoldCF in Action chapter 6 (IIRC) is where you will
find this: https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/
You should also be aware that there is *no* prohibition on configuration or
specification changing during a job run; the framework is structured,
however, so that you don't need to worry about this when writing your
connector.


bq. The getPipelineDescription method is quite confusing for me...

Getting a version string and indexing a document may well be separated in
time.  Since things can change in between, your connector should base its
decisions about how to process a document on the version string.  The
version string is what actually gets stored in the DB, so any differences
will be picked up on later crawls.
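
To illustrate the idea, here is a minimal sketch of packing configuration
values into a version string and unpacking them again at indexing time.  It
does not use the ManifoldCF API: the class VersionStringSketch and its
pack() and unpack() methods are hypothetical names invented for this
example, and the '+' delimiter with backslash escaping is an assumed
encoding, not necessarily what the Tika extractor does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ManifoldCF. It shows one way to turn
// configuration values into a single string and recover them later.
final class VersionStringSketch {

    // Join the values with '+' terminators, escaping '+' and '\' so that
    // any value (including an empty one) survives the round trip.
    static String pack(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (String value : values) {
            for (char c : value.toCharArray()) {
                if (c == '\\' || c == '+') {
                    sb.append('\\');
                }
                sb.append(c);
            }
            sb.append('+');
        }
        return sb.toString();
    }

    // Reverse of pack(): split on unescaped '+' and drop the escapes.
    static List<String> unpack(String packed) {
        List<String> values = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < packed.length(); i++) {
            char c = packed.charAt(i);
            if (c == '\\' && i + 1 < packed.length()) {
                current.append(packed.charAt(++i));
            } else if (c == '+') {
                values.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        return values;
    }
}
```

In a connector, something like pack() would run where the version string is
produced, and something like unpack() would run on the string handed to the
indexing method, so the document is processed with the settings that were
recorded for it rather than whatever the specification says at that moment.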

FWIW, the IRepositoryConnector interface predates the decision to not
include a document specification for every method call, and that has
persisted for backwards compatibility reasons, although in MCF 2.0 that may
change.  The current design enforces proper connector coding.

bq. In the addOrReplaceDocumentWithException, why is the
pipelineDescription passed by parameter instead of the connector
Specification...?

See answer above.


bq. Is there a way to reuse a single configuration object per job
execution? In the Output processor connector, I used to initialize my
custom stuff in the connect method (I'm not sure if this strategy is valid
anyway), but for the Transformation connectors I'm not even sure if this
method is called.

You really aren't supposed to have a *single* object, but rather one per
connection instance.  Connection instances are long-lived, remember.  That
object should also expire eventually if it goes unused.  There's a
particular design pattern you should try to adhere to, which is to have a
getSession() method that sets up your long-lived member object, and have
the poll() method free it after a certain amount of inactivity.  Pretty
much all connectors these days use this pattern; for a modern
implementation, have a look at the Jira connector.
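
A minimal sketch of that pattern follows.  It is deliberately independent of
the ManifoldCF base classes: only the method names getSession() and poll()
come from the description above, while the class SessionSketch, its
constructor arguments, the injected clock, and isConnected() are assumptions
made so the example is self-contained.  A real connector would build the
session from its own configuration parameters.

```java
import java.util.function.Consumer;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch, not ManifoldCF code. Holds one expensive object per
// connection instance, creates it lazily, and frees it after inactivity.
final class SessionSketch<S> {
    private final Supplier<S> opener;      // builds the expensive object
    private final Consumer<S> closer;      // releases it
    private final LongSupplier clock;      // current time in milliseconds
    private final long idleTimeoutMillis;

    private S session;                     // null while no session is open
    private long lastUsedMillis;

    SessionSketch(Supplier<S> opener, Consumer<S> closer,
                  LongSupplier clock, long idleTimeoutMillis) {
        this.opener = opener;
        this.closer = closer;
        this.clock = clock;
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    // Called at the start of every operation that needs the session.
    S getSession() {
        if (session == null) {
            session = opener.get();
        }
        lastUsedMillis = clock.getAsLong();
        return session;
    }

    // Called periodically; frees the session once it has been idle too long.
    void poll() {
        if (session != null
                && clock.getAsLong() - lastUsedMillis >= idleTimeoutMillis) {
            closer.accept(session);
            session = null;
        }
    }

    boolean isConnected() {
        return session != null;
    }
}
```

The point of the design is that the expensive object lives as long as the
pooled connection instance is actively used, and is rebuilt on demand by the
next getSession() call after an idle period.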


FWIW, there's no MCF in Action chapter on transformation connectors yet,
but they are quite similar to output connectors in many respects, so
reading Chapter 9 may help a bit.

Thanks,
Karl




On Tue, Jul 1, 2014 at 1:04 PM, Rafa Haro <rharo@apache.org> wrote:

> Hi guys,
>
> I'm trying to develop my first Transformation Connector. Before starting
> to code, I have tried to read first enough documentation and I have also
> studied the Tika extractor as transformation connector example. Currently,
> I'm just trying to implement an initial version of my connector, starting
> with something simple to later complicate the things a little bit. The
> first problem I'm facing is the configuration management, where I'm
> probably missing something. In my case, I need a fixed configuration while
> creating an instance of the connector and an extended configuration per job.
> Let's say that the connector configuration has to set up a service and the
> job configuration will define how the service should work for each job.
> With both configurations, I need to create an object which is expensive to
> instantiate. Here is where the doubts arise:
>
> 1. I would like to initialize the configuration object only once per job
> execution. Because the configuration is not supposed to be changed during a
> job execution, I would like to be able to take the configuration parameters
> from ConfigParams and from Specification objects and create a unique
> instance of my configuration object.
>
> 2. The getPipelineDescription method is quite confusing for me. In the
> Tika Extractor, this method is used to pack in a string the configuration
> of the Tika processor. Then this string is again unpacked in the
> addOrReplaceDocumentWithException method to read the documentation. My
> question is why? As far as I understand, the configuration can't change
> during the job execution and according to the documentation "the contents
> of the document cannot be considered by this method, and that a different
> version string (defined in IRepositoryConnector) is used to describe the
> version of the actual document". So, if only configuration data can be used
> to create the output version string, probably this version string can be
> checked by the system before starting the job and not produced and checked
> per document because basically all the documents are going to produce the
> same exact output version string. Probably I'm missing something but, for
> example, looking at Tika Transformation connector seems to be pretty clear
> that there would be no difference between output version strings for all
> the documents because it is using only configuration data to create the
> string.
>
> 3. In the addOrReplaceDocumentWithException, why is the
> pipelineDescription passed by parameter instead of the connector
> Specification to ease the developer to access the configuration without
> marshalling and unmarshalling it?
>
> 4. Is there a way to reuse a single configuration object per job
> execution? In the Output processor connector, I used to initialize my
> custom stuff in the connect method (I'm not sure if this strategy is valid
> anyway), but for the Transformation connectors I'm not even sure if this
> method is called.
>
> Thanks a lot for your help beforehand. Please note that the questions of
> course are not intended to be criticism. This mail is just a dump of doubts
> that probably will help me to better understand the workflows in ManifoldCF.
>
