manifoldcf-user mailing list archives

From lalit jangra <>
Subject Re: Reconciliation of documents crawled
Date Tue, 16 Sep 2014 06:44:21 GMT
Thanks Karl,

Compared to the three methods you suggested, I believe writing to a
file would be the easiest; please correct me if I am wrong.

My initial thought was that, while the job is running, I would write out
counter values for each document seeded and processed, since the
addSeedDocument() & processDocument() methods are called for each document.
In that case it would not be easy to reconcile after the job is complete,
as I have loads of data once the job finishes and mapping it all would be
tough. That is why I am trying to avoid a file-based mechanism. I would
also hit the tracking issue, since the connector object is called multiple
times and multiple agents run in parallel.

Please suggest.
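To make the counter idea above concrete, here is a minimal sketch of per-job in-memory counters. The JobCounters class and its method names are hypothetical, not part of ManifoldCF; and as Karl points out, counts held this way are visible only inside one agent JVM and are lost on restart, which is exactly why this approach cannot reconcile across multiple agents.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-job counters; JobCounters is NOT a ManifoldCF class.
// Counts live only in this JVM's memory, so with multiple agent
// processes (or after a restart) they cannot give a reliable total.
public class JobCounters {
    private static final ConcurrentHashMap<String, AtomicLong> seeded =
            new ConcurrentHashMap<>();
    private static final ConcurrentHashMap<String, AtomicLong> processed =
            new ConcurrentHashMap<>();

    // Called (hypothetically) once per document handed to addSeedDocument().
    public static void recordSeeded(String jobId) {
        seeded.computeIfAbsent(jobId, k -> new AtomicLong()).incrementAndGet();
    }

    // Called (hypothetically) once per document handed to processDocument().
    public static void recordProcessed(String jobId) {
        processed.computeIfAbsent(jobId, k -> new AtomicLong()).incrementAndGet();
    }

    // Documents seeded but not yet processed, as seen by THIS agent only.
    public static long unprocessed(String jobId) {
        long s = seeded.getOrDefault(jobId, new AtomicLong()).get();
        long p = processed.getOrDefault(jobId, new AtomicLong()).get();
        return s - p;
    }
}
```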


On Tue, Sep 16, 2014 at 11:59 AM, Karl Wright <> wrote:

> Hi Lalit,
> So, let me clarify: you want some independent measure of whether every
> document seeded, per job, has in fact been processed?
> If that is a correct statement, there is by definition no "in code" way to
> do it, since there are multiple agents running in your setup. Each agent
> may process some of the documents, and certainly no agent will process all
> of them.  Also, restarting any agent process will lose the information you
> are attempting to record.
> So you are stuck with three possibilities:
> The first possibility is to use [INFO] statements written to the log.
> This would work, but you don't have the information you need in your
> connector (specifically the job ID), so you would have to add these logging
> statements to various places in the ManifoldCF framework.
> The second possibility is to make use of the history database table, where
> events are recorded.  You could create two new activity types, also written
> within the framework, for tracking seeding of records and for tracking
> processing of records.  There are already activity types for job start and
> end.
> Finally, the third possibility: if you absolutely must avoid the file
> system, you would have to write a tracking process which allowed ManifoldCF
> threads to connect via sockets and communicate document seeding and
> processing events.  Once again, within the framework, you would transmit
> events to the recording process.  This system would be at risk of losing
> tracking data when your tracking process needed to be restarted, however.
> None of these are trivial to implement.  Essentially, keeping track of
> documents is what MCF uses the database for in the first place, so this
> requirement is like insisting that there be a second ManifoldCF there to be
> sure that the first one did the right thing.  It's an incredible waste of
> resources, frankly.  Using the log is perhaps the simplest to implement and
> most consistent with what clients might be expecting, but it has very
> significant I/O costs.  Using the history table has a similar problem,
> while also putting your database under load.  The last solution requires a
> lot of well-constructed code and remains vulnerable to system instability.
> Take your pick.
> Thanks,
> Karl
> On Tue, Sep 16, 2014 at 12:54 AM, lalit jangra <>
> wrote:
>> Greetings,
>> As part of my implementation, I need to put a reconciliation mechanism in
>> place where it can be verified how many documents have been crawled for a
>> job, and the same can be displayed in the logs.
>> The first thing that came to my mind was to put counters in, e.g., the
>> CMIS connector code in the addSeed() and processDocuments() methods and
>> increase them as we progress, but as I could see for CMIS, since the
>> connector is getting called for each seeded document to be ingested, these
>> counters are not accurate. Is there any way I can persist these counters
>> within the code itself, as I do not want to persist them in the file
>> system?
>> Please suggest.
>> --
>> Regards,
>> Lalit.
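Karl's first possibility (INFO statements in the log) could be reconciled after the fact with a sketch like the one below. The log line format (`SEEDED job=<id> doc=<id>` / `PROCESSED job=<id> doc=<id>`) is an assumption for illustration; ManifoldCF does not emit such lines today, and as Karl notes they would have to be added inside the framework where the job ID is known.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical post-run reconciliation over INFO log lines of the form:
//   SEEDED job=<jobId> doc=<docId>
//   PROCESSED job=<jobId> doc=<docId>
// These lines are an assumed format, not something ManifoldCF emits.
public class LogReconciler {
    // Returns, per job ID, the set of documents seeded but never processed.
    public static Map<String, Set<String>> unprocessed(List<String> logLines) {
        Map<String, Set<String>> pending = new HashMap<>();
        for (String line : logLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 3) continue;  // not one of our tracking lines
            if (!parts[1].startsWith("job=") || !parts[2].startsWith("doc="))
                continue;
            String jobId = parts[1].substring("job=".length());
            String docId = parts[2].substring("doc=".length());
            if (parts[0].equals("SEEDED")) {
                pending.computeIfAbsent(jobId, k -> new HashSet<>()).add(docId);
            } else if (parts[0].equals("PROCESSED")) {
                Set<String> docs = pending.get(jobId);
                if (docs != null) docs.remove(docId);
            }
        }
        return pending;
    }
}
```

An empty set for a job means every seeded document was also processed; anything left over is the reconciliation gap.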

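Karl's third possibility, a separate tracking process that ManifoldCF threads report to over sockets, might look roughly like this. Everything here (the class, the `SEED`/`PROC` wire format) is a hypothetical sketch, and it illustrates the weakness Karl describes: the counts are held in memory, so restarting the tracker loses them.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-alone tracking process (not part of ManifoldCF).
// Crawler threads would connect and send one event per line:
//   SEED <jobId>   or   PROC <jobId>
// All state is in memory, so it is lost if this process restarts.
public class TrackerServer {
    final ConcurrentHashMap<String, AtomicLong> seeded = new ConcurrentHashMap<>();
    final ConcurrentHashMap<String, AtomicLong> processed = new ConcurrentHashMap<>();

    // Parse one event line and bump the matching per-job counter.
    void handleLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) return;  // ignore malformed lines
        ConcurrentHashMap<String, AtomicLong> target;
        if (parts[0].equals("SEED")) target = seeded;
        else if (parts[0].equals("PROC")) target = processed;
        else return;
        target.computeIfAbsent(parts[1], k -> new AtomicLong()).incrementAndGet();
    }

    // Documents seeded but not yet reported as processed for a job.
    long backlog(String jobId) {
        long s = seeded.getOrDefault(jobId, new AtomicLong()).get();
        long p = processed.getOrDefault(jobId, new AtomicLong()).get();
        return s - p;
    }

    // Accept one connection at a time and consume its event lines.
    void serve(ServerSocket server) throws Exception {
        while (true) {
            try (Socket client = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) handleLine(line);
            }
        }
    }
}
```

A production version would need durable storage precisely because of the restart problem, at which point, as Karl observes, you have rebuilt the bookkeeping ManifoldCF's database already does.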
