Hi Lalit,

So, let me clarify: you want some independent measure of whether every document seeded, per job, has in fact been processed?

If that is a correct statement, there is by definition no "in code" way to do it, since there are multiple agents running in your setup. Each agent may process some of the documents, and certainly no agent will process all of them.  Also, restarting any agents process will lose the information you are attempting to record.

So you are stuck with three possibilities:

The first possibility is to use [INFO] statements written to the log.  This would work, but you don't have the information you need in your connector (specifically the job ID), so you would have to add these logging statements to various places in the ManifoldCF framework.
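As a rough sketch of what the log approach could look like: suppose the framework wrote one line per seeding event and one per processing event, in a made-up format such as "[INFO] SEEDED job=<jobId> doc=<docId>". Neither that format nor the class name LogReconciler exists in ManifoldCF today; both are assumptions for illustration only. A small offline reconciler could then read the log and list, per job, the documents that were seeded but never processed:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log format (NOT something ManifoldCF emits today):
//   [INFO] SEEDED job=<jobId> doc=<docId>
//   [INFO] PROCESSED job=<jobId> doc=<docId>
final class LogReconciler {
    private static final Pattern EVENT =
        Pattern.compile("\\[INFO\\] (SEEDED|PROCESSED) job=(\\S+) doc=(\\S+)");

    // Returns, per job ID, the documents that were seeded but never processed.
    // Jobs with nothing outstanding are left out of the result.
    static Map<String, Set<String>> unprocessed(List<String> logLines) {
        Map<String, Set<String>> seeded = new LinkedHashMap<>();
        Map<String, Set<String>> processed = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = EVENT.matcher(line);
            if (!m.find()) {
                continue; // unrelated log line
            }
            Map<String, Set<String>> target =
                m.group(1).equals("SEEDED") ? seeded : processed;
            target.computeIfAbsent(m.group(2), k -> new LinkedHashSet<>())
                  .add(m.group(3));
        }
        Map<String, Set<String>> missing = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : seeded.entrySet()) {
            Set<String> left = new LinkedHashSet<>(e.getValue());
            Set<String> done = processed.get(e.getKey());
            if (done != null) {
                left.removeAll(done);
            }
            if (!left.isEmpty()) {
                missing.put(e.getKey(), left);
            }
        }
        return missing;
    }
}
```

Because the reconciler works on sets of document IDs rather than raw counts, a document that is logged twice (for example after an agents process restart) is not double-counted.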

The second possibility is to make use of the history database table, where events are recorded.  You could create two new activity types, also written within the framework, for tracking seeding of records and for tracking processing of records.  There are already activity types for job start and end.

Finally, the third possibility: If you must absolutely avoid the file system, you would have to write a tracking process which allowed ManifoldCF threads to connect via sockets and communicate document seeding and processing events.  Once again, within the framework, you would transmit events to the recording process.  This system would be at risk of losing tracking data when your tracking process needed to be restarted, however.
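To make the third option concrete, here is a minimal sketch of such a tracking process. Everything in it is an assumption for illustration: the class name TrackingServer, the one-line-per-event protocol ("SEEDED <jobId> <docId>" or "PROCESSED <jobId> <docId>"), and the "OK"/"ERR" replies. It also assumes the seeding event for a document reaches the tracker before its processing event. The framework-side code that would send these events is not shown.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker; protocol lines are "SEEDED <jobId> <docId>"
// or "PROCESSED <jobId> <docId>", each answered with "OK" or "ERR".
final class TrackingServer implements AutoCloseable {
    private final ServerSocket server;
    // jobId -> documents seeded but not yet processed. Held in memory only,
    // so it is lost whenever this tracking process restarts.
    private final Map<String, Set<String>> pending = new ConcurrentHashMap<>();

    TrackingServer() throws IOException {
        // Port 0 lets the OS pick a free port; bind to loopback only.
        server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        Thread acceptor = new Thread(this::acceptLoop, "tracker-accept");
        acceptor.setDaemon(true);
        acceptor.start();
    }

    int port() {
        return server.getLocalPort();
    }

    Set<String> pendingFor(String jobId) {
        Set<String> docs = pending.get(jobId);
        return docs == null ? Collections.<String>emptySet() : docs;
    }

    private void acceptLoop() {
        while (!server.isClosed()) {
            try {
                Socket client = server.accept();
                Thread worker = new Thread(() -> serve(client), "tracker-client");
                worker.setDaemon(true);
                worker.start();
            } catch (IOException e) {
                return; // server socket was closed
            }
        }
    }

    private void serve(Socket client) {
        try (Socket c = client;
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(c.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(
                 new OutputStreamWriter(c.getOutputStream(), StandardCharsets.UTF_8), true)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(" ", 3);
                if (parts.length != 3) {
                    out.println("ERR");
                    continue;
                }
                Set<String> docs =
                    pending.computeIfAbsent(parts[1], k -> ConcurrentHashMap.<String>newKeySet());
                if (parts[0].equals("SEEDED")) {
                    docs.add(parts[2]);
                } else if (parts[0].equals("PROCESSED")) {
                    docs.remove(parts[2]);
                } else {
                    out.println("ERR");
                    continue;
                }
                out.println("OK"); // the event is recorded before the sender is acknowledged
            }
        } catch (IOException e) {
            // Connection dropped; events already recorded are kept.
        }
    }

    @Override
    public void close() throws IOException {
        server.close();
    }
}
```

Even this toy version needs a thread per connection, concurrent collections, and an acknowledgement protocol, and it still forgets everything on restart.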

None of these are trivial to implement.  Essentially, keeping track of documents is what MCF uses the database for in the first place, so this requirement is like insisting that there be a second ManifoldCF there to be sure that the first one did the right thing.  It's an incredible waste of resources, frankly.  Using the log is perhaps the simplest to implement and most consistent with what clients might be expecting, but it has very significant I/O costs.  Using the history table has a similar problem, while also putting your database under load.  The last solution requires a lot of well-constructed code and remains vulnerable to system instability.  Take your pick.



On Tue, Sep 16, 2014 at 12:54 AM, lalit jangra <lalit.j.jangra@gmail.com> wrote:
Greetings,

As part of our implementation, I need to put a reconciliation mechanism in place that verifies how many documents have been crawled for a job and displays the result in the logs.

The first thing that came to mind was to put counters in the connector code, e.g. in the CMIS connector's addSeed() and processDocuments() methods, and increment them as we progress. But as far as I can see, CmisRepositoryConnector.java is called for each seeded document to be ingested, so these counters are not accurate. Is there any way to persist these counters within the code itself? I do not want to persist them in the file system.

Please suggest.