manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Reconciliation of documents crawled
Date Tue, 16 Sep 2014 06:48:06 GMT
If you are going to write to a file, you might as well write to the log
file, since that mechanism is already available.
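
To illustrate what reconciling from the log afterwards could look like, here
is a minimal sketch in Java. It assumes hypothetical log lines of the form
"[INFO] SEEDED job=<id> doc=<id>" and "[INFO] PROCESSED job=<id> doc=<id>".
ManifoldCF does not emit these lines on its own; they stand in for whatever
logging statements you add yourself, and the class and method names below are
invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Offline reconciliation over log lines. The SEEDED/PROCESSED line format is
// an assumption: it is whatever your own added logging statements emit.
class LogReconciler {
    private static final Pattern EVENT = Pattern.compile(
        "\\[INFO\\]\\s+(SEEDED|PROCESSED)\\s+job=(\\S+)\\s+doc=(\\S+)");

    // Returns, per job ID, the documents that were seeded but never processed.
    static Map<String, Set<String>> unprocessed(List<String> logLines) {
        Map<String, Set<String>> seeded = new LinkedHashMap<>();
        Map<String, Set<String>> processed = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = EVENT.matcher(line);
            if (!m.find()) {
                continue; // unrelated log line
            }
            Map<String, Set<String>> target =
                m.group(1).equals("SEEDED") ? seeded : processed;
            // Sets collapse duplicates written by several agents or after restarts.
            target.computeIfAbsent(m.group(2), k -> new LinkedHashSet<>())
                  .add(m.group(3));
        }
        Map<String, Set<String>> missing = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : seeded.entrySet()) {
            Set<String> rest = new LinkedHashSet<>(e.getValue());
            rest.removeAll(processed.getOrDefault(e.getKey(), Set.of()));
            if (!rest.isEmpty()) {
                missing.put(e.getKey(), rest);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "[INFO] SEEDED job=job1 doc=a",
            "[INFO] SEEDED job=job1 doc=b",
            "[INFO] PROCESSED job=job1 doc=a");
        System.out.println(unprocessed(lines));
    }
}
```

Because the comparison runs over the finished log rather than in-memory
counters, it does not matter which agent wrote which line or whether an agent
was restarted in between.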

Karl




On Tue, Sep 16, 2014 at 2:44 AM, lalit jangra <lalit.j.jangra@gmail.com>
wrote:

> Thanks Karl,
>
> Of the three methods you suggested, I believe writing to a file would be
> the easiest; correct me if I am wrong.
>
> What I initially thought was that, while the job is running, I would need
> to write counter values for each document seeded and processed, since we
> are calling the addSeedDocument() & processDocument() methods for each
> document. In that case it would not be easy to reconcile after the job
> completes, as I have loads of data once the job finishes and mapping it
> would be tough. This is why I am trying to avoid a file-based mechanism.
> I would also hit the tracking issue, since the connector object is called
> multiple times and multiple agents run in parallel.
>
> Please suggest.
>
> Regards.
>
> On Tue, Sep 16, 2014 at 11:59 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Lalit,
>>
>> So, let me clarify: you want some independent measure as to whether every
>> document seeded, per job, has been in fact processed?
>>
>> If that is a correct statement, there is by definition no "in code" way
>> to do it, since there are multiple agents running in your setup. Each agent
>> may process some of the documents, and certainly no agent will process all
>> of them.  Also, restarting any agents process will lose the information you
>> are attempting to record.
>>
>> So you are stuck with three possibilities:
>>
>> The first possibility is to use [INFO] statements written to the log.
>> This would work, but you don't have the information you need in your
>> connector (specifically the job ID), so you would have to add these logging
>> statements to various places in the ManifoldCF framework.
>>
>> The second possibility is to make use of the history database table,
>> where events are recorded.  You could create two new activity types, also
>> written within the framework, for tracking seeding of records and for
>> tracking processing of records.  There are already activity types for job
>> start and end.
>>
>> Finally, the third possibility: If you must absolutely avoid the file
>> system, you would have to write a tracking process which allowed ManifoldCF
>> threads to connect via sockets and communicate document seeding and
>> processing events.  Once again, within the framework, you would transmit
>> events to the recording process.  This system would be at risk of losing
>> tracking data when your tracking process needed to be restarted, however.
>>
>> None of these are trivial to implement.  Essentially, keeping track of
>> documents is what MCF uses the database for in the first place, so this
>> requirement is like insisting that there be a second ManifoldCF there to be
>> sure that the first one did the right thing.  It's an incredible waste of
>> resources, frankly.  Using the log is perhaps the simplest to implement and
>> most consistent with what clients might be expecting, but it has very
>> significant I/O costs.  Using the history table has a similar problem,
>> while also putting your database under load.  The last solution requires a
>> lot of well-constructed code and remains vulnerable to system instability.
>> Take your pick.
>>
>> Karl
>>
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Sep 16, 2014 at 12:54 AM, lalit jangra <lalit.j.jangra@gmail.com>
>> wrote:
>>
>>> Greetings,
>>>
>>> As part of an implementation, I need to put a reconciliation mechanism in
>>> place so that it can be verified how many documents have been crawled for
>>> a job, and so that the count can be displayed in the logs.
>>>
>>> The first thing that came to mind was to put counters in, e.g., the CMIS
>>> connector code, in the addSeed() and processDocuments() methods, and
>>> increment them as we progress. But as far as I can see, for CMIS,
>>> CmisRepositoryConnector.java is called for each seeded document to be
>>> ingested, so these counters are not accurate. Is there any way to persist
>>> these counters within the code itself? I do not want to persist them in
>>> the file system.
>>>
>>> Please suggest.
>>> --
>>> Regards,
>>> Lalit.
>>>
>>
>>
>
>
> --
> Regards,
> Lalit.
>
