manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Scheduled ManifoldCF jobs
Date Fri, 08 Apr 2016 14:54:54 GMT
The Documentum connector does one slightly funky thing when seeding: it
widens the query window to compensate for clock skew, as follows:

>>>>>>
      // There seems to be some unexplained slop in the latest DCTM version.
      // It misses documents depending on how close to the r_modify_date
      // you happen to be.
      // So, I've decreased the start time by a full five minutes, to
      // insure overlap.
      if (startTime > 300000L)
        startTime = startTime - 300000L;
      else
        startTime = 0L;
      StringBuilder strDQLend = new StringBuilder(" where r_modify_date >= " + buildDateString(startTime) +
        " and r_modify_date<=" + buildDateString(seedTime) +
        " AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND r_content_size>0");

<<<<<<

The 300000 ms adjustment is five minutes, which doesn't seem like a lot but
maybe it is affecting your testing?
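
To make the effect of that overlap concrete, here is a minimal standalone
sketch of the window arithmetic. It is an illustration only: the class
SeedWindow and the helpers adjustedStart and inWindow are hypothetical names,
not part of the ManifoldCF or Documentum APIs, and the timestamps are made up.
It only mirrors the clamping and the inclusive date range from the excerpt
above.

```java
public final class SeedWindow {
    // Overlap used by the quoted connector code: 300000 ms = five minutes.
    static final long OVERLAP_MS = 300000L;

    // Hypothetical helper: move the window start back by the overlap,
    // clamping at 0 so the result never goes negative.
    static long adjustedStart(long startTime) {
        return (startTime > OVERLAP_MS) ? startTime - OVERLAP_MS : 0L;
    }

    // Hypothetical helper: true if a modification time falls inside the
    // inclusive window [adjustedStart(startTime), seedTime].
    static boolean inWindow(long modifyTime, long startTime, long seedTime) {
        return modifyTime >= adjustedStart(startTime) && modifyTime <= seedTime;
    }

    public static void main(String[] args) {
        long lastSeed = 1_000_000L; // made-up time of the previous seeding pass
        long now = 1_060_000L;      // made-up time of the current seeding pass

        System.out.println(adjustedStart(lastSeed));            // 700000
        System.out.println(inWindow(900_000L, lastSeed, now));  // true
        System.out.println(inWindow(600_000L, lastSeed, now));  // false
    }
}
```

In this sketch, a document modified up to five minutes before the previous
seeding pass is picked up again, so a test that modifies documents and reruns
the job within a few minutes may see more documents than expected.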

Karl


On Fri, Apr 8, 2016 at 10:50 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Radko,
>
> There's no magic here; the seedingversion from the database is passed to
> the connector method which seeds documents.  The only way this version gets
> cleared is if you save the job and the document specification changes.
>
> The only other possibility I can think of is that the documentum connector
> is ignoring the seedingversion information.  I will look into this further
> over the weekend.
>
> Karl
>
>
>
>
>
> On Fri, Apr 8, 2016 at 10:33 AM, Najman, Radko <radko.najman@merck.com>
> wrote:
>
>> Hi Karl,
>>
>> thanks for your clarification.
>>
>> I’m not changing any document specification information. I just set
>> “Scheduled time” and “Job invocation” on “Scheduling” tab, “Start method”
>> on “Connection” tab and click “Save” button. That’s all.
>>
>> I tried to set all the scheduling information directly in the Postgres
>> database to be sure I didn’t change any document specification
>> information, and the result was the same: all documents were recrawled.
>>
>> One more thing I tried was to update “seedingversion” in “jobs” table
>> but again all documents were recrawled.
>>
>> Thanks,
>> Radko
>>
>>
>>
>> From: Karl Wright <daddywri@gmail.com>
>> Reply-To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>> Date: Friday 1 April 2016 at 14:30
>> To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>> Subject: Re: Scheduled ManifoldCF jobs
>>
>> Sorry, that response was *almost* incoherent. :-)
>>
>> Trying again:
>>
>> As far as how MCF computes incremental changes, it does not matter
>> whether a job is run on schedule, or manually.  But if you change certain
>> aspects of the job, namely the document specification information, MCF
>> "starts over" at the beginning of time.  It needs to do that because you
>> might well have made changes to the document specification that could
>> change the way documents are indexed.
>>
>> Thanks,
>> Karl
>>
>>
>> On Fri, Apr 1, 2016 at 6:36 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Radko,
>>>
>>> For computing how MCF does job crawling, it does not care whether the
>>> job is run manually or by schedule.
>>>
>>> The issue is likely to be that you changed some other detail about the
>>> job definition that might have affected how documents are indexed.  In that
>>> case, MCF would cause all documents to be recrawled because of that.
>>> Changes to a job's document specification information will cause that to be
>>> the case.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Fri, Apr 1, 2016 at 3:40 AM, Najman, Radko wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a few jobs crawling documents from Documentum. Some of these
>>>> jobs are quite big and the first run of the job takes a few hours or a day
>>>> to finish. Then, when I do a “minimal run” for updates, the job is usually
>>>> done in a few minutes.
>>>>
>>>> I want to schedule these jobs for daily runs. I’m finding that the
>>>> first scheduled run takes as long as the first manual run did. It
>>>> seems to be recrawling all documents. Subsequent scheduled runs are
>>>> fast, a few minutes. Is this expected behaviour? I would expect the first
>>>> scheduled run to be fast too, because the job had already finished after
>>>> a manual start. Is there a way to avoid recrawling all documents in this
>>>> case? It’s a really time-consuming operation.
>>>>
>>>> My settings:
>>>> Schedule type: Scan every document once
>>>> Job invocation: Minimal
>>>> Scheduled time: once a day
>>>> Start method: Start when schedule window starts
>>>>
>>>> Thank you,
>>>> Radko
>>>>
