manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Scheduled ManifoldCF jobs
Date Fri, 08 Apr 2016 18:31:25 GMT
Even further downstream, it still all looks good:

>>>>>>
Jetty started.
Starting crawler...
Scheduled job start; requestMinimum = true
Starting job with requestMinimum = true
When starting the job, requestMinimum = true
<<<<<<

So at the moment I am at a loss.

Karl


On Fri, Apr 8, 2016 at 2:20 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Radko,
>
> I set the same settings you did and instrumented the code.  It records the
> minimum job request:
>
> >>>>>>
> Jetty started.
> Starting crawler...
> Scheduled job start; requestMinimum = true
> Starting job with requestMinimum = true
> <<<<<<
>
> This is the first run of the job, and the first time the schedule has been
> used, just in case you are convinced this has something to do with
> scheduled vs. non-scheduled job runs.
>
> I am going to add more instrumentation to see if there is any chance
> there's a problem further downstream.
>
> Karl
>
>
> On Fri, Apr 8, 2016 at 1:06 PM, Najman, Radko <radko.najman@merck.com>
> wrote:
>
>> Thanks a lot Karl!
>>
>> Here are the steps I did:
>>
>>    1. Run the job manually – it took a few hours.
>>    2. Manually run the same job as a “minimal” run – it was done in a minute.
>>    3. Set up a scheduled “minimal” run – it again took a few hours, as in
>>    the first step.
>>    4. Scheduled runs on the other days were fast, as in step 2.
>>
>> Thanks for your comments, I’ll continue on it on Monday.
>>
>> Have a nice weekend,
>> Radko
>>
>>
>>
>> From: Karl Wright <daddywri@gmail.com>
>> Reply-To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>> Date: Friday 8 April 2016 at 17:18
>> To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>> Subject: Re: Scheduled ManifoldCF jobs
>>
>> Also, going back in this thread a bit, let's make sure we are on the same
>> page:
>>
>> >>>>>>
>> I want to schedule these jobs for daily runs. I’m experiencing that the
>> first scheduled run takes the same time as I ran the job for the first time
>> manually. It seems it is recrawling all documents. Next scheduled runs are
>> fast, a few minutes. Is it expected behaviour?
>> <<<<<<
>>
>> If the first scheduled run is a complete crawl (meaning you did not
>> select the "Minimal" setting for the schedule record), you *can* expect the
>> job to look at all the documents.  The reason is that Documentum does
>> not give us any information about document deletions.  We have to figure
>> that out ourselves, and the only way to do it is to look at all the
>> individual documents.  The documents do not have to actually be crawled,
>> but the connector *does* need to at least assemble its version identifier
>> string, which requires an interaction with Documentum.
>>
>> So unless you have "Minimal" crawls selected everywhere, which won't ever
>> detect deletions, you have to live with the time spent looking for
>> deletions.  We recommend that you do this at least occasionally, but
>> certainly you wouldn't want to do it more than a couple times a month I
>> would think.
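
The deletion check described above can be pictured as a set difference between what was indexed before and what a complete scan sees. This is only a conceptual sketch, not ManifoldCF's implementation: the class name `DeletionScanSketch`, the method `findDeleted`, and the use of plain string identifiers are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Conceptual model only; names and types are hypothetical, not ManifoldCF APIs.
final class DeletionScanSketch {

    // Returns the identifiers that were indexed earlier but were not seen
    // during a complete scan of the repository.
    static Set<String> findDeleted(Set<String> previouslyIndexed, Set<String> seenThisRun) {
        Set<String> deleted = new HashSet<>(previouslyIndexed);
        deleted.removeAll(seenThisRun);
        return deleted;
    }

    public static void main(String[] args) {
        Set<String> indexed = new HashSet<>(Arrays.asList("doc-1", "doc-2", "doc-3"));
        Set<String> seen = new HashSet<>(Arrays.asList("doc-1", "doc-3"));
        // A complete scan sees every surviving document, so the difference is the deletions.
        System.out.println(findDeleted(indexed, seen));
    }
}
```

Under this model a "Minimal" run only visits documents reported as changed, so the "seen" set is incomplete and the difference cannot be trusted. That is one way to read why minimal runs never detect deletions.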
>>
>> Hope this helps.
>> Karl
>>
>>
>> On Fri, Apr 8, 2016 at 10:54 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> There's one slightly funky thing about the Documentum connector that
>>> tries to compensate for clock skew as follows:
>>>
>>> >>>>>>
>>>       // There seems to be some unexplained slop in the latest DCTM version.
>>>       // It misses documents depending on how close to the r_modify_date you happen to be.
>>>       // So, I've decreased the start time by a full five minutes, to insure overlap.
>>>       if (startTime > 300000L)
>>>         startTime = startTime - 300000L;
>>>       else
>>>         startTime = 0L;
>>>       StringBuilder strDQLend = new StringBuilder(" where r_modify_date >= " + buildDateString(startTime) +
>>>         " and r_modify_date<=" + buildDateString(seedTime) +
>>>         " AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND r_content_size>0");
>>>
>>> <<<<<<
>>>
>>> The 300000 ms adjustment is five minutes, which doesn't seem like a lot
>>> but maybe it is affecting your testing?
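
The window adjustment in the quoted snippet can be isolated as a small sketch. The class and method names (`SeedWindowSketch`, `adjustedStart`) are hypothetical and not part of the connector; only the 300000 ms constant and the clamp to zero are taken from the code quoted above.

```java
// Hypothetical helper; only the 300000 ms overlap and the clamp to zero
// come from the connector code quoted in this thread.
final class SeedWindowSketch {
    static final long OVERLAP_MS = 300000L; // five minutes in milliseconds

    // Moves the start of the seeding window back by the overlap, never below zero.
    static long adjustedStart(long startTime) {
        return startTime > OVERLAP_MS ? startTime - OVERLAP_MS : 0L;
    }

    public static void main(String[] args) {
        long previousSeedTime = 1460000000000L; // arbitrary example timestamp in ms
        System.out.println(adjustedStart(previousSeedTime));
        // Start times within the first five minutes of the epoch clamp to zero.
        System.out.println(adjustedStart(1000L));
    }
}
```

Read this way, any document modified in the five minutes before the previous seeding time is picked up again on the next run, so a test that reruns a job immediately should expect a small amount of re-seeding.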
>>>
>>> Karl
>>>
>>>
>>> On Fri, Apr 8, 2016 at 10:50 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Radko,
>>>>
>>>> There's no magic here; the seedingversion from the database is passed
>>>> to the connector method which seeds documents.  The only way this version
>>>> gets cleared is if you save the job and the document specification changes.
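
The rule stated above (the seeding version survives a save unless the document specification changes) can be modelled with a minimal sketch. This is an illustration under assumptions, not ManifoldCF's code: the class `JobRecordSketch`, its fields, and the use of plain strings for the specification and version are all hypothetical.

```java
import java.util.Objects;

// Illustrative model only; not the actual ManifoldCF job record.
final class JobRecordSketch {
    private String documentSpecification;
    private String seedingVersion; // null stands for "seed from the beginning of time"

    JobRecordSketch(String documentSpecification, String seedingVersion) {
        this.documentSpecification = documentSpecification;
        this.seedingVersion = seedingVersion;
    }

    // Saving with an unchanged specification keeps the seeding version;
    // saving with a changed specification clears it.
    void save(String newSpecification) {
        if (!Objects.equals(documentSpecification, newSpecification)) {
            documentSpecification = newSpecification;
            seedingVersion = null;
        }
    }

    String seedingVersion() {
        return seedingVersion;
    }

    public static void main(String[] args) {
        JobRecordSketch job = new JobRecordSketch("folder=/docs", "1460000000000");
        job.save("folder=/docs");   // schedule-only edits leave the specification alone
        System.out.println(job.seedingVersion());
        job.save("folder=/other");  // a specification change resets seeding
        System.out.println(job.seedingVersion());
    }
}
```

In this model, changing only the schedule would not reset seeding, which matches the behaviour the thread says to expect.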
>>>>
>>>> The only other possibility I can think of is that the Documentum
>>>> connector is ignoring the seedingversion information.  I will look into
>>>> this further over the weekend.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Apr 8, 2016 at 10:33 AM, Najman, Radko wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> thanks for your clarification.
>>>>>
>>>>> I’m not changing any document specification information. I just set
>>>>> “Scheduled time” and “Job invocation” on the “Scheduling” tab and
>>>>> “Start method” on the “Connection” tab, then click the “Save” button.
>>>>> That’s all.
>>>>>
>>>>> I tried to set all the scheduling information directly in the Postgres
>>>>> database to be sure I didn’t change any document specification
>>>>> information and the result was the same, all documents were recrawled.
>>>>>
>>>>> One more thing I tried was to update “seedingversion” in the “jobs”
>>>>> table, but again all documents were recrawled.
>>>>>
>>>>> Thanks,
>>>>> Radko
>>>>>
>>>>>
>>>>>
>>>>> From: Karl Wright <daddywri@gmail.com>
>>>>> Reply-To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>>>> Date: Friday 1 April 2016 at 14:30
>>>>> To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>>>> Subject: Re: Scheduled ManifoldCF jobs
>>>>>
>>>>> Sorry, that response was *almost* incoherent. :-)
>>>>>
>>>>> Trying again:
>>>>>
>>>>> As far as how MCF computes incremental changes, it does not matter
>>>>> whether a job is run on schedule, or manually.  But if you change certain
>>>>> aspects of the job, namely the document specification information, MCF
>>>>> "starts over" at the beginning of time.  It needs to do that because you
>>>>> might well have made changes to the document specification that could
>>>>> change the way documents are indexed.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Fri, Apr 1, 2016 at 6:36 AM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Radko,
>>>>>>
>>>>>> For computing how MCF does job crawling, it does not care whether the
>>>>>> job is run manually or by schedule.
>>>>>>
>>>>>> The issue is likely to be that you changed some other detail about
>>>>>> the job definition that might have affected how documents are indexed.
>>>>>> In that case, MCF would cause all documents to be recrawled because of
>>>>>> that.  Changes to a job's document specification information will cause
>>>>>> that to be the case.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 1, 2016 at 3:40 AM, Najman, Radko wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have a few jobs crawling documents from Documentum. Some of these
>>>>>>> jobs are quite big and the first run of the job takes a few hours or a
>>>>>>> day to finish. Then, when I do a “minimal run” for updates, the job is
>>>>>>> usually done in a few minutes.
>>>>>>>
>>>>>>> I want to schedule these jobs for daily runs. I’m experiencing that
>>>>>>> the first scheduled run takes the same time as I ran the job for the
>>>>>>> first time manually. It seems it is recrawling all documents. Next
>>>>>>> scheduled runs are fast, a few minutes. Is it expected behaviour? I
>>>>>>> would expect the first scheduled run to be fast too because the job was
>>>>>>> already finished before by manual start. Is there a way to avoid
>>>>>>> recrawling all documents in this case? It’s a really time-consuming
>>>>>>> operation.
>>>>>>>
>>>>>>> My settings:
>>>>>>> Schedule type: Scan every document once
>>>>>>> Job invocation: Minimal
>>>>>>> Scheduled time: once a day
>>>>>>> Start method: Start when schedule window starts
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Radko
>>>>>>>
>>>>>>
