manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Scheduled ManifoldCF jobs
Date Wed, 13 Apr 2016 14:59:28 GMT
Hi Radko,

>>>>>>
thanks. I tried the proposed patch but it didn’t work for me. After a few
more experiments I’ve found a workaround.
<<<<<<

Hmm.  I did not send you a patch.  I just offered to create a diagnostic
one.  So I don't quite know what you did here.

>>>>>>
If I set “Start method” on the “Connection” tab and save it, it results in
a full recrawl. I don’t know why it behaves this way; I didn’t have enough
time to look into the source code to see what happens when I click the
Save button.
<<<<<<

I don't see any code in there that could possibly cause this, but your
report is specific enough that I can try to confirm it (or not).

>>>>>>
I noticed another interesting thing. I use the “Start at beginning of
schedule window” method. If I set the scheduled time to every day at 1am
and make this change at 10am, I would expect the job to start at 1am the
next day, but it starts immediately. I think it should work this way for
“Start even inside a schedule window”, but for “Start at beginning of
schedule window” the job should start at the exact time. Is that correct,
or is my understanding of the start methods wrong?
<<<<<<

Your understanding is correct.  But there are integration tests that
verify this behavior, so once again I don't know why you are seeing this
when nobody else is.
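
For reference, the intended semantics are roughly the following (an
illustrative sketch only, not MCF's actual scheduler code; all names here
are hypothetical):

>>>>>>
// Sketch of the two start methods for a schedule window
// [windowStart, windowEnd), in milliseconds; all names hypothetical.
public class StartMethodSketch
{
  public static boolean mayStart(long now, long windowStart, long windowEnd,
    boolean startEvenInsideWindow)
  {
    if (now < windowStart || now >= windowEnd)
      return false;            // outside the window: never auto-start
    if (startEvenInsideWindow)
      return true;             // "Start even inside a schedule window"
    // "Start at beginning of schedule window": fire only as the window
    // opens (a real scheduler would compare against its tick interval
    // rather than use exact equality).
    return now == windowStart;
  }

  public static void main(String[] args)
  {
    long oneAM = 1L * 60 * 60 * 1000;   // window opens at 1am
    long tenAM = 10L * 60 * 60 * 1000;  // job saved and checked at 10am
    long end = oneAM + 24L * 60 * 60 * 1000;  // all-day window, for illustration
    System.out.println(mayStart(tenAM, oneAM, end, false)); // false: waits for 1am
    System.out.println(mayStart(tenAM, oneAM, end, true));  // true: inside window
  }
}
<<<<<<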

Karl




On Wed, Apr 13, 2016 at 10:49 AM, Najman, Radko <radko.najman@merck.com>
wrote:

> Hi Karl,
>
> thanks. I tried the proposed patch but it didn’t work for me. After a few
> more experiments I’ve found a workaround.
>
> It works as I expect if:
>
>    1. set the schedule time on the “Scheduling” tab in the UI and save it
>    2. set “Start method” by updating the Postgres “jobs” table (update
>    jobs set startmethod='B' where id=…)
>
> If I set “Start method” on the “Connection” tab and save it, it results
> in a full recrawl. I don’t know why it behaves this way; I didn’t have
> enough time to look into the source code to see what happens when I
> click the Save button.
>
> I noticed another interesting thing. I use the “Start at beginning of
> schedule window” method. If I set the scheduled time to every day at 1am
> and make this change at 10am, I would expect the job to start at 1am the
> next day, but it starts immediately. I think it should work this way for
> “Start even inside a schedule window”, but for “Start at beginning of
> schedule window” the job should start at the exact time. Is that
> correct, or is my understanding of the start methods wrong?
>
> I’m running Manifold 2.1.
>
> Thanks,
> Radko
>
>
>
> From: Karl Wright <daddywri@gmail.com>
> Reply-To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
> Date: Monday 11 April 2016 at 02:22
> To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
> Subject: Re: Scheduled ManifoldCF jobs
>
> Here's the logic around job save (which is what would be called if you
> updated the schedule):
>
> >>>>>>
>                 boolean isSame = pipelineManager.compareRows(id,jobDescription);
>                 if (!isSame)
>                 {
>                   int currentStatus = stringToStatus((String)row.getValue(statusField));
>                   if (currentStatus == STATUS_ACTIVE || currentStatus == STATUS_ACTIVESEEDING ||
>                     currentStatus == STATUS_ACTIVE_UNINSTALLED || currentStatus == STATUS_ACTIVESEEDING_UNINSTALLED)
>                     values.put(assessmentStateField,assessmentStateToString(ASSESSMENT_UNKNOWN));
>                 }
>
>                 if (isSame)
>                 {
>                   String oldDocSpecXML = (String)row.getValue(documentSpecField);
>                   if (!oldDocSpecXML.equals(newXML))
>                     isSame = false;
>                 }
>
>                 if (isSame)
>                   isSame = hopFilterManager.compareRows(id,jobDescription);
>
>                 if (!isSame)
>                   values.put(seedingVersionField,null);
> <<<<<<
>
> So, changes to the job pipeline, or changes to the document specification,
> or changes to the hop filtering all could reset the seedingVersion field,
> assuming that it is the job save operation that is causing the full crawl.
> At least, that is a good hypothesis.  If you think that none of these
> should be firing, then we will have to figure out which one it is and why.
>
> Unfortunately I don't have a connector I can use locally that uses
> versioning information.  I could write a test connector given time but it
> would not duplicate your pipeline environment etc.  It may be easier for
> you to just try it out in your environment with diagnostics in place.  This
> code is in JobManager.java, and I will need to know what version of MCF you
> have deployed.  I can create a ticket and attach a patch that has the
> needed diagnostics.  Please let me know if that will work for you.
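>
> The diagnostics would be along the following lines (a sketch only, not
> the actual patch; it assumes the crawler's Logging.jobs logger is
> available in JobManager.java):
>
> >>>>>>
>                 // Sketch: log which comparison is clearing the seeding
>                 // version (the Logging.jobs calls are an assumption).
>                 boolean isSame = pipelineManager.compareRows(id,jobDescription);
>                 if (!isSame)
>                   Logging.jobs.warn("Job "+id+": pipeline changed");
>
>                 if (isSame)
>                 {
>                   String oldDocSpecXML = (String)row.getValue(documentSpecField);
>                   if (!oldDocSpecXML.equals(newXML))
>                   {
>                     isSame = false;
>                     Logging.jobs.warn("Job "+id+": document specification changed");
>                   }
>                 }
>
>                 if (isSame)
>                 {
>                   isSame = hopFilterManager.compareRows(id,jobDescription);
>                   if (!isSame)
>                     Logging.jobs.warn("Job "+id+": hop filters changed");
>                 }
>
>                 if (!isSame)
>                   values.put(seedingVersionField,null);
> <<<<<<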
>
> Thanks,
> Karl
>
>
> On Fri, Apr 8, 2016 at 2:31 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Even further downstream, it still all looks good:
>>
>> >>>>>>
>> Jetty started.
>> Starting crawler...
>> Scheduled job start; requestMinimum = true
>> Starting job with requestMinimum = true
>> When starting the job, requestMinimum = true
>> <<<<<<
>>
>> So at the moment I am at a loss.
>>
>> Karl
>>
>>
>> On Fri, Apr 8, 2016 at 2:20 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Radko,
>>>
>>> I set the same settings you did and instrumented the code.  It records
>>> the minimum job request:
>>>
>>> >>>>>>
>>> Jetty started.
>>> Starting crawler...
>>> Scheduled job start; requestMinimum = true
>>> Starting job with requestMinimum = true
>>> <<<<<<
>>>
>>> This is the first run of the job, and the first time the schedule has
>>> been used, just in case you are convinced this has something to do with
>>> scheduled vs. non-scheduled job runs.
>>>
>>> I am going to add more instrumentation to see if there is any chance
>>> there's a problem further downstream.
>>>
>>> Karl
>>>
>>>
>>> On Fri, Apr 8, 2016 at 1:06 PM, Najman, Radko wrote:
>>>
>>>> Thanks a lot Karl!
>>>>
>>>> Here are the steps I took:
>>>>
>>>>    1. Run the job manually – it took a few hours.
>>>>    2. Manually do a “minimal” run of the same job – it was done in a
>>>>    minute.
>>>>    3. Set up a scheduled “minimal” run – it again took a few hours, as
>>>>    in the first step.
>>>>    4. Scheduled runs on the other days were fast, as in step 2.
>>>>
>>>> Thanks for your comments; I’ll continue with this on Monday.
>>>>
>>>> Have a nice weekend,
>>>> Radko
>>>>
>>>>
>>>>
>>>> From: Karl Wright <daddywri@gmail.com>
>>>> Reply-To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>>> Date: Friday 8 April 2016 at 17:18
>>>> To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>>> Subject: Re: Scheduled ManifoldCF jobs
>>>>
>>>> Also, going back in this thread a bit, let's make sure we are on the
>>>> same page:
>>>>
>>>> >>>>>>
>>>> I want to schedule these jobs for daily runs. I’m finding that the
>>>> first scheduled run takes the same time as when I ran the job for the
>>>> first time manually. It seems it is recrawling all documents.
>>>> Subsequent scheduled runs are fast, a few minutes. Is this expected
>>>> behaviour?
>>>> <<<<<<
>>>>
>>>> If the first scheduled run is a complete crawl (meaning you did not
>>>> select the "Minimal" setting for the schedule record), you *can* expect the
>>>> job to look at all the documents.  The reason is that Documentum does
>>>> not give us any information about document deletions.  We have to figure
>>>> that out ourselves, and the only way to do it is to look at all the
>>>> individual documents.  The documents do not have to actually be crawled,
>>>> but the connector *does* need to at least assemble its version identifier
>>>> string, which requires an interaction with Documentum.
>>>>
>>>> So unless you have "Minimal" crawls selected everywhere, which won't
>>>> ever detect deletions, you have to live with the time spent looking for
>>>> deletions.  We recommend that you do this at least occasionally, but
>>>> certainly you wouldn't want to do it more than a couple of times a
>>>> month, I would think.
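>>>>
>>>> For example, one reasonable layout (a sketch of the idea, using the
>>>> standard "Job invocation" options) would be two schedule records:
>>>>
>>>>    1. A daily record with "Job invocation: Minimal", for fast
>>>>    incremental updates.
>>>>    2. A record once or twice a month without "Minimal" selected, so
>>>>    that deletions get picked up.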
>>>>
>>>> Hope this helps.
>>>> Karl
>>>>
>>>>
>>>> On Fri, Apr 8, 2016 at 10:54 AM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> There's one slightly funky thing about the Documentum connector: it
>>>>> tries to compensate for clock skew, as follows:
>>>>>
>>>>> >>>>>>
>>>>>       // There seems to be some unexplained slop in the latest DCTM
>>>>>       // version.  It misses documents depending on how close to the
>>>>>       // r_modify_date you happen to be.
>>>>>       // So, I've decreased the start time by a full five minutes, to
>>>>>       // insure overlap.
>>>>>       if (startTime > 300000L)
>>>>>         startTime = startTime - 300000L;
>>>>>       else
>>>>>         startTime = 0L;
>>>>>       StringBuilder strDQLend = new StringBuilder(" where r_modify_date >= " + buildDateString(startTime) +
>>>>>         " and r_modify_date<=" + buildDateString(seedTime) +
>>>>>         " AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND r_content_size>0");
>>>>>
>>>>> <<<<<<
>>>>>
>>>>> The 300000 ms adjustment is five minutes, which doesn't seem like a
>>>>> lot, but maybe it is affecting your testing?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Fri, Apr 8, 2016 at 10:50 AM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Radko,
>>>>>>
>>>>>> There's no magic here; the seedingversion from the database is passed
>>>>>> to the connector method which seeds documents.  The only way this
>>>>>> version gets cleared is if you save the job and the document
>>>>>> specification changes.
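>>>>>>
>>>>>> For context, the seeding version reaches the connector through the
>>>>>> repository connector's addSeedDocuments method, roughly like this
>>>>>> (a sketch with an illustrative body; the exact signature may vary by
>>>>>> MCF version):
>>>>>>
>>>>>> >>>>>>
>>>>>> public String addSeedDocuments(ISeedingActivity activities,
>>>>>>   Specification spec, String lastSeedVersion, long seedTime, int jobMode)
>>>>>>   throws ManifoldCFException, ServiceInterruption
>>>>>> {
>>>>>>   // lastSeedVersion == null means the stored version was cleared
>>>>>>   // (e.g. by a job save), so the connector must seed everything.
>>>>>>   // Otherwise it should seed only documents changed since that
>>>>>>   // version, and return the new version string to be stored.
>>>>>>   return String.valueOf(seedTime);
>>>>>> }
>>>>>> <<<<<<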
>>>>>>
>>>>>> The only other possibility I can think of is that the Documentum
>>>>>> connector is ignoring the seedingversion information.  I will look
>>>>>> into this further over the weekend.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Fri, Apr 8, 2016 at 10:33 AM, Najman, Radko wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> thanks for your clarification.
>>>>>>>
>>>>>>> I’m not changing any document specification information. I just set
>>>>>>> “Scheduled time” and “Job invocation” on the “Scheduling” tab and
>>>>>>> “Start method” on the “Connection” tab, and click the “Save” button.
>>>>>>> That’s all.
>>>>>>>
>>>>>>> I tried to set all the scheduling information directly in the
>>>>>>> Postgres database, to be sure I didn’t change any document
>>>>>>> specification information, and the result was the same: all
>>>>>>> documents were recrawled.
>>>>>>>
>>>>>>> One more thing I tried was to update “seedingversion” in the
>>>>>>> “jobs” table, but again all documents were recrawled.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Radko
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> From: Karl Wright <daddywri@gmail.com>
>>>>>>> Reply-To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>>>>>> Date: Friday 1 April 2016 at 14:30
>>>>>>> To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>>>>>> Subject: Re: Scheduled ManifoldCF jobs
>>>>>>>
>>>>>>> Sorry, that response was *almost* incoherent. :-)
>>>>>>>
>>>>>>> Trying again:
>>>>>>>
>>>>>>> As far as how MCF computes incremental changes, it does not matter
>>>>>>> whether a job is run on schedule or manually.  But if you change
>>>>>>> certain aspects of the job, namely the document specification
>>>>>>> information, MCF "starts over" at the beginning of time.  It needs
>>>>>>> to do that because you might well have made changes to the document
>>>>>>> specification that could change the way documents are indexed.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 1, 2016 at 6:36 AM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Radko,
>>>>>>>>
>>>>>>>> For the purposes of crawling, MCF does not care whether the job is
>>>>>>>> run manually or by schedule.
>>>>>>>>
>>>>>>>> The issue is likely to be that you changed some other detail about
>>>>>>>> the job definition that might have affected how documents are
>>>>>>>> indexed.  In that case, MCF would cause all documents to be
>>>>>>>> recrawled.  Changes to a job's document specification information
>>>>>>>> will cause exactly that.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Apr 1, 2016 at 3:40 AM, Najman, Radko wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I have a few jobs crawling documents from Documentum. Some of
>>>>>>>>> these jobs are quite big and the first run of the job takes a few
>>>>>>>>> hours or a day to finish. Then, when I do a “minimal run” for
>>>>>>>>> updates, the job is usually done in a few minutes.
>>>>>>>>>
>>>>>>>>> I want to schedule these jobs for daily runs. I’m finding that the
>>>>>>>>> first scheduled run takes the same time as when I ran the job for
>>>>>>>>> the first time manually. It seems it is recrawling all documents.
>>>>>>>>> Subsequent scheduled runs are fast, a few minutes. Is this
>>>>>>>>> expected behaviour? I would expect the first scheduled run to be
>>>>>>>>> fast too, because the job had already been completed by a manual
>>>>>>>>> run. Is there a way to avoid recrawling all documents in this
>>>>>>>>> case? It’s a really time-consuming operation.
>>>>>>>>>
>>>>>>>>> My settings:
>>>>>>>>> Schedule type: Scan every document once
>>>>>>>>> Job invocation: Minimal
>>>>>>>>> Scheduled time: once a day
>>>>>>>>> Start method: Start when schedule window starts
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Radko
>>>>>>>>>
