manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Scheduled ManifoldCF jobs
Date Wed, 13 Apr 2016 17:42:37 GMT
Hi Radko,

I was able to confirm that saving a job after switching it to start at the
beginning of a schedule window does NOT reset the seeding version.  Nor does
turning that off, or changing the schedule:

>>>>>>
Jetty started.
Starting crawler...
NOT setting version field to null
NOT setting version field to null
NOT setting version field to null
<<<<<<

The code I used to test this was as follows:

>>>>>>
                if (!isSame) {
                  System.out.println("Setting version field to null");
                  values.put(seedingVersionField,null);
                } else {
                  System.out.println("NOT setting version field to null");
                }
<<<<<<

I don't know what to conclude from this.  My code here seems to be working
perfectly.
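The decision being instrumented reduces to three comparisons made at job save time. A standalone sketch of that logic (the boolean flags are hypothetical stand-ins for pipelineManager.compareRows, the document-spec XML comparison, and hopFilterManager.compareRows; this is not the actual JobManager code):

```java
// Simplified sketch, not the actual JobManager code: when a job save
// resets the seeding version (forcing a full re-seed on the next run).
public class SeedingResetSketch {

  // The three comparisons JobManager makes on save; the flags here are
  // hypothetical stand-ins for pipelineManager.compareRows(), the
  // document-spec XML comparison, and hopFilterManager.compareRows().
  static boolean jobSaveResetsSeedingVersion(boolean pipelineChanged,
                                             boolean docSpecChanged,
                                             boolean hopFiltersChanged) {
    boolean isSame = !pipelineChanged;
    if (isSame)
      isSame = !docSpecChanged;
    if (isSame)
      isSame = !hopFiltersChanged;
    // In JobManager, !isSame is the condition that nulls seedingVersionField.
    return !isSame;
  }

  public static void main(String[] args) {
    // Changing only the schedule touches none of the three, so no reset:
    System.out.println(jobSaveResetsSeedingVersion(false, false, false)); // false
    // Changing the document specification does reset it:
    System.out.println(jobSaveResetsSeedingVersion(false, true, false));  // true
  }
}
```

The point of the sketch: schedule fields are not among the compared items, which matches the "NOT setting version field to null" output above.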
Karl




On Wed, Apr 13, 2016 at 10:59 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Radko,
>
> >>>>>>
> thanks. I tried the proposed patch but it didn’t work for me. After a few
> more experiments I’ve found a workaround.
> <<<<<<
>
> Hmm.  I did not send you a patch.  I just offered to create a diagnostic
> one.  So I don't know quite what you did here.
>
> >>>>>>
> If I set “Start method” on the “Connection” tab and save it, it results in
> a full recrawl. I don’t know why it is behaving this way; I didn’t have
> enough time to look into the source code to see what happens when I click
> the save button.
> <<<<<<
>
> I don't see any code in there that could possibly cause this, but it is
> specific enough that I can confirm it (or not).
>
> >>>>>>
> I noticed another interesting thing. I use the “Start at beginning of schedule
> window” method. If I set the scheduled time to every day at 1am and make
> this change at 10am, I would expect the job to start at 1am the next day, but
> it starts immediately. I think it should work this way for “Start even
> inside a schedule window”, but for “Start at beginning of schedule window”
> the job should start at the exact time. Is that correct, or is my
> understanding of start methods wrong?
> <<<<<<
>
> Your understanding is correct.  But there are integration tests that test
> that this is working correctly, so once again I don't know why you are
> seeing this and nobody else is.
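The intended start-method semantics under discussion can be sketched as follows (illustrative only: the enum, method names, and the one-minute start tolerance are assumptions, not MCF's actual scheduler code):

```java
// Illustrative sketch of the two start methods discussed above.
public class StartMethodSketch {

  enum StartMethod { BEGIN_OF_WINDOW, INSIDE_WINDOW }

  // windowStart/windowEnd: millis of today's schedule window; now: current time.
  static boolean shouldStart(StartMethod method, long windowStart,
                             long windowEnd, long now) {
    switch (method) {
      case BEGIN_OF_WINDOW:
        // Start only at (or just past) the window's opening moment;
        // the one-minute tolerance is an assumption for the sketch.
        return now >= windowStart && now < windowStart + 60000L;
      case INSIDE_WINDOW:
        // Start any time while the window is open.
        return now >= windowStart && now < windowEnd;
      default:
        return false;
    }
  }

  public static void main(String[] args) {
    long oneAm = 3600000L;      // 1:00 am as millis since midnight
    long twoAm = 7200000L;      // window closes at 2:00 am
    long tenAm = 36000000L;     // 10:00 am, when the job was saved
    // Saving at 10am should NOT trigger a begin-of-window start:
    System.out.println(shouldStart(StartMethod.BEGIN_OF_WINDOW, oneAm, twoAm, tenAm)); // false
  }
}
```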
>
> Karl
>
>
>
>
> On Wed, Apr 13, 2016 at 10:49 AM, Najman, Radko <radko.najman@merck.com>
> wrote:
>
>> Hi Karl,
>>
>> thanks. I tried the proposed patch but it didn’t work for me. After a few
>> more experiments I’ve found a workaround.
>>
>> It works as I expect if:
>>
>>    1. set the schedule time on the “Scheduling” tab in the UI and save it
>>    2. set “Start method” by updating the Postgres “jobs” table (update jobs
>>    set startmethod='B' where id=…)
>>
>> If I set “Start method” on the “Connection” tab and save it, it results in
>> a full recrawl. I don’t know why it is behaving this way; I didn’t have
>> enough time to look into the source code to see what happens when I click
>> the save button.
>>
>> I noticed another interesting thing. I use the “Start at beginning of
>> schedule window” method. If I set the scheduled time to every day at
>> 1am and make this change at 10am, I would expect the job to start at 1am
>> the next day, but it starts immediately. I think it should work this way
>> for “Start even inside a schedule window”, but for “Start at beginning of
>> schedule window” the job should start at the exact time. Is that correct,
>> or is my understanding of start methods wrong?
>>
>> I’m running Manifold 2.1.
>>
>> Thanks,
>> Radko
>>
>>
>>
>> From: Karl Wright <daddywri@gmail.com>
>> Reply-To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>> Date: Monday 11 April 2016 at 02:22
>> To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>> Subject: Re: Scheduled ManifoldCF jobs
>>
>> Here's the logic around job save (which is what would be called if you
>> updated the schedule):
>>
>> >>>>>>
>>                 boolean isSame = pipelineManager.compareRows(id,jobDescription);
>>                 if (!isSame)
>>                 {
>>                   int currentStatus = stringToStatus((String)row.getValue(statusField));
>>                   if (currentStatus == STATUS_ACTIVE || currentStatus == STATUS_ACTIVESEEDING ||
>>                     currentStatus == STATUS_ACTIVE_UNINSTALLED || currentStatus == STATUS_ACTIVESEEDING_UNINSTALLED)
>>                     values.put(assessmentStateField,assessmentStateToString(ASSESSMENT_UNKNOWN));
>>                 }
>>
>>                 if (isSame)
>>                 {
>>                   String oldDocSpecXML = (String)row.getValue(documentSpecField);
>>                   if (!oldDocSpecXML.equals(newXML))
>>                     isSame = false;
>>                 }
>>
>>                 if (isSame)
>>                   isSame = hopFilterManager.compareRows(id,jobDescription);
>>
>>                 if (!isSame)
>>                   values.put(seedingVersionField,null);
>> <<<<<<
>>
>> So, changes to the job pipeline, or changes to the document
>> specification, or changes to the hop filtering all could reset the
>> seedingVersion field, assuming that it is the job save operation that is
>> causing the full crawl.  At least, that is a good hypothesis.  If you think
>> that none of these should be firing then we will have to figure out which
>> one it is and why.
>>
>> Unfortunately I don't have a connector I can use locally that uses
>> versioning information.  I could write a test connector given time but it
>> would not duplicate your pipeline environment etc.  It may be easier for
>> you to just try it out in your environment with diagnostics in place.  This
>> code is in JobManager.java, and I will need to know what version of MCF you
>> have deployed.  I can create a ticket and attach a patch that has the
>> needed diagnostics.  Please let me know if that will work for you.
>>
>> Thanks,
>> Karl
>>
>>
>> On Fri, Apr 8, 2016 at 2:31 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Even further downstream, it still all looks good:
>>>
>>> >>>>>>
>>> Jetty started.
>>> Starting crawler...
>>> Scheduled job start; requestMinimum = true
>>> Starting job with requestMinimum = true
>>> When starting the job, requestMinimum = true
>>> <<<<<<
>>>
>>> So at the moment I am at a loss.
>>>
>>> Karl
>>>
>>>
>>> On Fri, Apr 8, 2016 at 2:20 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Radko,
>>>>
>>>> I set the same settings you did and instrumented the code.  It records
>>>> the minimum job request:
>>>>
>>>> >>>>>>
>>>> Jetty started.
>>>> Starting crawler...
>>>> Scheduled job start; requestMinimum = true
>>>> Starting job with requestMinimum = true
>>>> <<<<<<
>>>>
>>>> This is the first run of the job, and the first time the schedule has
>>>> been used, just in case you are convinced this has something to do with
>>>> scheduled vs. non-scheduled job runs.
>>>>
>>>> I am going to add more instrumentation to see if there is any chance
>>>> there's a problem further downstream.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Fri, Apr 8, 2016 at 1:06 PM, Najman, Radko wrote:
>>>>
>>>>> Thanks a lot Karl!
>>>>>
>>>>> Here are the steps I did:
>>>>>
>>>>>    1. Run the job manually – it took a few hours.
>>>>>    2. Manually do a “minimal” run of the same job – it was done in a minute.
>>>>>    3. Set up a scheduled “minimal” run – it took a few hours again, as in
>>>>>    the first step.
>>>>>    4. Scheduled runs on the other days were fast as in step 2.
>>>>>
>>>>> Thanks for your comments, I’ll continue on it on Monday.
>>>>>
>>>>> Have a nice weekend,
>>>>> Radko
>>>>>
>>>>>
>>>>>
>>>>> From: Karl Wright <daddywri@gmail.com>
>>>>> Reply-To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>>>> Date: Friday 8 April 2016 at 17:18
>>>>> To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>>>> Subject: Re: Scheduled ManifoldCF jobs
>>>>>
>>>>> Also, going back in this thread a bit, let's make sure we are on the
>>>>> same page:
>>>>>
>>>>> >>>>>>
>>>>> I want to schedule these jobs for daily runs. I’m experiencing that
>>>>> the first scheduled run takes the same time as when I ran the job for
>>>>> the first time manually. It seems it is recrawling all documents. Next
>>>>> scheduled runs are fast, a few minutes. Is it expected behaviour?
>>>>> <<<<<<
>>>>>
>>>>> If the first scheduled run is a complete crawl (meaning you did not
>>>>> select the "Minimal" setting for the schedule record), you *can* expect
>>>>> the job to look at all the documents.  The reason is that Documentum
>>>>> does not give us any information about document deletions.  We have to
>>>>> figure that out ourselves, and the only way to do it is to look at all
>>>>> the individual documents.  The documents do not have to actually be
>>>>> crawled, but the connector *does* need to at least assemble its version
>>>>> identifier string, which requires an interaction with Documentum.
>>>>>
>>>>> So unless you have "Minimal" crawls selected everywhere, which won't
>>>>> ever detect deletions, you have to live with the time spent looking for
>>>>> deletions.  We recommend that you do this at least occasionally, but
>>>>> certainly you wouldn't want to do it more than a couple of times a
>>>>> month, I would think.
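The deletion-detection constraint described above amounts to a set difference between what was indexed earlier and what the repository still reports as present. A minimal sketch with illustrative names (not connector code):

```java
// Sketch of why deletion detection requires touching every document:
// the repository reports adds/changes but not deletes, so the crawler
// must enumerate everything still present and diff that against what
// it indexed earlier. All names here are illustrative.
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DeletionDetectionSketch {

  // previouslyIndexed: doc id -> version string recorded at the last crawl.
  // currentlyPresent: ids the repository reports as still existing.
  static Set<String> findDeletedDocs(Map<String, String> previouslyIndexed,
                                     Set<String> currentlyPresent) {
    Set<String> deleted = new HashSet<>(previouslyIndexed.keySet());
    deleted.removeAll(currentlyPresent); // anything no longer present was deleted
    return deleted;
  }

  public static void main(String[] args) {
    Map<String, String> indexed = Map.of("doc1", "v1", "doc2", "v3", "doc3", "v2");
    Set<String> present = Set.of("doc1", "doc3");
    System.out.println(findDeletedDocs(indexed, present)); // [doc2]
  }
}
```

Building `currentlyPresent` is the expensive part: it requires at least one repository interaction per document, which is why non-minimal runs take hours even when little has changed.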
>>>>>
>>>>> Hope this helps.
>>>>> Karl
>>>>>
>>>>>
>>>>> On Fri, Apr 8, 2016 at 10:54 AM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> There's one slightly funky thing about the Documentum connector that
>>>>>> tries to compensate for clock skew as follows:
>>>>>>
>>>>>> >>>>>>
>>>>>>       // There seems to be some unexplained slop in the latest DCTM
>>>>>>       // version.  It misses documents depending on how close to the
>>>>>>       // r_modify_date you happen to be.
>>>>>>       // So, I've decreased the start time by a full five minutes, to
>>>>>>       // insure overlap.
>>>>>>       if (startTime > 300000L)
>>>>>>         startTime = startTime - 300000L;
>>>>>>       else
>>>>>>         startTime = 0L;
>>>>>>       StringBuilder strDQLend = new StringBuilder(" where r_modify_date >= " + buildDateString(startTime) +
>>>>>>         " and r_modify_date<=" + buildDateString(seedTime) +
>>>>>>         " AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND r_content_size>0");
>>>>>>
>>>>>> <<<<<<
>>>>>>
>>>>>> The 300000 ms adjustment is five minutes, which doesn't seem like a
>>>>>> lot, but maybe it is affecting your testing?
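Taken on its own, the adjustment quoted above is just this (a minimal standalone sketch; the surrounding DQL construction and buildDateString are omitted):

```java
// Minimal sketch of the five-minute overlap adjustment quoted above.
public class ClockSkewSketch {
  static final long SLOP_MS = 300000L; // 5 minutes in milliseconds

  static long adjustedStartTime(long startTime) {
    // Back the window start up by five minutes to compensate for
    // r_modify_date slop, clamping at the epoch.
    if (startTime > SLOP_MS)
      return startTime - SLOP_MS;
    return 0L;
  }

  public static void main(String[] args) {
    System.out.println(adjustedStartTime(1000000L)); // 700000
    System.out.println(adjustedStartTime(100000L));  // 0
  }
}
```

The practical consequence: documents modified in the five minutes before the recorded start time are deliberately re-seeded on the next run.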
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 8, 2016 at 10:50 AM, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Radko,
>>>>>>>
>>>>>>> There's no magic here; the seedingversion from the database is
>>>>>>> passed to the connector method which seeds documents.  The only way
>>>>>>> this version gets cleared is if you save the job and the document
>>>>>>> specification changes.
>>>>>>>
>>>>>>> The only other possibility I can think of is that the documentum
>>>>>>> connector is ignoring the seedingversion information.  I will look
>>>>>>> into this further over the weekend.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 8, 2016 at 10:33 AM, Najman, Radko wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> thanks for your clarification.
>>>>>>>>
>>>>>>>> I’m not changing any document specification information. I just set
>>>>>>>> “Scheduled time” and “Job invocation” on the “Scheduling” tab and
>>>>>>>> “Start method” on the “Connection” tab, and click the “Save” button.
>>>>>>>> That’s all.
>>>>>>>>
>>>>>>>> I tried to set all the scheduling information directly in the
>>>>>>>> Postgres database to be sure I didn’t change any document
>>>>>>>> specification information, and the result was the same: all
>>>>>>>> documents were recrawled.
>>>>>>>>
>>>>>>>> One more thing I tried was to update “seedingversion” in the
>>>>>>>> “jobs” table, but again all documents were recrawled.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Radko
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> From: Karl Wright <daddywri@gmail.com>
>>>>>>>> Reply-To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>>>>>>> Date: Friday 1 April 2016 at 14:30
>>>>>>>> To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>>>>>>> Subject: Re: Scheduled ManifoldCF jobs
>>>>>>>>
>>>>>>>> Sorry, that response was *almost* incoherent. :-)
>>>>>>>>
>>>>>>>> Trying again:
>>>>>>>>
>>>>>>>> As far as how MCF computes incremental changes, it does not matter
>>>>>>>> whether a job is run on schedule or manually.  But if you change
>>>>>>>> certain aspects of the job, namely the document specification
>>>>>>>> information, MCF "starts over" at the beginning of time.  It needs
>>>>>>>> to do that because you might well have made changes to the document
>>>>>>>> specification that could change the way documents are indexed.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Apr 1, 2016 at 6:36 AM, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Radko,
>>>>>>>>>
>>>>>>>>> For computing how MCF does job crawling, it does not care
>>>>>>>>> whether the job is run manually or by schedule.
>>>>>>>>>
>>>>>>>>> The issue is likely to be that you changed some other detail
>>>>>>>>> about the job definition that might have affected how documents
>>>>>>>>> are indexed.  In that case, MCF would cause all documents to be
>>>>>>>>> recrawled.  Changes to a job's document specification
>>>>>>>>> information will cause that to be the case.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Apr 1, 2016 at 3:40 AM, Najman, Radko wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I have a few jobs crawling documents from Documentum. Some of
>>>>>>>>>> these jobs are quite big and the first run of the job takes a
>>>>>>>>>> few hours or a day to finish. Then, when I do a “minimal run”
>>>>>>>>>> for updates, the job is usually done in a few minutes.
>>>>>>>>>>
>>>>>>>>>> I want to schedule these jobs for daily runs. I’m experiencing
>>>>>>>>>> that the first scheduled run takes the same time as when I ran
>>>>>>>>>> the job for the first time manually. It seems it is recrawling
>>>>>>>>>> all documents. Next scheduled runs are fast, a few minutes. Is
>>>>>>>>>> it expected behaviour? I would expect the first scheduled run
>>>>>>>>>> to be fast too because the job already finished before by a
>>>>>>>>>> manual start. Is there a way to avoid recrawling all documents
>>>>>>>>>> in this case? It’s a really time-consuming operation.
>>>>>>>>>>
>>>>>>>>>> My settings:
>>>>>>>>>> Schedule type: Scan every document once
>>>>>>>>>> Job invocation: Minimal
>>>>>>>>>> Scheduled time: once a day
>>>>>>>>>> Start method: Start when schedule window starts
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>> Radko
>>>>>>>>>>
>>
>
>
