Subject: Re: Scheduled ManifoldCF jobs
From: Karl Wright
To: "user@manifoldcf.apache.org"
Date: Wed, 13 Apr 2016 13:42:37 -0400

Hi Radko,

I was able to confirm that saving a job when turning on crawling from the start of a schedule window does NOT reset the seeding version. Nor does turning it off or changing the schedule:

>>>>>>
Jetty started.
Starting crawler...
NOT setting version field to null
NOT setting version field to null
NOT setting version field to null
<<<<<<

The code I used to test this was as follows:

>>>>>>
if (!isSame) {
  System.out.println("Setting version field to null");
  values.put(seedingVersionField,null);
} else {
  System.out.println("NOT setting version field to null");
}
<<<<<<

I don't know what to conclude from this. My code here seems to be working perfectly.
Karl




On Wed, Apr 13, 2016 at 10:59 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Radko,

>>>>>>
thanks. I tried the proposed patch but it didn’t work for me. After a few more experiments I’ve found a workaround.
<<<<<<

Hmm. I did not send you a patch. I just offered to create a diagnostic one. So I don't know quite what you did here.

>>>>>>
If I set “Start method” on the “Connection” tab and save it, it results in a full recrawl. I don’t know why it is behaving this way; I haven’t had enough time to look into the source code to see what happens when I click the Save button.
<<<<<<

I don't see any code in there that could possibly cause this, but it is specific enough that I can confirm it (or not).

>>>>>>
I noticed another interesting thing. I use the “Start at beginning of schedule window” method. If I set the scheduled time for every day at 1am and I make this change at 10am, I would expect the job to start at 1am the next day, but it starts immediately. I think it should work this way for “Start even inside a schedule window”, but for “Start at beginning of schedule window” the job should start at the exact time. Is that correct, or is my understanding of the start methods wrong?
<<<<<<

Your understanding is correct. But there are integration tests that test that this is working correctly, so once again I don't know why you are seeing this and nobody else is.
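
For illustration only, the difference between the two start methods discussed above could be sketched as follows. This is not ManifoldCF's scheduler code: the class `StartMethodSketch`, the enum constants, and the assumption of a single daily window that does not cross midnight are all invented for the example.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Hypothetical sketch; names and window model are assumptions, not MCF APIs.
class StartMethodSketch {
    enum StartMethod { WINDOW_BEGIN, WINDOW_INSIDE }

    /**
     * Earliest time at or after 'now' at which a daily-scheduled job may start.
     * Assumes one window per day that does not cross midnight.
     */
    static LocalDateTime nextStart(LocalDateTime now, LocalTime windowStart,
                                   Duration windowLength, StartMethod method) {
        LocalDateTime todayStart = now.toLocalDate().atTime(windowStart);
        LocalDateTime todayEnd = todayStart.plus(windowLength);
        // "Start even inside a schedule window": an open window permits an immediate start.
        if (method == StartMethod.WINDOW_INSIDE
                && !now.isBefore(todayStart) && now.isBefore(todayEnd)) {
            return now;
        }
        // Otherwise wait for the next window opening: today if not yet passed, else tomorrow.
        return now.isAfter(todayStart) ? todayStart.plusDays(1) : todayStart;
    }

    public static void main(String[] args) {
        LocalDateTime tenAm = LocalDateTime.of(2016, 4, 13, 10, 0);
        LocalTime oneAm = LocalTime.of(1, 0);
        Duration twelveHours = Duration.ofHours(12);
        System.out.println(nextStart(tenAm, oneAm, twelveHours, StartMethod.WINDOW_BEGIN));
        System.out.println(nextStart(tenAm, oneAm, twelveHours, StartMethod.WINDOW_INSIDE));
    }
}
```

Under these assumptions, a change saved at 10am with a 1am schedule yields the next day's 1am for the "beginning of window" method, and an immediate start only for the "inside window" method while the window is still open.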

Karl

On Wed, Apr 13, 2016 at 10:49 AM, Najman, Radko <radko.najman@merck.com> wrote:
Hi Karl,

thanks. I tried the proposed patch but it didn’t work for me. After a few more experiments I’ve found a workaround.

It works as I expect if I:
  1. set the schedule time on the “Scheduling” tab in the UI and save it
  2. set “Start method” by updating the Postgres “jobs” table (update jobs set startmethod='B' where id=…)
If I set “Start method” on the “Connection” tab and save it, it results in a full recrawl. I don’t know why it is behaving this way; I haven’t had enough time to look into the source code to see what happens when I click the Save button.

I noticed another interesting thing. I use the “Start at beginning of schedule window” method. If I set the scheduled time for every day at 1am and I make this change at 10am, I would expect the job to start at 1am the next day, but it starts immediately. I think it should work this way for “Start even inside a schedule window”, but for “Start at beginning of schedule window” the job should start at the exact time. Is that correct, or is my understanding of the start methods wrong?

I’m running Manifold 2.1.

Thanks,
Radko


From: Karl Wright <daddywri@gmail.com>
Reply-To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Date: Monday 11 April 2016 at 02:22
To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Subject: Re: Scheduled ManifoldCF jobs

Here's the logic around job save (which is what would be called if you updated the schedule):

>>>>>>
boolean isSame = pipelineManager.compareRows(id,jobDescription);
if (!isSame)
{
  int currentStatus = stringToStatus((String)row.getValue(statusField));
  if (currentStatus == STATUS_ACTIVE || currentStatus == STATUS_ACTIVESEEDING ||
    currentStatus == STATUS_ACTIVE_UNINSTALLED || currentStatus == STATUS_ACTIVESEEDING_UNINSTALLED)
    values.put(assessmentStateField,assessmentStateToString(ASSESSMENT_UNKNOWN));
}

if (isSame)
{
  String oldDocSpecXML = (String)row.getValue(documentSpecField);
  if (!oldDocSpecXML.equals(newXML))
    isSame = false;
}

if (isSame)
  isSame = hopFilterManager.compareRows(id,jobDescription);

if (!isSame)
  values.put(seedingVersionField,null);
<<<<<<

So, changes to the job pipeline, or changes to the document specification, or changes to the hop filtering all could reset the seedingVersion field, assuming that it is the job save operation that is causing the full crawl. At least, that is a good hypothesis. If you think that none of these should be firing then we will have to figure out which one it is and why.
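
As a way to reason about which of the three comparisons fires, the decision above can be restated as a stand-alone function. This is only a sketch: `SeedingVersionSketch` and its method are hypothetical, the two `compareRows` results are passed in as plain booleans, and `Objects.equals` is used instead of the original `equals` call purely to keep the example null-safe.

```java
import java.util.Objects;

// Hypothetical stand-alone restatement of the job-save decision; not the JobManager code.
class SeedingVersionSketch {
    /**
     * Returns true when a job save would clear the seeding version, i.e. when
     * the pipeline, the document specification, or the hop filters differ.
     */
    static boolean shouldResetSeedingVersion(boolean pipelineSame,
                                             String oldDocSpecXML,
                                             String newDocSpecXML,
                                             boolean hopFiltersSame) {
        boolean isSame = pipelineSame;
        // The document specification is only compared if the pipeline matched.
        if (isSame && !Objects.equals(oldDocSpecXML, newDocSpecXML)) {
            isSame = false;
        }
        // Hop filters are only compared if everything so far matched.
        if (isSame) {
            isSame = hopFiltersSame;
        }
        return !isSame;
    }

    public static void main(String[] args) {
        String spec = "<specification/>";
        // A schedule-only change leaves all three inputs unchanged.
        System.out.println(shouldResetSeedingVersion(true, spec, spec, true));
        // Any single difference is enough to trigger a reset.
        System.out.println(shouldResetSeedingVersion(true, spec, "<specification x=\"1\"/>", true));
    }
}
```

Logging the three inputs separately, rather than only the final flag, would show which comparison is responsible if a reset does occur.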

Unfortunately I don't have a connector I can use locally that uses versioning information. I could write a test connector given time but it would not duplicate your pipeline environment etc. It may be easier for you to just try it out in your environment with diagnostics in place. This code is in JobManager.java, and I will need to know what version of MCF you have deployed. I can create a ticket and attach a patch that has the needed diagnostics. Please let me know if that will work for you.

Thanks,
Karl


On Fri, Apr 8, 2016 at 2:31 PM, Karl Wright <daddywri@gmail.com> wrote:
Even further downstream, it still all looks good:

>>>>>>
Jetty started.
Starting crawler...
Scheduled job start; requestMinimum = true
Starting job with requestMinimum = true
When starting the job, requestMinimum = true
<<<<<<

So at the moment I am at a loss.

Karl


On Fri, Apr 8, 2016 at 2:20 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Radko,

I set the same settings you did and instrumented the code. It records the minimum job request:

>>>>>>
Jetty started.
Starting crawler...
Scheduled job start; requestMinimum = true
Starting job with requestMinimum = true
<<<<<<

This is the first run of the job, and the first time the schedule has been used, just in case you are convinced this has something to do with scheduled vs. non-scheduled job runs.

I am going to add more instrumentation to see if there is any chance there's a problem further downstream.

Karl


On Fri, Apr 8, 2016 at 1:06 PM, Najman, Radko wrote:
Thanks a lot Karl!

Here are the steps I did:
  1. Run the job manually – it took a few hours.
  2. Manually run the same job as “minimal” – it was done in a minute.
  3. Set up a scheduled “minimal” run – it again took a few hours, as in the first step.
  4. Scheduled runs on the other days were fast, as in step 2.
Thanks for your comments, I’ll continue with it on Monday.

Have a nice weekend,
Radko



From: Karl Wright <daddywri@gmail.com>
Reply-To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Date: Friday 8 April 2016 at 17:18
To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Subject: Re: Scheduled ManifoldCF jobs

Also, going back in this thread a bit, let's make sure we are on the same page:

>>>>>>
I want to schedule these jobs for daily runs. I’m experiencing that the first scheduled run takes the same time as when I ran the job for the first time manually. It seems it is recrawling all documents. Next scheduled runs are fast, a few minutes. Is it expected behaviour?
<<<<<<

If the first scheduled run is a complete crawl (meaning you did not select the "Minimal" setting for the schedule record), you *can* expect the job to look at all the documents. The reason is that Documentum does not give us any information about document deletions. We have to figure that out ourselves, and the only way to do it is to look at all the individual documents. The documents do not have to actually be crawled, but the connector *does* need to at least assemble its version identifier string, which requires an interaction with Documentum.

So unless you have "Minimal" crawls selected everywhere, which won't ever detect deletions, you have to live with the time spent looking for deletions. We recommend that you do this at least occasionally, but certainly you wouldn't want to do it more than a couple of times a month, I would think.
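
To make the reasoning about deletions concrete, here is a minimal sketch of why only a complete crawl can find them. It is not how ManifoldCF tracks documents internally; `DeletionDetectionSketch`, `findDeleted`, and the document identifiers are all hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; the names and the set-based model are assumptions.
class DeletionDetectionSketch {
    /** Documents that were indexed earlier but were not seen during a complete scan. */
    static Set<String> findDeleted(Set<String> previouslyIndexed, Set<String> seenInFullScan) {
        Set<String> deleted = new TreeSet<>(previouslyIndexed);
        deleted.removeAll(seenInFullScan);
        return deleted;
    }

    public static void main(String[] args) {
        Set<String> indexed = new TreeSet<>(Arrays.asList("doc1", "doc2", "doc3"));

        // Complete crawl: every document still in the repository is visited,
        // so anything missing from the visited set must have been deleted.
        Set<String> fullScan = new TreeSet<>(Arrays.asList("doc1", "doc3"));
        System.out.println(findDeleted(indexed, fullScan));

        // A minimal crawl visits only documents reported as changed, so a document
        // missing from that set may simply be unchanged; the same comparison
        // would be meaningless there and is deliberately not performed.
    }
}
```

The cost Karl describes comes from building the complete set of visited documents, which requires touching every document at least once.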

Hope this helps.
Karl


On Fri, Apr 8, 2016 at 10:54 AM, Karl Wright <daddywri@gmail.com> wrote:
There's one slightly funky thing about the Documentum connector that tries to compensate for clock skew as follows:

>>>>>>
// There seems to be some unexplained slop in the latest DCTM version.  It misses documents depending on how close to the r_modify_date you happen to be.
// So, I've decreased the start time by a full five minutes, to insure overlap.
if (startTime > 300000L)
  startTime = startTime - 300000L;
else
  startTime = 0L;
StringBuilder strDQLend = new StringBuilder(" where r_modify_date >= " + buildDateString(startTime) +
  " and r_modify_date<=" + buildDateString(seedTime) +
  " AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND r_content_size>0");

<<<<<<

The 300000 ms adjustment is five minutes, which doesn't seem like a lot but maybe it is affecting your testing?
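
The arithmetic of that adjustment can be checked in isolation with a small sketch. The class `SeedWindowSketch` and the constant name are hypothetical; only the 300000 ms value and the clamp to zero come from the snippet above.

```java
// Hypothetical stand-alone version of the start-time adjustment; not connector code.
class SeedWindowSketch {
    // Five minutes in milliseconds: 5 * 60 * 1000 = 300000.
    static final long OVERLAP_MS = 5L * 60L * 1000L;

    /** Moves the lower bound of the seeding window back by five minutes, never below zero. */
    static long adjustedStartTime(long startTime) {
        return startTime > OVERLAP_MS ? startTime - OVERLAP_MS : 0L;
    }

    public static void main(String[] args) {
        System.out.println(adjustedStartTime(1_000_000L)); // 700000
        System.out.println(adjustedStartTime(120_000L));   // 0
    }
}
```

One consequence is that a document modified within five minutes before the previous seeding time falls into the next window as well, so a test run shortly after a change may see it picked up twice.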

Karl

On Fri, Apr 8, 2016 at 10:50 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Radko,

There's no magic here; the seedingversion from the database is passed to the connector method which seeds documents. The only way this version gets cleared is if you save the job and the document specification changes.

The only other possibility I can think of is that the documentum connector is ignoring the seedingversion information. I will look into this further over the weekend.

Karl


On Fri, Apr 8, 2016 at 10:33 AM, Najman, Radko wrote:
Hi Karl,

thanks for your clarification.

I’m not changing any document specification information. I just set “Scheduled time” and “Job invocation” on the “Scheduling” tab and “Start method” on the “Connection” tab, and click the “Save” button. That’s all.

I tried to set all the scheduling information directly in the Postgres database to be sure I didn’t change any document specification information, and the result was the same: all documents were recrawled.

One more thing I tried was to update “seedingversion” in the “jobs” table, but again all documents were recrawled.

Thanks,
Radko

From: Karl Wright <daddywri@gmail.com>
Reply-To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Date: Friday 1 April 2016 at 14:30
To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Subject: Re: Scheduled ManifoldCF jobs

Sorry, that response was *almost* incoherent. :-)

Trying again:

As far as how MCF computes incremental changes, it does not matter whether a job is run on schedule, or manually. But if you change certain aspects of the job, namely the document specification information, MCF "starts over" at the beginning of time. It needs to do that because you might well have made changes to the document specification that could change the way documents are indexed.

Thanks,
Karl

On Fri, Apr 1, 2016 at 6:36 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Radko,

For computing how MCF does job crawling, it does not care whether the job is run manually or by schedule.

The issue is likely to be that you changed some other detail about the job definition that might have affected how documents are indexed. In that case, MCF would cause all documents to be recrawled because of that. Changes to a job's document specification information will cause that to be the case.

Thanks,
Karl

On Fri, Apr 1, 2016 at 3:40 AM, Najman, Radko wrote:
Hello,

I have a few jobs crawling documents from Documentum. Some of these jobs are quite big and the first run of the job takes a few hours or a day to finish. Then, when I do a “minimal run” for updates, the job is usually done in a few minutes.

I want to schedule these jobs for daily runs. I’m experiencing that the first scheduled run takes the same time as when I ran the job for the first time manually. It seems it is recrawling all documents. Next scheduled runs are fast, a few minutes. Is it expected behaviour? I would expect the first scheduled run to be fast too, because the job had already finished before by manual start. Is there a way to avoid recrawling all documents in this case? It’s a really time-consuming operation.

My settings:
Schedule type: Scan every document once
Job invocation: Minimal
Scheduled time: once a day
Start method: Start when schedule window starts

Thank you,
Radko

Notice: This e-mail message, together with any attachments, contains
information of Merck & Co., Inc. (2000 Galloping Hill Road, Kenilworth,
New Jersey, USA 07033), and/or its affiliates Direct contact information
for affiliates is available at http://www.merck.com/contact/contacts.html) that may be confidential,
proprietary copyrighted and/or legally privileged. It is intended solely
for the use of the individual or entity named on this message. If you are
not the intended recipient, and have received this message in error,
please notify us immediately by reply e-mail and then delete it from
your system.


