Subject: Re: Continuous crawling
From: Karl Wright <daddywri@gmail.com>
To: user@manifoldcf.apache.org
Date: Wed, 5 Feb 2014 08:40:52 -0500

Any luck with this? Karl On Tue, Feb 4, 2014 at 4:15 PM, Karl Wright wrote: > I've created a branch at: > https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-880 . > This contains my proposed fix; please try it out. If you would like, I can > also attach a patch, although I'm not certain it would apply properly onto > MCF 1.4.1 sources. > > Karl > > > > On Tue, Feb 4, 2014 at 2:37 PM, Karl Wright wrote: > >> Hi Florian, >> >> I'm pretty sure now that what is happening is that your output connector >> is throwing some kind of exception when it is asked to remove documents >> during the cleanup phase of the crawl. The state transitions in the >> framework seem to be incorrect under these conditions, and the error is >> likely not logged into the job's error field.
The ticket I've created to >> address this is CONNECTORS-880. >> >> Karl >> >> >> >> On Tue, Feb 4, 2014 at 2:14 PM, Karl Wright wrote: >> >>> The code path for an abort sequence looks pretty iron-clad. The >>> bad-case output: >>> >>> >>> >>>>>> >>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job >>> 1385573203052 >>> for shutdown >>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job >>> 1385573203052 in need of notification >>> <<<<<< >>> >>> is not including: >>> >>> >>> >>>>>> >>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now >>> completed >>> <<<<<< >>> >>> is very significant, because it is in that method that the last-check >>> time would be updated typically, in the method JobManager.finishJob(). If >>> an abort took place, it would have started BEFORE all this; once the job >>> state gets set to STATUS_SHUTTINGDOWN, there is no way that the job can be >>> aborted either manually or by repository-connector related activity. At >>> that time the job is cleaning up documents that are no longer reachable. I >>> will check to see what happens if the output connector throws an exception >>> during this phase; it's the only thing I can think of that might >>> potentially derail the job from finishing. >>> >>> Karl >>> >>> >>> >>> On Tue, Feb 4, 2014 at 1:29 PM, Karl Wright wrote: >>> >>>> Hi Florian, >>>> >>>> The only way this can happen is if the proper job termination state >>>> sequence does not take place. When MCF checks to see if a job should be >>>> started, if it determines that the answer is "no" it updates the job record >>>> immediately with a new "last checked" value. But if it starts the job, it >>>> waits for the job completion to take place before updating the job's "last >>>> checked" time. When a job aborts, at first glance it looks like it also >>>> does the right thing, but clearly that's not true, and there must be a bug >>>> somewhere in how this condition is handled. >>>> >>>> I'll create a ticket to research this. In the interim, I suggest you >>>> figure out why your job is aborting in the first place. >>>> >>>> Thanks, >>>> Karl >>>> >>>> >>>> On Tue, Feb 4, 2014 at 11:49 AM, Karl Wright wrote: >>>> >>>>> Hi Florian, >>>>> >>>>> I do not expect errors to appear in the tomcat log. >>>>> >>>>> But this is interesting: >>>>> >>>>> Good: >>>>> >>>>> >>>>>> >>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job >>>>> 1385573203052 needs to be started; it was last checked at >>>>> 1391439592120, >>>>> and now it is 1391439602151 >>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Time match FOUND >>>>> within interval 1391439592120 to 1391439602151 >>>>> ... >>>>> >>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job >>>>> 1385573203052 needs to be started; it was last checked at >>>>> 1391440412615, >>>>> and now it is 1391440427102 >>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - No time match found >>>>> within interval 1391440412615 to 1391440427102 >>>>> <<<<<< >>>>> "last checked" time for job is updated. >>>>> >>>>> Bad: >>>>> >>>>> >>>>>> >>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job >>>>> 1385573203052 needs to be started; it was last checked at >>>>> 1391446794075, >>>>> and now it is 1391446804106 >>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Time match FOUND >>>>> within interval 1391446794075 to 1391446804106 >>>>> ... 
>>>>> >>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job >>>>> 1385573203052 needs to be started; it was last checked at >>>>> 1391446794075, >>>>> and now it is 1391447647733 >>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Time match FOUND >>>>> within interval 1391446794075 to 1391447647733 >>>>> <<<<<< >>>>> Note that the "last checked" time is NOT updated. >>>>> >>>>> I don't understand why, in one case, the "last checked" time is being >>>>> updated for the job, and is not in another case. I will look to see if >>>>> there is any way in the code that this can happen. >>>>> >>>>> Karl >>>>> >>>>> >>>>> >>>>> On Tue, Feb 4, 2014 at 10:45 AM, Florian Schmedding < >>>>> schmeddi@informatik.uni-freiburg.de> wrote: >>>>> >>>>>> Hi Karl, >>>>>> >>>>>> there are no errors in the Tomcat logs. Currently, the Manifold log >>>>>> contains only the job log messages (>>>>> name="org.apache.manifoldcf.jobs" value="ALL"/>). I include two log >>>>>> snippets, one from a normal run, and one where the job got repeated >>>>>> two >>>>>> times. I noticed the thread sequence "Finisher - Job reset - Job >>>>>> notification" when the job finally terminates, and the thread sequence >>>>>> "Finisher - Job notification" when the job gets restarted again >>>>>> instead of >>>>>> terminating. >>>>>> >>>>>> >>>>>> DEBUG 2014-02-03 15:59:52,130 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391439582108, >>>>>> and now it is 1391439592119 >>>>>> DEBUG 2014-02-03 15:59:52,131 (Job start thread) - No time match >>>>>> found >>>>>> within interval 1391439582108 to 1391439592119 >>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391439592120, >>>>>> and now it is 1391439602151 >>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Time match FOUND >>>>>> within interval 1391439592120 to 1391439602151 >>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Job >>>>>> '1385573203052' is >>>>>> within run window at 1391439602151 ms. (which starts at 1391439600000 >>>>>> ms.) 
>>>>>> DEBUG 2014-02-03 16:00:02,288 (Job start thread) - Signalled for job >>>>>> start >>>>>> for job 1385573203052 >>>>>> DEBUG 2014-02-03 16:00:11,319 (Startup thread) - Marked job >>>>>> 1385573203052 >>>>>> for startup >>>>>> DEBUG 2014-02-03 16:00:12,719 (Startup thread) - Job 1385573203052 is >>>>>> now >>>>>> started >>>>>> DEBUG 2014-02-03 16:13:30,234 (Finisher thread) - Marked job >>>>>> 1385573203052 >>>>>> for shutdown >>>>>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 >>>>>> now >>>>>> completed >>>>>> DEBUG 2014-02-03 16:13:37,541 (Job notification thread) - Found job >>>>>> 1385573203052 in need of notification >>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391440412615, >>>>>> and now it is 1391440427102 >>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - No time match >>>>>> found >>>>>> within interval 1391440412615 to 1391440427102 >>>>>> >>>>>> >>>>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391446784053, >>>>>> and now it is 1391446794074 >>>>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) - No time match >>>>>> found >>>>>> within interval 1391446784053 to 1391446794074 >>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391446794075, >>>>>> and now it is 1391446804106 >>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Time match FOUND >>>>>> within interval 1391446794075 to 1391446804106 >>>>>> DEBUG 2014-02-03 18:00:04,110 (Job start thread) - Job >>>>>> '1385573203052' is >>>>>> within run window at 1391446804106 ms. (which starts at 1391446800000 >>>>>> ms.) >>>>>> DEBUG 2014-02-03 18:00:04,178 (Job start thread) - Signalled for job >>>>>> start >>>>>> for job 1385573203052 >>>>>> DEBUG 2014-02-03 18:00:11,710 (Startup thread) - Marked job >>>>>> 1385573203052 >>>>>> for startup >>>>>> DEBUG 2014-02-03 18:00:13,408 (Startup thread) - Job 1385573203052 is >>>>>> now >>>>>> started >>>>>> DEBUG 2014-02-03 18:14:04,286 (Finisher thread) - Marked job >>>>>> 1385573203052 >>>>>> for shutdown >>>>>> DEBUG 2014-02-03 18:14:06,777 (Job notification thread) - Found job >>>>>> 1385573203052 in need of notification >>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391446794075, >>>>>> and now it is 1391447647733 >>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Time match FOUND >>>>>> within interval 1391446794075 to 1391447647733 >>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Job >>>>>> '1385573203052' is >>>>>> within run window at 1391447647733 ms. (which starts at 1391446800000 >>>>>> ms.) >>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391446794075, >>>>>> and now it is 1391447657740 >>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Time match FOUND >>>>>> within interval 1391446794075 to 1391447657740 >>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Job >>>>>> '1385573203052' is >>>>>> within run window at 1391447657740 ms. (which starts at 1391446800000 >>>>>> ms.) 
>>>>>> DEBUG 2014-02-03 18:14:17,899 (Job start thread) - Signalled for job >>>>>> start >>>>>> for job 1385573203052 >>>>>> DEBUG 2014-02-03 18:14:26,787 (Startup thread) - Marked job >>>>>> 1385573203052 >>>>>> for startup >>>>>> DEBUG 2014-02-03 18:14:28,636 (Startup thread) - Job 1385573203052 is >>>>>> now >>>>>> started >>>>>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job >>>>>> 1385573203052 >>>>>> for shutdown >>>>>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job >>>>>> 1385573203052 in need of notification >>>>>> DEBUG 2014-02-03 18:27:59,356 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391446794075, >>>>>> and now it is 1391448479353 >>>>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) - Time match FOUND >>>>>> within interval 1391446794075 to 1391448479353 >>>>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) - Job >>>>>> '1385573203052' is >>>>>> within run window at 1391448479353 ms. (which starts at 1391446800000 >>>>>> ms.) >>>>>> DEBUG 2014-02-03 18:27:59,430 (Job start thread) - Signalled for job >>>>>> start >>>>>> for job 1385573203052 >>>>>> DEBUG 2014-02-03 18:28:09,309 (Startup thread) - Marked job >>>>>> 1385573203052 >>>>>> for startup >>>>>> DEBUG 2014-02-03 18:28:10,727 (Startup thread) - Job 1385573203052 is >>>>>> now >>>>>> started >>>>>> DEBUG 2014-02-03 18:41:18,202 (Finisher thread) - Marked job >>>>>> 1385573203052 >>>>>> for shutdown >>>>>> DEBUG 2014-02-03 18:41:23,636 (Job reset thread) - Job 1385573203052 >>>>>> now >>>>>> completed >>>>>> DEBUG 2014-02-03 18:41:25,368 (Job notification thread) - Found job >>>>>> 1385573203052 in need of notification >>>>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391449283114, >>>>>> and now it is 1391449292400 >>>>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) - No time match >>>>>> found >>>>>> within interval 1391449283114 to 1391449292400 >>>>>> >>>>>> >>>>>> Do you need another log output? >>>>>> >>>>>> Best, >>>>>> Florian >>>>>> >>>>>> > Also, what does the log have to say? If there is an error aborting >>>>>> the >>>>>> > job, there should be some record of it in the manifoldcf.log. >>>>>> > >>>>>> > Thanks, >>>>>> > Karl >>>>>> > >>>>>> > >>>>>> > On Tue, Feb 4, 2014 at 6:16 AM, Karl Wright >>>>>> wrote: >>>>>> > >>>>>> >> Hi Florian, >>>>>> >> >>>>>> >> Please run the job manually, when outside the scheduling window or >>>>>> with >>>>>> >> the scheduling off. What is the reason for the job abort? >>>>>> >> >>>>>> >> Karl >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> >> On Tue, Feb 4, 2014 at 3:30 AM, Florian Schmedding < >>>>>> >> schmeddi@informatik.uni-freiburg.de> wrote: >>>>>> >> >>>>>> >>> Hi Karl, >>>>>> >>> >>>>>> >>> yes, I've coincidentally seen "Aborted" in the end time column >>>>>> when I >>>>>> >>> refreshed the job status just after the number of active >>>>>> documents was >>>>>> >>> zero. At the next refresh the job was starting up. After looking >>>>>> in the >>>>>> >>> history I found out that it even started a third time. You can >>>>>> see the >>>>>> >>> history of a single day below (job continue, end, start, stop, >>>>>> unwait, >>>>>> >>> wait). The start method is "Start at beginning of schedule >>>>>> window". Job >>>>>> >>> invocation is "complete". Hop count mode is "Delete unreachable >>>>>> >>> documents". 
>>>>>> >>> >>>>>> >>> 02.03.2014 18:41 job end >>>>>> >>> 02.03.2014 18:28 job start >>>>>> >>> 02.03.2014 18:14 job start >>>>>> >>> 02.03.2014 18:00 job start >>>>>> >>> 02.03.2014 17:49 job end >>>>>> >>> 02.03.2014 17:27 job end >>>>>> >>> 02.03.2014 17:13 job start >>>>>> >>> 02.03.2014 17:00 job start >>>>>> >>> 02.03.2014 16:13 job end >>>>>> >>> 02.03.2014 16:00 job start >>>>>> >>> 02.03.2014 15:41 job end >>>>>> >>> 02.03.2014 15:27 job start >>>>>> >>> 02.03.2014 15:14 job start >>>>>> >>> 02.03.2014 15:00 job start >>>>>> >>> 02.03.2014 14:13 job end >>>>>> >>> 02.03.2014 14:00 job start >>>>>> >>> 02.03.2014 13:13 job end >>>>>> >>> 02.03.2014 13:00 job start >>>>>> >>> 02.03.2014 12:27 job end >>>>>> >>> 02.03.2014 12:14 job start >>>>>> >>> 02.03.2014 12:00 job start >>>>>> >>> 02.03.2014 11:13 job end >>>>>> >>> 02.03.2014 11:00 job start >>>>>> >>> 02.03.2014 10:13 job end >>>>>> >>> 02.03.2014 10:00 job start >>>>>> >>> 02.03.2014 09:29 job end >>>>>> >>> 02.03.2014 09:14 job start >>>>>> >>> 02.03.2014 09:00 job start >>>>>> >>> >>>>>> >>> Best, >>>>>> >>> Florian >>>>>> >>> >>>>>> >>> >>>>>> >>> > Hi Florian, >>>>>> >>> > >>>>>> >>> > Jobs don't just abort randomly. Are you sure that the job >>>>>> aborted? >>>>>> >>> Or >>>>>> >>> > did >>>>>> >>> > it just restart? >>>>>> >>> > >>>>>> >>> > As for "is this normal", it depends on how you have created >>>>>> your job. >>>>>> >>> If >>>>>> >>> > you selected the "Start within schedule window" selection, MCF >>>>>> will >>>>>> >>> > restart >>>>>> >>> > the job whenever it finishes and run it until the end of the >>>>>> >>> scheduling >>>>>> >>> > window. >>>>>> >>> > >>>>>> >>> > Karl >>>>>> >>> > >>>>>> >>> > >>>>>> >>> > >>>>>> >>> > On Mon, Feb 3, 2014 at 12:24 PM, Florian Schmedding < >>>>>> >>> > schmeddi@informatik.uni-freiburg.de> wrote: >>>>>> >>> > >>>>>> >>> >> Hi Karl, >>>>>> >>> >> >>>>>> >>> >> I've just observed that the job was started according to its >>>>>> >>> schedule >>>>>> >>> >> and >>>>>> >>> >> crawled all documents correctly (I've chosen to re-ingest all >>>>>> >>> documents >>>>>> >>> >> before the run). However, after finishing the last document >>>>>> (zero >>>>>> >>> active >>>>>> >>> >> documents) it was somehow aborted and restarted immediately. >>>>>> Is this >>>>>> >>> an >>>>>> >>> >> expected behavior? >>>>>> >>> >> >>>>>> >>> >> Best, >>>>>> >>> >> Florian >>>>>> >>> >> >>>>>> >>> >> >>>>>> >>> >> > Hi Florian, >>>>>> >>> >> > >>>>>> >>> >> > Based on this schedule, your crawls will be able to start >>>>>> whenever >>>>>> >>> the >>>>>> >>> >> > hour >>>>>> >>> >> > turns. So they can start every hour on the hour. If the >>>>>> last >>>>>> >>> crawl >>>>>> >>> >> > crossed an hour boundary, the next crawl will start >>>>>> immediately, I >>>>>> >>> >> > believe. 
Karl

On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <schmeddi@informatik.uni-freiburg.de> wrote:

Hi Karl,

these are the values:
Priority: 5
Start method: Start at beginning of schedule window
Schedule type: Scan every document once
Minimum recrawl interval: Not applicable
Expiration interval: Not applicable
Reseed interval: Not applicable
Scheduled time: Any day of week at 12 am 1 am 2 am 3 am 4 am 5 am 6 am 7 am 8 am 9 am 10 am 11 am 12 pm 1 pm 2 pm 3 pm 4 pm 5 pm 6 pm 7 pm 8 pm 9 pm 10 pm 11 pm
Maximum run time: No limit
Job invocation: Complete

Maybe it is because I've changed the job from continuous crawling to this schedule. I started it a few times manually, too. I couldn't notice anything strange in the job setup or in the respective entries in the database.

Regards,
Florian

> Hi Florian,
>
> I was unable to reproduce the behavior you described.
>
> Could you view your job, and post a screen shot of that page? I want to
> see what your schedule record(s) look like.
>
> Thanks,
> Karl
>
> On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright wrote:
>
>> Hi Florian,
>>
>> I've never noted this behavior before. I'll see if I can reproduce it
>> here.
>>
>> Karl
>>
>> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <
>> schmeddi@informatik.uni-freiburg.de> wrote:
>>
>>> Hi Karl,
>>>
>>> the scheduled job seems to work as expected. However, it runs two
>>> times: It starts at the beginning of the scheduled time, finishes, and
>>> immediately starts again. After finishing the second run it waits for
>>> the next scheduled time. Why does it run two times? The start method is
>>> "Start at beginning of schedule window".
>>>
>>> Yes, you're right about the checking guarantee. Currently, our interval
>>> is long enough for a complete crawler run.
>>>>>> >>> >> >> >>> >>>>>> >>> >> >> >>> Best, >>>>>> >>> >> >> >>> Florian >>>>>> >>> >> >> >>> >>>>>> >>> >> >> >>> >>>>>> >>> >> >> >>> > Hi Florian, >>>>>> >>> >> >> >>> > >>>>>> >>> >> >> >>> > It is impossible to *guarantee* that a document will >>>>>> be >>>>>> >>> >> checked, >>>>>> >>> >> >> >>> because >>>>>> >>> >> >> >>> > if >>>>>> >>> >> >> >>> > load on the crawler is high enough, it will fall >>>>>> behind. >>>>>> >>> But >>>>>> >>> I >>>>>> >>> >> >> will >>>>>> >>> >> >> >>> look >>>>>> >>> >> >> >>> > into adding the feature you request. >>>>>> >>> >> >> >>> > >>>>>> >>> >> >> >>> > Karl >>>>>> >>> >> >> >>> > >>>>>> >>> >> >> >>> > >>>>>> >>> >> >> >>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding < >>>>>> >>> >> >> >>> > schmeddi@informatik.uni-freiburg.de> wrote: >>>>>> >>> >> >> >>> > >>>>>> >>> >> >> >>> >> Hi Karl, >>>>>> >>> >> >> >>> >> >>>>>> >>> >> >> >>> >> yes, in our case it is necessary to make sure that >>>>>> new >>>>>> >>> >> documents >>>>>> >>> >> >> are >>>>>> >>> >> >> >>> >> discovered and indexed within a certain interval. I >>>>>> have >>>>>> >>> >> created >>>>>> >>> >> >> a >>>>>> >>> >> >> >>> >> feature >>>>>> >>> >> >> >>> >> request on that. In the meantime we will try to use a >>>>>> >>> >> scheduled >>>>>> >>> >> >> job >>>>>> >>> >> >> >>> >> instead. >>>>>> >>> >> >> >>> >> >>>>>> >>> >> >> >>> >> Thanks for your help, >>>>>> >>> >> >> >>> >> Florian >>>>>> >>> >> >> >>> >> >>>>>> >>> >> >> >>> >> >>>>>> >>> >> >> >>> >> > Hi Florian, >>>>>> >>> >> >> >>> >> > >>>>>> >>> >> >> >>> >> > What you are seeing is "dynamic crawling" >>>>>> behavior. The >>>>>> >>> >> time >>>>>> >>> >> >> >>> between >>>>>> >>> >> >> >>> >> > refetches of a document is based on the history of >>>>>> >>> fetches >>>>>> >>> >> of >>>>>> >>> >> >> that >>>>>> >>> >> >> >>> >> > document. The recrawl interval is the initial time >>>>>> >>> between >>>>>> >>> >> >> >>> document >>>>>> >>> >> >> >>> >> > fetches, but if a document does not change, the >>>>>> interval >>>>>> >>> for >>>>>> >>> >> >> the >>>>>> >>> >> >> >>> >> document >>>>>> >>> >> >> >>> >> > increases according to a formula. >>>>>> >>> >> >> >>> >> > >>>>>> >>> >> >> >>> >> > I would need to look at the code to be able to >>>>>> give you >>>>>> >>> the >>>>>> >>> >> >> >>> precise >>>>>> >>> >> >> >>> >> > formula, but if you need a limit on the amount of >>>>>> time >>>>>> >>> >> between >>>>>> >>> >> >> >>> >> document >>>>>> >>> >> >> >>> >> > fetch attempts, I suggest you create a ticket and >>>>>> I will >>>>>> >>> >> look >>>>>> >>> >> >> into >>>>>> >>> >> >> >>> >> adding >>>>>> >>> >> >> >>> >> > that as a feature. >>>>>> >>> >> >> >>> >> > >>>>>> >>> >> >> >>> >> > Thanks, >>>>>> >>> >> >> >>> >> > Karl >>>>>> >>> >> >> >>> >> > >>>>>> >>> >> >> >>> >> > >>>>>> >>> >> >> >>> >> > >>>>>> >>> >> >> >>> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding >>>>>> < >>>>>> >>> >> >> >>> >> > schmeddi@informatik.uni-freiburg.de> wrote: >>>>>> >>> >> >> >>> >> > >>>>>> >>> >> >> >>> >> >> Hello, >>>>>> >>> >> >> >>> >> >> >>>>>> >>> >> >> >>> >> >> the parameters reseed interval and recrawl >>>>>> interval of >>>>>> >>> a >>>>>> >>> >> >> >>> continuous >>>>>> >>> >> >> >>> >> >> crawling job are not quite clear to me. 
The documentation tells that the reseed interval is the time after which the seeds are checked again, and the recrawl interval is the time after which a document is checked for changes.

However, we observed that the recrawl interval for a document increases after each check. On the other hand, the reseed interval seems to be set up correctly in the database metadata about the seed documents. Yet the web server does not receive requests at each time the interval elapses but only after several intervals have elapsed.

We are using a web connector. The web server does not tell the client to cache the documents. Any help would be appreciated.

Best regards,
Florian
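
To make the dynamic-crawling behaviour Karl describes above more concrete, here is a minimal sketch of an interval that grows while a document keeps coming back unchanged. The doubling factor, the cap, and all names are illustrative assumptions; the precise ManifoldCF formula is not quoted anywhere in this thread.

// Illustrative sketch only -- not ManifoldCF's actual recrawl formula.
// The growth factor and cap below are assumptions made for the example.
public class RecrawlIntervalSketch {
  public static void main(String[] args) {
    long intervalMs = 60L * 60L * 1000L;          // configured recrawl interval: 1 hour
    final long capMs = 24L * 60L * 60L * 1000L;   // assumed upper bound: 1 day
    final double growth = 2.0;                    // assumed growth per unchanged fetch

    // Each time the document is fetched and found unchanged, the next fetch is
    // pushed further out, so the web server sees requests less and less often
    // even though the configured interval never changes.
    for (int fetch = 1; fetch <= 6; fetch++) {
      System.out.printf("fetch %d: next check in %d minutes%n", fetch, intervalMs / 60000L);
      intervalMs = Math.min(capMs, (long) (intervalMs * growth));
    }
  }
}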
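
The restart behaviour discussed earlier in the thread comes down to the bookkeeping Karl describes: the job's "last checked" time is advanced immediately when the start check answers "no", but only in the completion step when a job was actually started. The sketch below is illustrative only; the names, fields, and window check are assumptions rather than the actual JobManager code. It shows why a job that never reaches the completed transition keeps matching the same schedule window and is started again.

// Minimal illustrative sketch of the bookkeeping described above -- not the
// actual ManifoldCF JobManager code; names and the window check are assumed.
public class JobStartCheckSketch {

  static final long WINDOW_MS = 60L * 60L * 1000L; // assume a one-hour schedule window

  long lastCheckedTime;  // persisted on the job record in the real system
  boolean jobRunning;

  // Periodic check, as performed by the "Job start thread" in the logs above.
  void checkJob(long now) {
    if (jobRunning) {
      return;                       // a running job is left alone
    }
    if (windowMatches(lastCheckedTime, now)) {
      jobRunning = true;            // start the job...
      // ...but deliberately do NOT advance lastCheckedTime here; per the
      // discussion it is only advanced when the job completes normally.
    } else {
      lastCheckedTime = now;        // a "no" answer updates the record immediately
    }
  }

  // Completion step (cf. JobManager.finishJob()). If the job instead drops out
  // of the running state without reaching this step -- the failure mode being
  // debugged in this thread -- the old "last checked" time keeps matching the
  // window and checkJob() starts the job again.
  void finishJob(long now) {
    jobRunning = false;
    lastCheckedTime = now;
  }

  boolean windowMatches(long from, long to) {
    // Illustrative check: did an hour boundary fall between the two times?
    return to / WINDOW_MS > from / WINDOW_MS;
  }
}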