Subject: Re: Continuous crawling
From: Karl Wright <daddywri@gmail.com>
To: user@manifoldcf.apache.org
Date: Wed, 5 Feb 2014 08:40:52 -0500

Any luck with this? Karl On Tue, Feb 4, 2014 at 4:15 PM, Karl Wright wrote: > I've created a branch at: > https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-880 . > This contains my proposed fix; please try it out. If you would like, I can > also attach a patch, although I'm not certain it would apply properly onto > MCF 1.4.1 sources. > > Karl > > > > On Tue, Feb 4, 2014 at 2:37 PM, Karl Wright wrote: > >> Hi Florian, >> >> I'm pretty sure now that what is happening is that your output connector >> is throwing some kind of exception when it is asked to remove documents >> during the cleanup phase of the crawl. The state transitions in the >> framework seem to be incorrect under these conditions, and the error is >> likely not logged into the job's error field.
The ticket I've created to >> address this is CONNECTORS-880. >> >> Karl >> >> >> >> On Tue, Feb 4, 2014 at 2:14 PM, Karl Wright wrote: >> >>> The code path for an abort sequence looks pretty iron-clad. The >>> bad-case output: >>> >>> >>> >>>>>> >>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job >>> 1385573203052 >>> for shutdown >>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job >>> 1385573203052 in need of notification >>> <<<<<< >>> >>> is not including: >>> >>> >>> >>>>>> >>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now >>> completed >>> <<<<<< >>> >>> is very significant, because it is in that method that the last-check >>> time would be updated typically, in the method JobManager.finishJob(). If >>> an abort took place, it would have started BEFORE all this; once the job >>> state gets set to STATUS_SHUTTINGDOWN, there is no way that the job can be >>> aborted either manually or by repository-connector related activity. At >>> that time the job is cleaning up documents that are no longer reachable. I >>> will check to see what happens if the output connector throws an exception >>> during this phase; it's the only thing I can think of that might >>> potentially derail the job from finishing. >>> >>> Karl >>> >>> >>> >>> On Tue, Feb 4, 2014 at 1:29 PM, Karl Wright wrote: >>> >>>> Hi Florian, >>>> >>>> The only way this can happen is if the proper job termination state >>>> sequence does not take place. When MCF checks to see if a job should be >>>> started, if it determines that the answer is "no" it updates the job record >>>> immediately with a new "last checked" value. But if it starts the job, it >>>> waits for the job completion to take place before updating the job's "last >>>> checked" time. When a job aborts, at first glance it looks like it also >>>> does the right thing, but clearly that's not true, and there must be a bug >>>> somewhere in how this condition is handled. >>>> >>>> I'll create a ticket to research this. In the interim, I suggest you >>>> figure out why your job is aborting in the first place. >>>> >>>> Thanks, >>>> Karl >>>> >>>> >>>> On Tue, Feb 4, 2014 at 11:49 AM, Karl Wright wrote: >>>> >>>>> Hi Florian, >>>>> >>>>> I do not expect errors to appear in the tomcat log. >>>>> >>>>> But this is interesting: >>>>> >>>>> Good: >>>>> >>>>> >>>>>> >>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job >>>>> 1385573203052 needs to be started; it was last checked at >>>>> 1391439592120, >>>>> and now it is 1391439602151 >>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Time match FOUND >>>>> within interval 1391439592120 to 1391439602151 >>>>> ... >>>>> >>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job >>>>> 1385573203052 needs to be started; it was last checked at >>>>> 1391440412615, >>>>> and now it is 1391440427102 >>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - No time match found >>>>> within interval 1391440412615 to 1391440427102 >>>>> <<<<<< >>>>> "last checked" time for job is updated. >>>>> >>>>> Bad: >>>>> >>>>> >>>>>> >>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job >>>>> 1385573203052 needs to be started; it was last checked at >>>>> 1391446794075, >>>>> and now it is 1391446804106 >>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Time match FOUND >>>>> within interval 1391446794075 to 1391446804106 >>>>> ... 
>>>>> >>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job >>>>> 1385573203052 needs to be started; it was last checked at >>>>> 1391446794075, >>>>> and now it is 1391447647733 >>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Time match FOUND >>>>> within interval 1391446794075 to 1391447647733 >>>>> <<<<<< >>>>> Note that the "last checked" time is NOT updated. >>>>> >>>>> I don't understand why, in one case, the "last checked" time is being >>>>> updated for the job, and is not in another case. I will look to see if >>>>> there is any way in the code that this can happen. >>>>> >>>>> Karl >>>>> >>>>> >>>>> >>>>> On Tue, Feb 4, 2014 at 10:45 AM, Florian Schmedding < >>>>> schmeddi@informatik.uni-freiburg.de> wrote: >>>>> >>>>>> Hi Karl, >>>>>> >>>>>> there are no errors in the Tomcat logs. Currently, the Manifold log >>>>>> contains only the job log messages (>>>>> name="org.apache.manifoldcf.jobs" value="ALL"/>). I include two log >>>>>> snippets, one from a normal run, and one where the job got repeated >>>>>> two >>>>>> times. I noticed the thread sequence "Finisher - Job reset - Job >>>>>> notification" when the job finally terminates, and the thread sequence >>>>>> "Finisher - Job notification" when the job gets restarted again >>>>>> instead of >>>>>> terminating. >>>>>> >>>>>> >>>>>> DEBUG 2014-02-03 15:59:52,130 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391439582108, >>>>>> and now it is 1391439592119 >>>>>> DEBUG 2014-02-03 15:59:52,131 (Job start thread) - No time match >>>>>> found >>>>>> within interval 1391439582108 to 1391439592119 >>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391439592120, >>>>>> and now it is 1391439602151 >>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Time match FOUND >>>>>> within interval 1391439592120 to 1391439602151 >>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Job >>>>>> '1385573203052' is >>>>>> within run window at 1391439602151 ms. (which starts at 1391439600000 >>>>>> ms.) 
>>>>>> DEBUG 2014-02-03 16:00:02,288 (Job start thread) - Signalled for job >>>>>> start >>>>>> for job 1385573203052 >>>>>> DEBUG 2014-02-03 16:00:11,319 (Startup thread) - Marked job >>>>>> 1385573203052 >>>>>> for startup >>>>>> DEBUG 2014-02-03 16:00:12,719 (Startup thread) - Job 1385573203052 is >>>>>> now >>>>>> started >>>>>> DEBUG 2014-02-03 16:13:30,234 (Finisher thread) - Marked job >>>>>> 1385573203052 >>>>>> for shutdown >>>>>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 >>>>>> now >>>>>> completed >>>>>> DEBUG 2014-02-03 16:13:37,541 (Job notification thread) - Found job >>>>>> 1385573203052 in need of notification >>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391440412615, >>>>>> and now it is 1391440427102 >>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - No time match >>>>>> found >>>>>> within interval 1391440412615 to 1391440427102 >>>>>> >>>>>> >>>>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391446784053, >>>>>> and now it is 1391446794074 >>>>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) - No time match >>>>>> found >>>>>> within interval 1391446784053 to 1391446794074 >>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391446794075, >>>>>> and now it is 1391446804106 >>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Time match FOUND >>>>>> within interval 1391446794075 to 1391446804106 >>>>>> DEBUG 2014-02-03 18:00:04,110 (Job start thread) - Job >>>>>> '1385573203052' is >>>>>> within run window at 1391446804106 ms. (which starts at 1391446800000 >>>>>> ms.) >>>>>> DEBUG 2014-02-03 18:00:04,178 (Job start thread) - Signalled for job >>>>>> start >>>>>> for job 1385573203052 >>>>>> DEBUG 2014-02-03 18:00:11,710 (Startup thread) - Marked job >>>>>> 1385573203052 >>>>>> for startup >>>>>> DEBUG 2014-02-03 18:00:13,408 (Startup thread) - Job 1385573203052 is >>>>>> now >>>>>> started >>>>>> DEBUG 2014-02-03 18:14:04,286 (Finisher thread) - Marked job >>>>>> 1385573203052 >>>>>> for shutdown >>>>>> DEBUG 2014-02-03 18:14:06,777 (Job notification thread) - Found job >>>>>> 1385573203052 in need of notification >>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391446794075, >>>>>> and now it is 1391447647733 >>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Time match FOUND >>>>>> within interval 1391446794075 to 1391447647733 >>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Job >>>>>> '1385573203052' is >>>>>> within run window at 1391447647733 ms. (which starts at 1391446800000 >>>>>> ms.) >>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391446794075, >>>>>> and now it is 1391447657740 >>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Time match FOUND >>>>>> within interval 1391446794075 to 1391447657740 >>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Job >>>>>> '1385573203052' is >>>>>> within run window at 1391447657740 ms. (which starts at 1391446800000 >>>>>> ms.) 
>>>>>> DEBUG 2014-02-03 18:14:17,899 (Job start thread) - Signalled for job >>>>>> start >>>>>> for job 1385573203052 >>>>>> DEBUG 2014-02-03 18:14:26,787 (Startup thread) - Marked job >>>>>> 1385573203052 >>>>>> for startup >>>>>> DEBUG 2014-02-03 18:14:28,636 (Startup thread) - Job 1385573203052 is >>>>>> now >>>>>> started >>>>>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job >>>>>> 1385573203052 >>>>>> for shutdown >>>>>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job >>>>>> 1385573203052 in need of notification >>>>>> DEBUG 2014-02-03 18:27:59,356 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391446794075, >>>>>> and now it is 1391448479353 >>>>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) - Time match FOUND >>>>>> within interval 1391446794075 to 1391448479353 >>>>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) - Job >>>>>> '1385573203052' is >>>>>> within run window at 1391448479353 ms. (which starts at 1391446800000 >>>>>> ms.) >>>>>> DEBUG 2014-02-03 18:27:59,430 (Job start thread) - Signalled for job >>>>>> start >>>>>> for job 1385573203052 >>>>>> DEBUG 2014-02-03 18:28:09,309 (Startup thread) - Marked job >>>>>> 1385573203052 >>>>>> for startup >>>>>> DEBUG 2014-02-03 18:28:10,727 (Startup thread) - Job 1385573203052 is >>>>>> now >>>>>> started >>>>>> DEBUG 2014-02-03 18:41:18,202 (Finisher thread) - Marked job >>>>>> 1385573203052 >>>>>> for shutdown >>>>>> DEBUG 2014-02-03 18:41:23,636 (Job reset thread) - Job 1385573203052 >>>>>> now >>>>>> completed >>>>>> DEBUG 2014-02-03 18:41:25,368 (Job notification thread) - Found job >>>>>> 1385573203052 in need of notification >>>>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) - Checking if job >>>>>> 1385573203052 needs to be started; it was last checked at >>>>>> 1391449283114, >>>>>> and now it is 1391449292400 >>>>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) - No time match >>>>>> found >>>>>> within interval 1391449283114 to 1391449292400 >>>>>> >>>>>> >>>>>> Do you need another log output? >>>>>> >>>>>> Best, >>>>>> Florian >>>>>> >>>>>> > Also, what does the log have to say? If there is an error aborting >>>>>> the >>>>>> > job, there should be some record of it in the manifoldcf.log. >>>>>> > >>>>>> > Thanks, >>>>>> > Karl >>>>>> > >>>>>> > >>>>>> > On Tue, Feb 4, 2014 at 6:16 AM, Karl Wright >>>>>> wrote: >>>>>> > >>>>>> >> Hi Florian, >>>>>> >> >>>>>> >> Please run the job manually, when outside the scheduling window or >>>>>> with >>>>>> >> the scheduling off. What is the reason for the job abort? >>>>>> >> >>>>>> >> Karl >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> >> On Tue, Feb 4, 2014 at 3:30 AM, Florian Schmedding < >>>>>> >> schmeddi@informatik.uni-freiburg.de> wrote: >>>>>> >> >>>>>> >>> Hi Karl, >>>>>> >>> >>>>>> >>> yes, I've coincidentally seen "Aborted" in the end time column >>>>>> when I >>>>>> >>> refreshed the job status just after the number of active >>>>>> documents was >>>>>> >>> zero. At the next refresh the job was starting up. After looking >>>>>> in the >>>>>> >>> history I found out that it even started a third time. You can >>>>>> see the >>>>>> >>> history of a single day below (job continue, end, start, stop, >>>>>> unwait, >>>>>> >>> wait). The start method is "Start at beginning of schedule >>>>>> window". Job >>>>>> >>> invocation is "complete". Hop count mode is "Delete unreachable >>>>>> >>> documents". 
>>>>>> >>> >>>>>> >>> 02.03.2014 18:41 job end >>>>>> >>> 02.03.2014 18:28 job start >>>>>> >>> 02.03.2014 18:14 job start >>>>>> >>> 02.03.2014 18:00 job start >>>>>> >>> 02.03.2014 17:49 job end >>>>>> >>> 02.03.2014 17:27 job end >>>>>> >>> 02.03.2014 17:13 job start >>>>>> >>> 02.03.2014 17:00 job start >>>>>> >>> 02.03.2014 16:13 job end >>>>>> >>> 02.03.2014 16:00 job start >>>>>> >>> 02.03.2014 15:41 job end >>>>>> >>> 02.03.2014 15:27 job start >>>>>> >>> 02.03.2014 15:14 job start >>>>>> >>> 02.03.2014 15:00 job start >>>>>> >>> 02.03.2014 14:13 job end >>>>>> >>> 02.03.2014 14:00 job start >>>>>> >>> 02.03.2014 13:13 job end >>>>>> >>> 02.03.2014 13:00 job start >>>>>> >>> 02.03.2014 12:27 job end >>>>>> >>> 02.03.2014 12:14 job start >>>>>> >>> 02.03.2014 12:00 job start >>>>>> >>> 02.03.2014 11:13 job end >>>>>> >>> 02.03.2014 11:00 job start >>>>>> >>> 02.03.2014 10:13 job end >>>>>> >>> 02.03.2014 10:00 job start >>>>>> >>> 02.03.2014 09:29 job end >>>>>> >>> 02.03.2014 09:14 job start >>>>>> >>> 02.03.2014 09:00 job start >>>>>> >>> >>>>>> >>> Best, >>>>>> >>> Florian >>>>>> >>> >>>>>> >>> >>>>>> >>> > Hi Florian, >>>>>> >>> > >>>>>> >>> > Jobs don't just abort randomly. Are you sure that the job >>>>>> aborted? >>>>>> >>> Or >>>>>> >>> > did >>>>>> >>> > it just restart? >>>>>> >>> > >>>>>> >>> > As for "is this normal", it depends on how you have created >>>>>> your job. >>>>>> >>> If >>>>>> >>> > you selected the "Start within schedule window" selection, MCF >>>>>> will >>>>>> >>> > restart >>>>>> >>> > the job whenever it finishes and run it until the end of the >>>>>> >>> scheduling >>>>>> >>> > window. >>>>>> >>> > >>>>>> >>> > Karl >>>>>> >>> > >>>>>> >>> > >>>>>> >>> > >>>>>> >>> > On Mon, Feb 3, 2014 at 12:24 PM, Florian Schmedding < >>>>>> >>> > schmeddi@informatik.uni-freiburg.de> wrote: >>>>>> >>> > >>>>>> >>> >> Hi Karl, >>>>>> >>> >> >>>>>> >>> >> I've just observed that the job was started according to its >>>>>> >>> schedule >>>>>> >>> >> and >>>>>> >>> >> crawled all documents correctly (I've chosen to re-ingest all >>>>>> >>> documents >>>>>> >>> >> before the run). However, after finishing the last document >>>>>> (zero >>>>>> >>> active >>>>>> >>> >> documents) it was somehow aborted and restarted immediately. >>>>>> Is this >>>>>> >>> an >>>>>> >>> >> expected behavior? >>>>>> >>> >> >>>>>> >>> >> Best, >>>>>> >>> >> Florian >>>>>> >>> >> >>>>>> >>> >> >>>>>> >>> >> > Hi Florian, >>>>>> >>> >> > >>>>>> >>> >> > Based on this schedule, your crawls will be able to start >>>>>> whenever >>>>>> >>> the >>>>>> >>> >> > hour >>>>>> >>> >> > turns. So they can start every hour on the hour. If the >>>>>> last >>>>>> >>> crawl >>>>>> >>> >> > crossed an hour boundary, the next crawl will start >>>>>> immediately, I >>>>>> >>> >> > believe. 
Karl

On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <schmeddi@informatik.uni-freiburg.de> wrote:

Hi Karl,

these are the values:
Priority: 5
Start method: Start at beginning of schedule window
Schedule type: Scan every document once
Minimum recrawl interval: Not applicable
Expiration interval: Not applicable
Reseed interval: Not applicable
Scheduled time: Any day of week at 12 am 1 am 2 am 3 am 4 am 5 am 6 am 7 am 8 am 9 am 10 am 11 am 12 pm 1 pm 2 pm 3 pm 4 pm 5 pm 6 pm 7 pm 8 pm 9 pm 10 pm 11 pm
Maximum run time: No limit
Job invocation: Complete

Maybe it is because I've changed the job from continuous crawling to this schedule. I started it a few times manually, too. I couldn't notice anything strange in the job setup or in the respective entries in the database.

Regards,
Florian

> Hi Florian,
>
> I was unable to reproduce the behavior you described.
>
> Could you view your job, and post a screen shot of that page? I want to
> see what your schedule record(s) look like.
>
> Thanks,
> Karl
>
> On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright wrote:
>
>> Hi Florian,
>>
>> I've never noted this behavior before. I'll see if I can reproduce it
>> here.
>>
>> Karl
>>
>> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <
>> schmeddi@informatik.uni-freiburg.de> wrote:
>>
>>> Hi Karl,
>>>
>>> the scheduled job seems to work as expected. However, it runs two
>>> times: It starts at the beginning of the scheduled time, finishes, and
>>> immediately starts again. After finishing the second run it waits for
>>> the next scheduled time. Why does it run two times? The start method is
>>> "Start at beginning of schedule window".
>>>
>>> Yes, you're right about the checking guarantee. Currently, our interval
>>> is long enough for a complete crawler run.
>>>>>> >>> >> >> >>> >>>>>> >>> >> >> >>> Best, >>>>>> >>> >> >> >>> Florian >>>>>> >>> >> >> >>> >>>>>> >>> >> >> >>> >>>>>> >>> >> >> >>> > Hi Florian, >>>>>> >>> >> >> >>> > >>>>>> >>> >> >> >>> > It is impossible to *guarantee* that a document will >>>>>> be >>>>>> >>> >> checked, >>>>>> >>> >> >> >>> because >>>>>> >>> >> >> >>> > if >>>>>> >>> >> >> >>> > load on the crawler is high enough, it will fall >>>>>> behind. >>>>>> >>> But >>>>>> >>> I >>>>>> >>> >> >> will >>>>>> >>> >> >> >>> look >>>>>> >>> >> >> >>> > into adding the feature you request. >>>>>> >>> >> >> >>> > >>>>>> >>> >> >> >>> > Karl >>>>>> >>> >> >> >>> > >>>>>> >>> >> >> >>> > >>>>>> >>> >> >> >>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding < >>>>>> >>> >> >> >>> > schmeddi@informatik.uni-freiburg.de> wrote: >>>>>> >>> >> >> >>> > >>>>>> >>> >> >> >>> >> Hi Karl, >>>>>> >>> >> >> >>> >> >>>>>> >>> >> >> >>> >> yes, in our case it is necessary to make sure that >>>>>> new >>>>>> >>> >> documents >>>>>> >>> >> >> are >>>>>> >>> >> >> >>> >> discovered and indexed within a certain interval. I >>>>>> have >>>>>> >>> >> created >>>>>> >>> >> >> a >>>>>> >>> >> >> >>> >> feature >>>>>> >>> >> >> >>> >> request on that. In the meantime we will try to use a >>>>>> >>> >> scheduled >>>>>> >>> >> >> job >>>>>> >>> >> >> >>> >> instead. >>>>>> >>> >> >> >>> >> >>>>>> >>> >> >> >>> >> Thanks for your help, >>>>>> >>> >> >> >>> >> Florian >>>>>> >>> >> >> >>> >> >>>>>> >>> >> >> >>> >> >>>>>> >>> >> >> >>> >> > Hi Florian, >>>>>> >>> >> >> >>> >> > >>>>>> >>> >> >> >>> >> > What you are seeing is "dynamic crawling" >>>>>> behavior. The >>>>>> >>> >> time >>>>>> >>> >> >> >>> between >>>>>> >>> >> >> >>> >> > refetches of a document is based on the history of >>>>>> >>> fetches >>>>>> >>> >> of >>>>>> >>> >> >> that >>>>>> >>> >> >> >>> >> > document. The recrawl interval is the initial time >>>>>> >>> between >>>>>> >>> >> >> >>> document >>>>>> >>> >> >> >>> >> > fetches, but if a document does not change, the >>>>>> interval >>>>>> >>> for >>>>>> >>> >> >> the >>>>>> >>> >> >> >>> >> document >>>>>> >>> >> >> >>> >> > increases according to a formula. >>>>>> >>> >> >> >>> >> > >>>>>> >>> >> >> >>> >> > I would need to look at the code to be able to >>>>>> give you >>>>>> >>> the >>>>>> >>> >> >> >>> precise >>>>>> >>> >> >> >>> >> > formula, but if you need a limit on the amount of >>>>>> time >>>>>> >>> >> between >>>>>> >>> >> >> >>> >> document >>>>>> >>> >> >> >>> >> > fetch attempts, I suggest you create a ticket and >>>>>> I will >>>>>> >>> >> look >>>>>> >>> >> >> into >>>>>> >>> >> >> >>> >> adding >>>>>> >>> >> >> >>> >> > that as a feature. >>>>>> >>> >> >> >>> >> > >>>>>> >>> >> >> >>> >> > Thanks, >>>>>> >>> >> >> >>> >> > Karl >>>>>> >>> >> >> >>> >> > >>>>>> >>> >> >> >>> >> > >>>>>> >>> >> >> >>> >> > >>>>>> >>> >> >> >>> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding >>>>>> < >>>>>> >>> >> >> >>> >> > schmeddi@informatik.uni-freiburg.de> wrote: >>>>>> >>> >> >> >>> >> > >>>>>> >>> >> >> >>> >> >> Hello, >>>>>> >>> >> >> >>> >> >> >>>>>> >>> >> >> >>> >> >> the parameters reseed interval and recrawl >>>>>> interval of >>>>>> >>> a >>>>>> >>> >> >> >>> continuous >>>>>> >>> >> >> >>> >> >> crawling job are not quite clear to me. 
The documentation tells that the reseed interval is the time after which the seeds are checked again, and the recrawl interval is the time after which a document is checked for changes.

However, we observed that the recrawl interval for a document increases after each check. On the other hand, the reseed interval seems to be set up correctly in the database metadata about the seed documents. Yet the web server does not receive requests at each time the interval elapses but only after several intervals have elapsed.

We are using a web connector. The web server does not tell the client to cache the documents. Any help would be appreciated.

Best regards,
Florian
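
To make the dynamic-crawling behaviour Karl describes above more concrete, here is a minimal sketch of an interval that grows while a document keeps coming back unchanged. The doubling factor, the cap, and all names are illustrative assumptions; the precise ManifoldCF formula is not quoted anywhere in this thread.

// Illustrative sketch only -- not ManifoldCF's actual recrawl formula.
// The growth factor and cap below are assumptions made for the example.
public class RecrawlIntervalSketch {
  public static void main(String[] args) {
    long intervalMs = 60L * 60L * 1000L;          // configured recrawl interval: 1 hour
    final long capMs = 24L * 60L * 60L * 1000L;   // assumed upper bound: 1 day
    final double growth = 2.0;                    // assumed growth per unchanged fetch

    // Each time the document is fetched and found unchanged, the next fetch is
    // pushed further out, so the web server sees requests less and less often
    // even though the configured interval never changes.
    for (int fetch = 1; fetch <= 6; fetch++) {
      System.out.printf("fetch %d: next check in %d minutes%n", fetch, intervalMs / 60000L);
      intervalMs = Math.min(capMs, (long) (intervalMs * growth));
    }
  }
}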
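
The restart behaviour discussed earlier in the thread comes down to the bookkeeping Karl describes: the job's "last checked" time is advanced immediately when the start check answers "no", but only in the completion step when a job was actually started. The sketch below is illustrative only; the names, fields, and window check are assumptions rather than the actual JobManager code. It shows why a job that never reaches the completed transition keeps matching the same schedule window and is started again.

// Minimal illustrative sketch of the bookkeeping described above -- not the
// actual ManifoldCF JobManager code; names and the window check are assumed.
public class JobStartCheckSketch {

  static final long WINDOW_MS = 60L * 60L * 1000L; // assume a one-hour schedule window

  long lastCheckedTime;  // persisted on the job record in the real system
  boolean jobRunning;

  // Periodic check, as performed by the "Job start thread" in the logs above.
  void checkJob(long now) {
    if (jobRunning) {
      return;                       // a running job is left alone
    }
    if (windowMatches(lastCheckedTime, now)) {
      jobRunning = true;            // start the job...
      // ...but deliberately do NOT advance lastCheckedTime here; per the
      // discussion it is only advanced when the job completes normally.
    } else {
      lastCheckedTime = now;        // a "no" answer updates the record immediately
    }
  }

  // Completion step (cf. JobManager.finishJob()). If the job instead drops out
  // of the running state without reaching this step -- the failure mode being
  // debugged in this thread -- the old "last checked" time keeps matching the
  // window and checkJob() starts the job again.
  void finishJob(long now) {
    jobRunning = false;
    lastCheckedTime = now;
  }

  boolean windowMatches(long from, long to) {
    // Illustrative check: did an hour boundary fall between the two times?
    return to / WINDOW_MS > from / WINDOW_MS;
  }
}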