manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Continuous crawling
Date Tue, 04 Feb 2014 19:14:49 GMT
The code path for an abort sequence looks pretty iron-clad.  The bad-case
output:

>>>>>>
DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job 1385573203052
for shutdown
DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job
1385573203052 in need of notification
<<<<<<

does not include:

>>>>>>
DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now
completed
<<<<<<

This absence is very significant, because JobManager.finishJob() is the
method where the last-check time would normally be updated.  If an abort
took place, it would have started BEFORE all this; once the job state
gets set to STATUS_SHUTTINGDOWN, there is no way that the job can be
aborted, either manually or by repository-connector activity.  At that
point the job is cleaning up documents that are no longer reachable.  I
will check what happens if the output connector throws an exception
during this phase; it's the only thing I can think of that might
potentially derail the job from finishing.
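In outline, that state guard behaves like the following sketch (hypothetical, not actual ManifoldCF source; the enum and method names are illustrative): once a job has moved to a shutting-down state, abort requests are ignored, so only a failure during cleanup could keep it from finishing.

```java
// Hypothetical sketch of the guard described above: abort requests are
// honored only while the job is still active; after shutdown begins,
// they are rejected.
public class JobStateGuard {
    enum State { ACTIVE, SHUTTINGDOWN, ABORTED, COMPLETED }

    private State state = State.ACTIVE;

    /** Abort is honored only while the job is still active. */
    public boolean requestAbort() {
        if (state != State.ACTIVE)
            return false;          // too late: cleanup has already begun
        state = State.ABORTED;
        return true;
    }

    public void beginShutdown() {
        if (state == State.ACTIVE)
            state = State.SHUTTINGDOWN;
    }

    public State getState() { return state; }

    public static void main(String[] args) {
        JobStateGuard job = new JobStateGuard();
        job.beginShutdown();
        // An abort arriving after shutdown began is rejected:
        System.out.println(job.requestAbort());   // prints false
        System.out.println(job.getState());       // prints SHUTTINGDOWN
    }
}
```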

Karl



On Tue, Feb 4, 2014 at 1:29 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Florian,
>
> The only way this can happen is if the proper job termination state
> sequence does not take place.  When MCF checks to see if a job should be
> started, if it determines that the answer is "no" it updates the job record
> immediately with a new "last checked" value.  But if it starts the job, it
> waits for the job completion to take place before updating the job's "last
> checked" time.  When a job aborts, at first glance it looks like it also
> does the right thing, but clearly that's not true, and there must be a bug
> somewhere in how this condition is handled.
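The update rule described in the quoted paragraph could be sketched like this (an illustrative helper, not ManifoldCF's actual JobManager API): a "no" answer advances "last checked" immediately, while a "yes" defers the update until the job completes.

```java
// Hypothetical sketch of the "last checked" rule: declining to start a
// job records the check at once; starting it waits for normal completion
// (the finishJob() analogue) before the timestamp moves.
public class LastCheckedRule {
    private long lastChecked;

    public LastCheckedRule(long initial) { lastChecked = initial; }

    /** Decide whether a scheduled start falls in (lastChecked, now]. */
    public boolean shouldStart(long now, long scheduledStart) {
        boolean match = scheduledStart > lastChecked && scheduledStart <= now;
        if (!match)
            lastChecked = now;   // answer is "no": record the check at once
        return match;            // answer is "yes": wait for completion
    }

    /** The finishJob() analogue: update only on normal completion. */
    public void onJobCompleted(long completionTime) {
        lastChecked = completionTime;
    }

    public long getLastChecked() { return lastChecked; }

    public static void main(String[] args) {
        LastCheckedRule rule = new LastCheckedRule(100);
        System.out.println(rule.shouldStart(150, 90));   // false: no match
        System.out.println(rule.getLastChecked());       // 150, updated now
        System.out.println(rule.shouldStart(250, 200));  // true: job starts
        System.out.println(rule.getLastChecked());       // still 150
        rule.onJobCompleted(300);
        System.out.println(rule.getLastChecked());       // 300
    }
}
```

If the completion step is skipped (the bug being discussed), `lastChecked` stays frozen and every later check keeps matching.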
>
> I'll create a ticket to research this. In the interim, I suggest you
> figure out why your job is aborting in the first place.
>
> Thanks,
> Karl
>
>
> On Tue, Feb 4, 2014 at 11:49 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Florian,
>>
>> I do not expect errors to appear in the tomcat log.
>>
>> But this is interesting:
>>
>> Good:
>>
>> >>>>>>
>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job
>> 1385573203052 needs to be started; it was last checked at 1391439592120,
>> and now it is 1391439602151
>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) -  Time match FOUND
>> within interval 1391439592120 to 1391439602151
>>  ...
>>
>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job
>> 1385573203052 needs to be started; it was last checked at 1391440412615,
>> and now it is 1391440427102
>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) -  No time match found
>> within interval 1391440412615 to 1391440427102
>> <<<<<<
>> "last checked" time for job is updated.
>>
>> Bad:
>>
>> >>>>>>
>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job
>> 1385573203052 needs to be started; it was last checked at 1391446794075,
>> and now it is 1391446804106
>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) -  Time match FOUND
>> within interval 1391446794075 to 1391446804106
>>  ...
>>
>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job
>> 1385573203052 needs to be started; it was last checked at 1391446794075,
>> and now it is 1391447647733
>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) -  Time match FOUND
>> within interval 1391446794075 to 1391447647733
>> <<<<<<
>> Note that the "last checked" time is NOT updated.
>>
>> I don't understand why, in one case, the "last checked" time is being
>> updated for the job, and is not in another case.  I will look to see if
>> there is any way in the code that this can happen.
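The symptom can be made concrete with a small demonstration (hypothetical check logic, but the timestamps are the ones from the "Bad" log above): the 18:00 schedule point stays inside every later interval because the lower bound never advances past it, so the job keeps restarting.

```java
// Why a stale "last checked" value keeps re-triggering starts: the
// schedule point remains inside (lastChecked, now] until lastChecked
// finally moves past it.
public class StaleCheckDemo {
    static boolean timeMatch(long lastChecked, long now, long schedulePoint) {
        return schedulePoint > lastChecked && schedulePoint <= now;
    }

    public static void main(String[] args) {
        long schedule = 1391446800000L;    // 18:00 window start from the log
        long stale = 1391446794075L;       // lower bound that never advances

        System.out.println(timeMatch(stale, 1391446804106L, schedule)); // true at 18:00:04
        System.out.println(timeMatch(stale, 1391447647733L, schedule)); // true again at 18:14:07
        // Once "last checked" moves past the schedule point, matching stops:
        System.out.println(timeMatch(1391447647733L, 1391448479353L, schedule)); // false
    }
}
```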
>>
>> Karl
>>
>>
>>
>> On Tue, Feb 4, 2014 at 10:45 AM, Florian Schmedding <
>> schmeddi@informatik.uni-freiburg.de> wrote:
>>
>>> Hi Karl,
>>>
>>> there are no errors in the Tomcat logs. Currently, the Manifold log
>>> contains only the job log messages (<property
>>> name="org.apache.manifoldcf.jobs" value="ALL"/>). I include two log
>>> snippets, one from a normal run and one where the job was repeated two
>>> times. I noticed the thread sequence "Finisher - Job reset - Job
>>> notification" when the job finally terminates, and the thread sequence
>>> "Finisher - Job notification" when the job gets restarted again
>>> instead of terminating.
>>>
>>>
>>> DEBUG 2014-02-03 15:59:52,130 (Job start thread) - Checking if job
>>> 1385573203052 needs to be started; it was last checked at 1391439582108,
>>> and now it is 1391439592119
>>> DEBUG 2014-02-03 15:59:52,131 (Job start thread) -  No time match found
>>> within interval 1391439582108 to 1391439592119
>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job
>>> 1385573203052 needs to be started; it was last checked at 1391439592120,
>>> and now it is 1391439602151
>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) -  Time match FOUND
>>> within interval 1391439592120 to 1391439602151
>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Job '1385573203052' is
>>> within run window at 1391439602151 ms. (which starts at 1391439600000
>>> ms.)
>>> DEBUG 2014-02-03 16:00:02,288 (Job start thread) - Signalled for job
>>> start
>>> for job 1385573203052
>>> DEBUG 2014-02-03 16:00:11,319 (Startup thread) - Marked job 1385573203052
>>> for startup
>>> DEBUG 2014-02-03 16:00:12,719 (Startup thread) - Job 1385573203052 is now
>>> started
>>> DEBUG 2014-02-03 16:13:30,234 (Finisher thread) - Marked job
>>> 1385573203052
>>> for shutdown
>>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now
>>> completed
>>> DEBUG 2014-02-03 16:13:37,541 (Job notification thread) - Found job
>>> 1385573203052 in need of notification
>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job
>>> 1385573203052 needs to be started; it was last checked at 1391440412615,
>>> and now it is 1391440427102
>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) -  No time match found
>>> within interval 1391440412615 to 1391440427102
>>>
>>>
>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) - Checking if job
>>> 1385573203052 needs to be started; it was last checked at 1391446784053,
>>> and now it is 1391446794074
>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) -  No time match found
>>> within interval 1391446784053 to 1391446794074
>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job
>>> 1385573203052 needs to be started; it was last checked at 1391446794075,
>>> and now it is 1391446804106
>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) -  Time match FOUND
>>> within interval 1391446794075 to 1391446804106
>>> DEBUG 2014-02-03 18:00:04,110 (Job start thread) - Job '1385573203052' is
>>> within run window at 1391446804106 ms. (which starts at 1391446800000
>>> ms.)
>>> DEBUG 2014-02-03 18:00:04,178 (Job start thread) - Signalled for job
>>> start
>>> for job 1385573203052
>>> DEBUG 2014-02-03 18:00:11,710 (Startup thread) - Marked job 1385573203052
>>> for startup
>>> DEBUG 2014-02-03 18:00:13,408 (Startup thread) - Job 1385573203052 is now
>>> started
>>> DEBUG 2014-02-03 18:14:04,286 (Finisher thread) - Marked job
>>> 1385573203052
>>> for shutdown
>>> DEBUG 2014-02-03 18:14:06,777 (Job notification thread) - Found job
>>> 1385573203052 in need of notification
>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job
>>> 1385573203052 needs to be started; it was last checked at 1391446794075,
>>> and now it is 1391447647733
>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) -  Time match FOUND
>>> within interval 1391446794075 to 1391447647733
>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Job '1385573203052' is
>>> within run window at 1391447647733 ms. (which starts at 1391446800000
>>> ms.)
>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Checking if job
>>> 1385573203052 needs to be started; it was last checked at 1391446794075,
>>> and now it is 1391447657740
>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) -  Time match FOUND
>>> within interval 1391446794075 to 1391447657740
>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Job '1385573203052' is
>>> within run window at 1391447657740 ms. (which starts at 1391446800000
>>> ms.)
>>> DEBUG 2014-02-03 18:14:17,899 (Job start thread) - Signalled for job
>>> start
>>> for job 1385573203052
>>> DEBUG 2014-02-03 18:14:26,787 (Startup thread) - Marked job 1385573203052
>>> for startup
>>> DEBUG 2014-02-03 18:14:28,636 (Startup thread) - Job 1385573203052 is now
>>> started
>>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job
>>> 1385573203052
>>> for shutdown
>>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job
>>> 1385573203052 in need of notification
>>> DEBUG 2014-02-03 18:27:59,356 (Job start thread) - Checking if job
>>> 1385573203052 needs to be started; it was last checked at 1391446794075,
>>> and now it is 1391448479353
>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) -  Time match FOUND
>>> within interval 1391446794075 to 1391448479353
>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) - Job '1385573203052' is
>>> within run window at 1391448479353 ms. (which starts at 1391446800000
>>> ms.)
>>> DEBUG 2014-02-03 18:27:59,430 (Job start thread) - Signalled for job
>>> start
>>> for job 1385573203052
>>> DEBUG 2014-02-03 18:28:09,309 (Startup thread) - Marked job 1385573203052
>>> for startup
>>> DEBUG 2014-02-03 18:28:10,727 (Startup thread) - Job 1385573203052 is now
>>> started
>>> DEBUG 2014-02-03 18:41:18,202 (Finisher thread) - Marked job
>>> 1385573203052
>>> for shutdown
>>> DEBUG 2014-02-03 18:41:23,636 (Job reset thread) - Job 1385573203052 now
>>> completed
>>> DEBUG 2014-02-03 18:41:25,368 (Job notification thread) - Found job
>>> 1385573203052 in need of notification
>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) - Checking if job
>>> 1385573203052 needs to be started; it was last checked at 1391449283114,
>>> and now it is 1391449292400
>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) -  No time match found
>>> within interval 1391449283114 to 1391449292400
>>>
>>>
>>> Do you need another log output?
>>>
>>> Best,
>>> Florian
>>>
>>> > Also, what does the log have to say?  If there is an error aborting the
>>> > job, there should be some record of it in the manifoldcf.log.
>>> >
>>> > Thanks,
>>> > Karl
>>> >
>>> >
>>> > On Tue, Feb 4, 2014 at 6:16 AM, Karl Wright <daddywri@gmail.com>
>>> > wrote:
>>> >
>>> >> Hi Florian,
>>> >>
>>> >> Please run the job manually, when outside the scheduling window or
>>> >> with the scheduling off.  What is the reason for the job abort?
>>> >>
>>> >> Karl
>>> >>
>>> >>
>>> >>
>>> >> On Tue, Feb 4, 2014 at 3:30 AM, Florian Schmedding <
>>> >> schmeddi@informatik.uni-freiburg.de> wrote:
>>> >>
>>> >>> Hi Karl,
>>> >>>
>>> >>> yes, I've coincidentally seen "Aborted" in the end time column when I
>>> >>> refreshed the job status just after the number of active documents
>>> >>> was zero. At the next refresh the job was starting up. After looking
>>> >>> in the history I found out that it even started a third time. You can
>>> >>> see the history of a single day below (job continue, end, start,
>>> >>> stop, unwait, wait). The start method is "Start at beginning of
>>> >>> schedule window". Job invocation is "complete". Hop count mode is
>>> >>> "Delete unreachable documents".
>>> >>>
>>> >>> 02.03.2014 18:41        job end
>>> >>> 02.03.2014 18:28        job start
>>> >>> 02.03.2014 18:14        job start
>>> >>> 02.03.2014 18:00        job start
>>> >>> 02.03.2014 17:49        job end
>>> >>> 02.03.2014 17:27        job end
>>> >>> 02.03.2014 17:13        job start
>>> >>> 02.03.2014 17:00        job start
>>> >>> 02.03.2014 16:13        job end
>>> >>> 02.03.2014 16:00        job start
>>> >>> 02.03.2014 15:41        job end
>>> >>> 02.03.2014 15:27        job start
>>> >>> 02.03.2014 15:14        job start
>>> >>> 02.03.2014 15:00        job start
>>> >>> 02.03.2014 14:13        job end
>>> >>> 02.03.2014 14:00        job start
>>> >>> 02.03.2014 13:13        job end
>>> >>> 02.03.2014 13:00        job start
>>> >>> 02.03.2014 12:27        job end
>>> >>> 02.03.2014 12:14        job start
>>> >>> 02.03.2014 12:00        job start
>>> >>> 02.03.2014 11:13        job end
>>> >>> 02.03.2014 11:00        job start
>>> >>> 02.03.2014 10:13        job end
>>> >>> 02.03.2014 10:00        job start
>>> >>> 02.03.2014 09:29        job end
>>> >>> 02.03.2014 09:14        job start
>>> >>> 02.03.2014 09:00        job start
>>> >>>
>>> >>> Best,
>>> >>> Florian
>>> >>>
>>> >>>
>>> >>> > Hi Florian,
>>> >>> >
>>> >>> > Jobs don't just abort randomly.  Are you sure that the job aborted?
>>> >>> > Or did it just restart?
>>> >>> >
>>> >>> > As for "is this normal", it depends on how you have created your
>>> >>> > job.  If you selected the "Start within schedule window" selection,
>>> >>> > MCF will restart the job whenever it finishes and run it until the
>>> >>> > end of the scheduling window.
>>> >>> >
>>> >>> > Karl
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Mon, Feb 3, 2014 at 12:24 PM, Florian Schmedding <
>>> >>> > schmeddi@informatik.uni-freiburg.de> wrote:
>>> >>> >
>>> >>> >> Hi Karl,
>>> >>> >>
>>> >>> >> I've just observed that the job was started according to its
>>> >>> >> schedule and crawled all documents correctly (I've chosen to
>>> >>> >> re-ingest all documents before the run). However, after finishing
>>> >>> >> the last document (zero active documents) it was somehow aborted
>>> >>> >> and restarted immediately. Is this expected behavior?
>>> >>> >>
>>> >>> >> Best,
>>> >>> >> Florian
>>> >>> >>
>>> >>> >>
>>> >>> >> > Hi Florian,
>>> >>> >> >
>>> >>> >> > Based on this schedule, your crawls will be able to start
>>> >>> >> > whenever the hour turns.  So they can start every hour on the
>>> >>> >> > hour.  If the last crawl crossed an hour boundary, the next
>>> >>> >> > crawl will start immediately, I believe.
>>> >>> >> >
>>> >>> >> > Karl
>>> >>> >> >
>>> >>> >> >
>>> >>> >> >
>>> >>> >> > On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <
>>> >>> >> > schmeddi@informatik.uni-freiburg.de> wrote:
>>> >>> >> >
>>> >>> >> >> Hi Karl,
>>> >>> >> >>
>>> >>> >> >> these are the values:
>>> >>> >> >> Priority: 5
>>> >>> >> >> Start method: Start at beginning of schedule window
>>> >>> >> >> Schedule type: Scan every document once
>>> >>> >> >> Minimum recrawl interval: Not applicable
>>> >>> >> >> Expiration interval: Not applicable
>>> >>> >> >> Reseed interval: Not applicable
>>> >>> >> >> Scheduled time: Any day of week at 12 am 1 am 2 am 3 am 4 am
>>> >>> >> >> 5 am 6 am 7 am 8 am 9 am 10 am 11 am 12 pm 1 pm 2 pm 3 pm 4 pm
>>> >>> >> >> 5 pm 6 pm 7 pm 8 pm 9 pm 10 pm 11 pm
>>> >>> >> >> Maximum run time: No limit
>>> >>> >> >> Job invocation: Complete
>>> >>> >> >>
>>> >>> >> >> Maybe it is because I've changed the job from continuous
>>> >>> >> >> crawling to this schedule. I started it a few times manually,
>>> >>> >> >> too. I couldn't notice anything strange in the job setup or in
>>> >>> >> >> the respective entries in the database.
>>> >>> >> >>
>>> >>> >> >> Regards,
>>> >>> >> >> Florian
>>> >>> >> >>
>>> >>> >> >> > Hi Florian,
>>> >>> >> >> >
>>> >>> >> >> > I was unable to reproduce the behavior you described.
>>> >>> >> >> >
>>> >>> >> >> > Could you view your job, and post a screen shot of that page?
>>> >>> >> >> > I want to see what your schedule record(s) look like.
>>> >>> >> >> >
>>> >>> >> >> > Thanks,
>>> >>> >> >> > Karl
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> > On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright
>>> >>> >> >> > <daddywri@gmail.com> wrote:
>>> >>> >> >> >
>>> >>> >> >> >> Hi Florian,
>>> >>> >> >> >>
>>> >>> >> >> >> I've never noted this behavior before.  I'll see if I can
>>> >>> >> >> >> reproduce it here.
>>> >>> >> >> >>
>>> >>> >> >> >> Karl
>>> >>> >> >> >>
>>> >>> >> >> >>
>>> >>> >> >> >>
>>> >>> >> >> >> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <
>>> >>> >> >> >> schmeddi@informatik.uni-freiburg.de> wrote:
>>> >>> >> >> >>
>>> >>> >> >> >>> Hi Karl,
>>> >>> >> >> >>>
>>> >>> >> >> >>> the scheduled job seems to work as expected. However, it
>>> >>> >> >> >>> runs two times: It starts at the beginning of the scheduled
>>> >>> >> >> >>> time, finishes, and immediately starts again. After
>>> >>> >> >> >>> finishing the second run it waits for the next scheduled
>>> >>> >> >> >>> time. Why does it run two times? The start method is "Start
>>> >>> >> >> >>> at beginning of schedule window".
>>> >>> >> >> >>>
>>> >>> >> >> >>> Yes, you're right about the checking guarantee. Currently,
>>> >>> >> >> >>> our interval is long enough for a complete crawler run.
>>> >>> >> >> >>>
>>> >>> >> >> >>> Best,
>>> >>> >> >> >>> Florian
>>> >>> >> >> >>>
>>> >>> >> >> >>>
>>> >>> >> >> >>> > Hi Florian,
>>> >>> >> >> >>> >
>>> >>> >> >> >>> > It is impossible to *guarantee* that a document will be
>>> >>> >> >> >>> > checked, because if load on the crawler is high enough,
>>> >>> >> >> >>> > it will fall behind.  But I will look into adding the
>>> >>> >> >> >>> > feature you request.
>>> >>> >> >> >>> >
>>> >>> >> >> >>> > Karl
>>> >>> >> >> >>> >
>>> >>> >> >> >>> >
>>> >>> >> >> >>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <
>>> >>> >> >> >>> > schmeddi@informatik.uni-freiburg.de> wrote:
>>> >>> >> >> >>> >
>>> >>> >> >> >>> >> Hi Karl,
>>> >>> >> >> >>> >>
>>> >>> >> >> >>> >> yes, in our case it is necessary to make sure that new
>>> >>> >> >> >>> >> documents are discovered and indexed within a certain
>>> >>> >> >> >>> >> interval. I have created a feature request on that. In
>>> >>> >> >> >>> >> the meantime we will try to use a scheduled job instead.
>>> >>> >> >> >>> >>
>>> >>> >> >> >>> >> Thanks for your help,
>>> >>> >> >> >>> >> Florian
>>> >>> >> >> >>> >>
>>> >>> >> >> >>> >>
>>> >>> >> >> >>> >> > Hi Florian,
>>> >>> >> >> >>> >> >
>>> >>> >> >> >>> >> > What you are seeing is "dynamic crawling" behavior.
>>> >>> >> >> >>> >> > The time between refetches of a document is based on
>>> >>> >> >> >>> >> > the history of fetches of that document.  The recrawl
>>> >>> >> >> >>> >> > interval is the initial time between document fetches,
>>> >>> >> >> >>> >> > but if a document does not change, the interval for
>>> >>> >> >> >>> >> > the document increases according to a formula.
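The interval growth described above could be sketched like this (a hypothetical illustration only; the actual ManifoldCF formula is not stated in this thread, and the doubling, cap, and base interval here are assumptions): each fetch that finds the document unchanged lengthens the time to the next fetch.

```java
// Hypothetical adaptive-recrawl backoff: the interval doubles each time
// the document is found unchanged, up to a cap; a detected change keeps
// the current interval.
public class AdaptiveRecrawl {
    static long nextInterval(long current, boolean changed, long cap) {
        if (changed)
            return current;                  // change seen: keep base interval
        return Math.min(current * 2, cap);   // unchanged: back off further
    }

    public static void main(String[] args) {
        long interval = 60_000L;             // assumed 1-minute base interval
        for (int i = 0; i < 4; i++) {
            interval = nextInterval(interval, false, 3_600_000L);
            System.out.println(interval);    // 120000, 240000, 480000, 960000
        }
    }
}
```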
>>> >>> >> >> >>> >> >
>>> >>> >> >> >>> >> > I would need to look at the code to be able to give
>>> >>> >> >> >>> >> > you the precise formula, but if you need a limit on
>>> >>> >> >> >>> >> > the amount of time between document fetch attempts, I
>>> >>> >> >> >>> >> > suggest you create a ticket and I will look into
>>> >>> >> >> >>> >> > adding that as a feature.
>>> >>> >> >> >>> >> >
>>> >>> >> >> >>> >> > Thanks,
>>> >>> >> >> >>> >> > Karl
>>> >>> >> >> >>> >> >
>>> >>> >> >> >>> >> >
>>> >>> >> >> >>> >> >
>>> >>> >> >> >>> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <
>>> >>> >> >> >>> >> > schmeddi@informatik.uni-freiburg.de> wrote:
>>> >>> >> >> >>> >> >
>>> >>> >> >> >>> >> >> Hello,
>>> >>> >> >> >>> >> >>
>>> >>> >> >> >>> >> >> the parameters reseed interval and recrawl interval
>>> >>> >> >> >>> >> >> of a continuous crawling job are not quite clear to
>>> >>> >> >> >>> >> >> me. The documentation says that the reseed interval
>>> >>> >> >> >>> >> >> is the time after which the seeds are checked again,
>>> >>> >> >> >>> >> >> and the recrawl interval is the time after which a
>>> >>> >> >> >>> >> >> document is checked for changes.
>>> >>> >> >> >>> >> >>
>>> >>> >> >> >>> >> >> However, we observed that the recrawl interval for a
>>> >>> >> >> >>> >> >> document increases after each check. On the other
>>> >>> >> >> >>> >> >> hand, the reseed interval seems to be set up
>>> >>> >> >> >>> >> >> correctly in the database metadata about the seed
>>> >>> >> >> >>> >> >> documents. Yet the web server does not receive
>>> >>> >> >> >>> >> >> requests each time the interval elapses but only
>>> >>> >> >> >>> >> >> after several intervals have elapsed.
>>> >>> >> >> >>> >> >>
>>> >>> >> >> >>> >> >> We are using a web connector. The web server does
>>> >>> >> >> >>> >> >> not tell the client to cache the documents. Any help
>>> >>> >> >> >>> >> >> would be appreciated.
>>> >>> >> >> >>> >> >>
>>> >>> >> >> >>> >> >> Best regards,
>>> >>> >> >> >>> >> >> Florian
>>> >>> >> >> >>> >> >>
>>> >>> >> >> >>> >> >>
>>> >>> >> >> >>> >> >>
>>> >>> >> >> >>> >> >>
>>> >>> >> >> >>> >> >
>>> >>> >> >> >>> >>
>>> >>> >> >> >>> >>
>>> >>> >> >> >>> >>
>>> >>> >> >> >>> >
>>> >>> >> >> >>>
>>> >>> >> >> >>>
>>> >>> >> >> >>>
>>> >>> >> >> >>
>>> >>> >> >> >
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >
>>> >>>
>>> >>>
>>> >>>
>>> >>
>>> >
>>>
>>>
>>>
>>
>
