manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Continuous crawling
Date Wed, 05 Feb 2014 13:40:52 GMT
Any luck with this?
Karl


On Tue, Feb 4, 2014 at 4:15 PM, Karl Wright <daddywri@gmail.com> wrote:

> I've created a branch at:
> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-880 .
> This contains my proposed fix; please try it out.  If you would like, I can
> also attach a patch, although I'm not certain it would apply properly onto
> MCF 1.4.1 sources.
>
> Karl
>
>
>
> On Tue, Feb 4, 2014 at 2:37 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Florian,
>>
>> I'm pretty sure now that what is happening is that your output connector
>> is throwing some kind of exception when it is asked to remove documents
>> during the cleanup phase of the crawl.  The state transitions in the
>> framework seem to be incorrect under these conditions, and the error is
>> likely not logged into the job's error field.  The ticket I've created to
>> address this is CONNECTORS-880.
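The failure mode described above can be pictured with a small sketch. The interface and method names here are hypothetical, not the actual MCF framework API: the point is only that if the connector throws while unreachable documents are being removed, the job never reaches its completion transition.

```java
public class CleanupSketch {
    // Hypothetical stand-in for an output connector's removal hook.
    interface OutputConnector { void removeDocument(String uri) throws Exception; }

    // Returns true only if cleanup ran to completion; a connector exception
    // means the "job completed" transition (and the last-checked update that
    // goes with it) would never happen.
    static boolean cleanup(OutputConnector out, String[] unreachable) {
        try {
            for (String uri : unreachable)
                out.removeDocument(uri); // the connector may throw here
        } catch (Exception e) {
            return false; // the transition to "completed" is skipped
        }
        return true;
    }

    public static void main(String[] args) {
        String[] docs = {"http://example.com/gone"};
        System.out.println(cleanup(uri -> {}, docs));                               // true
        System.out.println(cleanup(uri -> { throw new Exception("boom"); }, docs)); // false
    }
}
```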
>>
>> Karl
>>
>>
>>
>> On Tue, Feb 4, 2014 at 2:14 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> The code path for an abort sequence looks pretty iron-clad.  The
>>> bad-case output:
>>>
>>>
>>> >>>>>>
>>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job
>>> 1385573203052
>>> for shutdown
>>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job
>>> 1385573203052 in need of notification
>>> <<<<<<
>>>
>>> does not include the line:
>>>
>>>
>>> >>>>>>
>>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now
>>> completed
>>> <<<<<<
>>>
>>> This is very significant, because it is in that method, JobManager.finishJob(),
>>> that the last-check time would typically be updated.  If an abort took place,
>>> it would have started BEFORE all this; once the job state gets set to
>>> STATUS_SHUTTINGDOWN, there is no way the job can be aborted, either manually
>>> or by repository-connector activity.  At that point the job is cleaning up
>>> documents that are no longer reachable.  I will check what happens if the
>>> output connector throws an exception during this phase; it's the only thing
>>> I can think of that might derail the job from finishing.
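A hedged sketch of the transitions just described (the status values and fields are illustrative, not the actual JobManager internals): once SHUTTINGDOWN is reached, aborts are refused, and only finishJob() advances the last-check time.

```java
public class JobStateSketch {
    enum Status { ACTIVE, SHUTTINGDOWN, INACTIVE }

    Status status = Status.ACTIVE;
    long lastCheckTime = 0L;

    // Abort requests are refused once the job is cleaning up unreachable
    // documents (SHUTTINGDOWN), matching the description above.
    boolean abort() {
        if (status == Status.SHUTTINGDOWN) return false;
        status = Status.INACTIVE;
        return true;
    }

    // Only normal completion advances the last-check time; if an exception
    // derails the job before this runs, the stale time keeps matching the
    // schedule and the job is started again.
    void finishJob(long now) {
        status = Status.INACTIVE;
        lastCheckTime = now;
    }

    public static void main(String[] args) {
        JobStateSketch job = new JobStateSketch();
        job.status = Status.SHUTTINGDOWN;
        System.out.println(job.abort());       // false: cannot abort during cleanup
        job.finishJob(1391440412615L);
        System.out.println(job.lastCheckTime); // updated only on completion
    }
}
```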
>>>
>>> Karl
>>>
>>>
>>>
>>> On Tue, Feb 4, 2014 at 1:29 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Florian,
>>>>
>>>> The only way this can happen is if the proper job termination state
>>>> sequence does not take place.  When MCF checks to see if a job should be
>>>> started, if it determines that the answer is "no" it updates the job record
>>>> immediately with a new "last checked" value.  But if it starts the job, it
>>>> waits for the job completion to take place before updating the job's "last
>>>> checked" time.  When a job aborts, at first glance it looks like it also
>>>> does the right thing, but clearly that's not true, and there must be a bug
>>>> somewhere in how this condition is handled.
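As a rough sketch of that check-and-update logic (the names here are hypothetical; the real code lives in JobManager):

```java
public class ScheduleCheck {
    // Returns the "last checked" value to store. When no start is needed the
    // check time advances immediately; when a start is signalled the update is
    // deferred until the job completes, per the description above.
    static long check(long lastChecked, long now, boolean timeMatch) {
        if (!timeMatch)
            return now;        // answer was "no": record the check immediately
        return lastChecked;    // job started: update happens at completion
        // If completion never runs, this stale value keeps matching the same
        // schedule window, and the job restarts -- the behavior in the logs.
    }

    public static void main(String[] args) {
        // Values from the "good" and "bad" log excerpts in this thread:
        System.out.println(check(1391440412615L, 1391440427102L, false)); // advances
        System.out.println(check(1391446794075L, 1391446804106L, true)); // stays stale
    }
}
```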
>>>>
>>>> I'll create a ticket to research this. In the interim, I suggest you
>>>> figure out why your job is aborting in the first place.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Tue, Feb 4, 2014 at 11:49 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> Hi Florian,
>>>>>
>>>>> I do not expect errors to appear in the tomcat log.
>>>>>
>>>>> But this is interesting:
>>>>>
>>>>> Good:
>>>>>
>>>>> >>>>>>
>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job
>>>>> 1385573203052 needs to be started; it was last checked at
>>>>> 1391439592120,
>>>>> and now it is 1391439602151
>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) -  Time match FOUND
>>>>> within interval 1391439592120 to 1391439602151
>>>>>  ...
>>>>>
>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job
>>>>> 1385573203052 needs to be started; it was last checked at
>>>>> 1391440412615,
>>>>> and now it is 1391440427102
>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) -  No time match found
>>>>> within interval 1391440412615 to 1391440427102
>>>>> <<<<<<
>>>>> "last checked" time for job is updated.
>>>>>
>>>>> Bad:
>>>>>
>>>>> >>>>>>
>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job
>>>>> 1385573203052 needs to be started; it was last checked at
>>>>> 1391446794075,
>>>>> and now it is 1391446804106
>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) -  Time match FOUND
>>>>> within interval 1391446794075 to 1391446804106
>>>>>  ...
>>>>>
>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job
>>>>> 1385573203052 needs to be started; it was last checked at
>>>>> 1391446794075,
>>>>> and now it is 1391447647733
>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) -  Time match FOUND
>>>>> within interval 1391446794075 to 1391447647733
>>>>> <<<<<<
>>>>> Note that the "last checked" time is NOT updated.
>>>>>
>>>>> I don't understand why, in one case, the "last checked" time is being
>>>>> updated for the job, and in another case it is not.  I will look to see if
>>>>> there is any way in the code that this can happen.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 4, 2014 at 10:45 AM, Florian Schmedding <
>>>>> schmeddi@informatik.uni-freiburg.de> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> there are no errors in the Tomcat logs. Currently, the Manifold log
>>>>>> contains only the job log messages (<property
>>>>>> name="org.apache.manifoldcf.jobs" value="ALL"/>). I include two log
>>>>>> snippets, one from a normal run, and one where the job was repeated two
>>>>>> times. I noticed the thread sequence "Finisher - Job reset - Job
>>>>>> notification" when the job finally terminates, and the thread sequence
>>>>>> "Finisher - Job notification" when the job gets restarted again
>>>>>> instead of
>>>>>> terminating.
>>>>>>
>>>>>>
>>>>>> DEBUG 2014-02-03 15:59:52,130 (Job start thread) - Checking if job
>>>>>> 1385573203052 needs to be started; it was last checked at
>>>>>> 1391439582108,
>>>>>> and now it is 1391439592119
>>>>>> DEBUG 2014-02-03 15:59:52,131 (Job start thread) -  No time match
>>>>>> found
>>>>>> within interval 1391439582108 to 1391439592119
>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job
>>>>>> 1385573203052 needs to be started; it was last checked at
>>>>>> 1391439592120,
>>>>>> and now it is 1391439602151
>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) -  Time match FOUND
>>>>>> within interval 1391439592120 to 1391439602151
>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Job
>>>>>> '1385573203052' is
>>>>>> within run window at 1391439602151 ms. (which starts at 1391439600000
>>>>>> ms.)
>>>>>> DEBUG 2014-02-03 16:00:02,288 (Job start thread) - Signalled for job start
>>>>>> for job 1385573203052
>>>>>> DEBUG 2014-02-03 16:00:11,319 (Startup thread) - Marked job
>>>>>> 1385573203052
>>>>>> for startup
>>>>>> DEBUG 2014-02-03 16:00:12,719 (Startup thread) - Job 1385573203052 is now
>>>>>> started
>>>>>> DEBUG 2014-02-03 16:13:30,234 (Finisher thread) - Marked job
>>>>>> 1385573203052
>>>>>> for shutdown
>>>>>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052
>>>>>> now
>>>>>> completed
>>>>>> DEBUG 2014-02-03 16:13:37,541 (Job notification thread) - Found job
>>>>>> 1385573203052 in need of notification
>>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job
>>>>>> 1385573203052 needs to be started; it was last checked at
>>>>>> 1391440412615,
>>>>>> and now it is 1391440427102
>>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) -  No time match
>>>>>> found
>>>>>> within interval 1391440412615 to 1391440427102
>>>>>>
>>>>>>
>>>>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) - Checking if job
>>>>>> 1385573203052 needs to be started; it was last checked at
>>>>>> 1391446784053,
>>>>>> and now it is 1391446794074
>>>>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) -  No time match
>>>>>> found
>>>>>> within interval 1391446784053 to 1391446794074
>>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job
>>>>>> 1385573203052 needs to be started; it was last checked at
>>>>>> 1391446794075,
>>>>>> and now it is 1391446804106
>>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) -  Time match FOUND
>>>>>> within interval 1391446794075 to 1391446804106
>>>>>> DEBUG 2014-02-03 18:00:04,110 (Job start thread) - Job
>>>>>> '1385573203052' is
>>>>>> within run window at 1391446804106 ms. (which starts at 1391446800000
>>>>>> ms.)
>>>>>> DEBUG 2014-02-03 18:00:04,178 (Job start thread) - Signalled for job start
>>>>>> for job 1385573203052
>>>>>> DEBUG 2014-02-03 18:00:11,710 (Startup thread) - Marked job
>>>>>> 1385573203052
>>>>>> for startup
>>>>>> DEBUG 2014-02-03 18:00:13,408 (Startup thread) - Job 1385573203052 is now
>>>>>> started
>>>>>> DEBUG 2014-02-03 18:14:04,286 (Finisher thread) - Marked job
>>>>>> 1385573203052
>>>>>> for shutdown
>>>>>> DEBUG 2014-02-03 18:14:06,777 (Job notification thread) - Found job
>>>>>> 1385573203052 in need of notification
>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job
>>>>>> 1385573203052 needs to be started; it was last checked at
>>>>>> 1391446794075,
>>>>>> and now it is 1391447647733
>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) -  Time match FOUND
>>>>>> within interval 1391446794075 to 1391447647733
>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Job
>>>>>> '1385573203052' is
>>>>>> within run window at 1391447647733 ms. (which starts at 1391446800000
>>>>>> ms.)
>>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Checking if job
>>>>>> 1385573203052 needs to be started; it was last checked at
>>>>>> 1391446794075,
>>>>>> and now it is 1391447657740
>>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) -  Time match FOUND
>>>>>> within interval 1391446794075 to 1391447657740
>>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Job
>>>>>> '1385573203052' is
>>>>>> within run window at 1391447657740 ms. (which starts at 1391446800000
>>>>>> ms.)
>>>>>> DEBUG 2014-02-03 18:14:17,899 (Job start thread) - Signalled for job start
>>>>>> for job 1385573203052
>>>>>> DEBUG 2014-02-03 18:14:26,787 (Startup thread) - Marked job
>>>>>> 1385573203052
>>>>>> for startup
>>>>>> DEBUG 2014-02-03 18:14:28,636 (Startup thread) - Job 1385573203052 is now
>>>>>> started
>>>>>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job
>>>>>> 1385573203052
>>>>>> for shutdown
>>>>>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job
>>>>>> 1385573203052 in need of notification
>>>>>> DEBUG 2014-02-03 18:27:59,356 (Job start thread) - Checking if job
>>>>>> 1385573203052 needs to be started; it was last checked at
>>>>>> 1391446794075,
>>>>>> and now it is 1391448479353
>>>>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) -  Time match FOUND
>>>>>> within interval 1391446794075 to 1391448479353
>>>>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) - Job
>>>>>> '1385573203052' is
>>>>>> within run window at 1391448479353 ms. (which starts at 1391446800000
>>>>>> ms.)
>>>>>> DEBUG 2014-02-03 18:27:59,430 (Job start thread) - Signalled for job start
>>>>>> for job 1385573203052
>>>>>> DEBUG 2014-02-03 18:28:09,309 (Startup thread) - Marked job
>>>>>> 1385573203052
>>>>>> for startup
>>>>>> DEBUG 2014-02-03 18:28:10,727 (Startup thread) - Job 1385573203052 is now
>>>>>> started
>>>>>> DEBUG 2014-02-03 18:41:18,202 (Finisher thread) - Marked job
>>>>>> 1385573203052
>>>>>> for shutdown
>>>>>> DEBUG 2014-02-03 18:41:23,636 (Job reset thread) - Job 1385573203052
>>>>>> now
>>>>>> completed
>>>>>> DEBUG 2014-02-03 18:41:25,368 (Job notification thread) - Found job
>>>>>> 1385573203052 in need of notification
>>>>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) - Checking if job
>>>>>> 1385573203052 needs to be started; it was last checked at
>>>>>> 1391449283114,
>>>>>> and now it is 1391449292400
>>>>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) -  No time match
>>>>>> found
>>>>>> within interval 1391449283114 to 1391449292400
>>>>>>
>>>>>>
>>>>>> Do you need another log output?
>>>>>>
>>>>>> Best,
>>>>>> Florian
>>>>>>
>>>>>> > Also, what does the log have to say?  If there is an error aborting the
>>>>>> > job, there should be some record of it in the manifoldcf.log.
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Karl
>>>>>> >
>>>>>> >
>>>>>> > On Tue, Feb 4, 2014 at 6:16 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>> >
>>>>>> >> Hi Florian,
>>>>>> >>
>>>>>> >> Please run the job manually, when outside the scheduling window or
>>>>>> >> with the scheduling off.  What is the reason for the job abort?
>>>>>> >>
>>>>>> >> Karl
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> On Tue, Feb 4, 2014 at 3:30 AM, Florian Schmedding <
>>>>>> >> schmeddi@informatik.uni-freiburg.de> wrote:
>>>>>> >>
>>>>>> >>> Hi Karl,
>>>>>> >>>
>>>>>> >>> yes, I've coincidentally seen "Aborted" in the end time column when I
>>>>>> >>> refreshed the job status just after the number of active documents was
>>>>>> >>> zero. At the next refresh the job was starting up. After looking in the
>>>>>> >>> history I found out that it even started a third time. You can see the
>>>>>> >>> history of a single day below (job continue, end, start, stop, unwait,
>>>>>> >>> wait). The start method is "Start at beginning of schedule window". Job
>>>>>> >>> invocation is "complete". Hop count mode is "Delete unreachable
>>>>>> >>> documents".
>>>>>> >>>
>>>>>> >>> 02.03.2014 18:41        job end
>>>>>> >>> 02.03.2014 18:28        job start
>>>>>> >>> 02.03.2014 18:14        job start
>>>>>> >>> 02.03.2014 18:00        job start
>>>>>> >>> 02.03.2014 17:49        job end
>>>>>> >>> 02.03.2014 17:27        job end
>>>>>> >>> 02.03.2014 17:13        job start
>>>>>> >>> 02.03.2014 17:00        job start
>>>>>> >>> 02.03.2014 16:13        job end
>>>>>> >>> 02.03.2014 16:00        job start
>>>>>> >>> 02.03.2014 15:41        job end
>>>>>> >>> 02.03.2014 15:27        job start
>>>>>> >>> 02.03.2014 15:14        job start
>>>>>> >>> 02.03.2014 15:00        job start
>>>>>> >>> 02.03.2014 14:13        job end
>>>>>> >>> 02.03.2014 14:00        job start
>>>>>> >>> 02.03.2014 13:13        job end
>>>>>> >>> 02.03.2014 13:00        job start
>>>>>> >>> 02.03.2014 12:27        job end
>>>>>> >>> 02.03.2014 12:14        job start
>>>>>> >>> 02.03.2014 12:00        job start
>>>>>> >>> 02.03.2014 11:13        job end
>>>>>> >>> 02.03.2014 11:00        job start
>>>>>> >>> 02.03.2014 10:13        job end
>>>>>> >>> 02.03.2014 10:00        job start
>>>>>> >>> 02.03.2014 09:29        job end
>>>>>> >>> 02.03.2014 09:14        job start
>>>>>> >>> 02.03.2014 09:00        job start
>>>>>> >>>
>>>>>> >>> Best,
>>>>>> >>> Florian
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> > Hi Florian,
>>>>>> >>> >
>>>>>> >>> > Jobs don't just abort randomly.  Are you sure that the job aborted?
>>>>>> >>> > Or did it just restart?
>>>>>> >>> >
>>>>>> >>> > As for "is this normal", it depends on how you have created your
>>>>>> >>> > job.  If you selected the "Start within schedule window" selection,
>>>>>> >>> > MCF will restart the job whenever it finishes and run it until the
>>>>>> >>> > end of the scheduling window.
>>>>>> >>> >
>>>>>> >>> > Karl
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>> > On Mon, Feb 3, 2014 at 12:24 PM, Florian Schmedding <
>>>>>> >>> > schmeddi@informatik.uni-freiburg.de> wrote:
>>>>>> >>> >
>>>>>> >>> >> Hi Karl,
>>>>>> >>> >>
>>>>>> >>> >> I've just observed that the job was started according to its
>>>>>> >>> >> schedule and crawled all documents correctly (I've chosen to
>>>>>> >>> >> re-ingest all documents before the run). However, after finishing
>>>>>> >>> >> the last document (zero active documents) it was somehow aborted
>>>>>> >>> >> and restarted immediately. Is this expected behavior?
>>>>>> >>> >>
>>>>>> >>> >> Best,
>>>>>> >>> >> Florian
>>>>>> >>> >>
>>>>>> >>> >>
>>>>>> >>> >> > Hi Florian,
>>>>>> >>> >> >
>>>>>> >>> >> > Based on this schedule, your crawls will be able to start
>>>>>> >>> >> > whenever the hour turns.  So they can start every hour on the
>>>>>> >>> >> > hour.  If the last crawl crossed an hour boundary, the next
>>>>>> >>> >> > crawl will start immediately, I believe.
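That hour-boundary behaviour can be sketched as follows. This is a simplified model assuming hourly windows, and the helper name is illustrative: a start matches when a top-of-the-hour boundary falls inside the interval since the last check.

```java
public class RunWindowCheck {
    private static final long HOUR_MS = 3_600_000L;

    // True if some top-of-the-hour boundary lies in (lastChecked, now].
    // A crawl that ran past an hour boundary leaves that boundary inside
    // the unchecked interval, so the next crawl starts immediately.
    static boolean timeMatch(long lastCheckedMs, long nowMs) {
        long nextBoundary = (lastCheckedMs / HOUR_MS + 1) * HOUR_MS;
        return nextBoundary <= nowMs;
    }

    public static void main(String[] args) {
        // Values from the log excerpts earlier in the thread:
        System.out.println(timeMatch(1391440412615L, 1391440427102L)); // false: no match
        System.out.println(timeMatch(1391446794075L, 1391447647733L)); // true: an hour boundary was crossed
    }
}
```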
>>>>>> >>> >> >
>>>>>> >>> >> > Karl
>>>>>> >>> >> >
>>>>>> >>> >> >
>>>>>> >>> >> >
>>>>>> >>> >> > On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <
>>>>>> >>> >> > schmeddi@informatik.uni-freiburg.de> wrote:
>>>>>> >>> >> >
>>>>>> >>> >> >> Hi Karl,
>>>>>> >>> >> >>
>>>>>> >>> >> >> these are the values:
>>>>>> >>> >> >> Priority:                  5
>>>>>> >>> >> >> Start method:              Start at beginning of schedule window
>>>>>> >>> >> >> Schedule type:             Scan every document once
>>>>>> >>> >> >> Minimum recrawl interval:  Not applicable
>>>>>> >>> >> >> Expiration interval:       Not applicable
>>>>>> >>> >> >> Reseed interval:           Not applicable
>>>>>> >>> >> >> Scheduled time:            Any day of week at 12 am 1 am 2 am
>>>>>> >>> >> >>                            3 am 4 am 5 am 6 am 7 am 8 am 9 am
>>>>>> >>> >> >>                            10 am 11 am 12 pm 1 pm 2 pm 3 pm
>>>>>> >>> >> >>                            4 pm 5 pm 6 pm 7 pm 8 pm 9 pm 10 pm
>>>>>> >>> >> >>                            11 pm
>>>>>> >>> >> >> Maximum run time:          No limit
>>>>>> >>> >> >> Job invocation:            Complete
>>>>>> >>> >> >>
>>>>>> >>> >> >> Maybe it is because I've changed the job from continuous
>>>>>> >>> >> >> crawling to this schedule. I started it a few times manually,
>>>>>> >>> >> >> too. I didn't notice anything strange in the job setup or in
>>>>>> >>> >> >> the respective entries in the database.
>>>>>> >>> >> >>
>>>>>> >>> >> >> Regards,
>>>>>> >>> >> >> Florian
>>>>>> >>> >> >>
>>>>>> >>> >> >> > Hi Florian,
>>>>>> >>> >> >> >
>>>>>> >>> >> >> > I was unable to reproduce the behavior you described.
>>>>>> >>> >> >> >
>>>>>> >>> >> >> > Could you view your job, and post a screen shot of that page?
>>>>>> >>> >> >> > I want to see what your schedule record(s) look like.
>>>>>> >>> >> >> >
>>>>>> >>> >> >> > Thanks,
>>>>>> >>> >> >> > Karl
>>>>>> >>> >> >> >
>>>>>> >>> >> >> >
>>>>>> >>> >> >> >
>>>>>> >>> >> >> > On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright <daddywri@gmail.com>
>>>>>> >>> >> >> > wrote:
>>>>>> >>> >> >> >
>>>>>> >>> >> >> >> Hi Florian,
>>>>>> >>> >> >> >>
>>>>>> >>> >> >> >> I've never noted this behavior before.  I'll see if I can
>>>>>> >>> >> >> >> reproduce it here.
>>>>>> >>> >> >> >>
>>>>>> >>> >> >> >> Karl
>>>>>> >>> >> >> >>
>>>>>> >>> >> >> >>
>>>>>> >>> >> >> >>
>>>>>> >>> >> >> >> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <
>>>>>> >>> >> >> >> schmeddi@informatik.uni-freiburg.de> wrote:
>>>>>> >>> >> >> >>
>>>>>> >>> >> >> >>> Hi Karl,
>>>>>> >>> >> >> >>>
>>>>>> >>> >> >> >>> the scheduled job seems to work as expected. However, it
>>>>>> >>> >> >> >>> runs twice: it starts at the beginning of the scheduled
>>>>>> >>> >> >> >>> time, finishes, and immediately starts again. After
>>>>>> >>> >> >> >>> finishing the second run it waits for the next scheduled
>>>>>> >>> >> >> >>> time. Why does it run twice? The start method is "Start at
>>>>>> >>> >> >> >>> beginning of schedule window".
>>>>>> >>> >> >> >>>
>>>>>> >>> >> >> >>> Yes, you're right about the checking guarantee. Currently,
>>>>>> >>> >> >> >>> our interval is long enough for a complete crawler run.
>>>>>> >>> >> >> >>>
>>>>>> >>> >> >> >>> Best,
>>>>>> >>> >> >> >>> Florian
>>>>>> >>> >> >> >>>
>>>>>> >>> >> >> >>>
>>>>>> >>> >> >> >>> > Hi Florian,
>>>>>> >>> >> >> >>> >
>>>>>> >>> >> >> >>> > It is impossible to *guarantee* that a document will be
>>>>>> >>> >> >> >>> > checked, because if load on the crawler is high enough,
>>>>>> >>> >> >> >>> > it will fall behind.  But I will look into adding the
>>>>>> >>> >> >> >>> > feature you request.
>>>>>> >>> >> >> >>> >
>>>>>> >>> >> >> >>> > Karl
>>>>>> >>> >> >> >>> >
>>>>>> >>> >> >> >>> >
>>>>>> >>> >> >> >>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <
>>>>>> >>> >> >> >>> > schmeddi@informatik.uni-freiburg.de> wrote:
>>>>>> >>> >> >> >>> >
>>>>>> >>> >> >> >>> >> Hi Karl,
>>>>>> >>> >> >> >>> >>
>>>>>> >>> >> >> >>> >> yes, in our case it is necessary to make sure that new
>>>>>> >>> >> >> >>> >> documents are discovered and indexed within a certain
>>>>>> >>> >> >> >>> >> interval. I have created a feature request on that. In
>>>>>> >>> >> >> >>> >> the meantime we will try to use a scheduled job instead.
>>>>>> >>> >> >> >>> >>
>>>>>> >>> >> >> >>> >> Thanks for your help,
>>>>>> >>> >> >> >>> >> Florian
>>>>>> >>> >> >> >>> >>
>>>>>> >>> >> >> >>> >>
>>>>>> >>> >> >> >>> >> > Hi Florian,
>>>>>> >>> >> >> >>> >> >
>>>>>> >>> >> >> >>> >> > What you are seeing is "dynamic crawling" behavior.
>>>>>> >>> >> >> >>> >> > The time between refetches of a document is based on
>>>>>> >>> >> >> >>> >> > the history of fetches of that document.  The recrawl
>>>>>> >>> >> >> >>> >> > interval is the initial time between document fetches,
>>>>>> >>> >> >> >>> >> > but if a document does not change, the interval for
>>>>>> >>> >> >> >>> >> > the document increases according to a formula.
>>>>>> >>> >> >> >>> >> >
>>>>>> >>> >> >> >>> >> > I would need to look at the code to be able to give
>>>>>> >>> >> >> >>> >> > you the precise formula, but if you need a limit on
>>>>>> >>> >> >> >>> >> > the amount of time between document fetch attempts, I
>>>>>> >>> >> >> >>> >> > suggest you create a ticket and I will look into
>>>>>> >>> >> >> >>> >> > adding that as a feature.
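The growth of the refetch interval can be pictured as a simple multiplicative backoff. The doubling factor and cap below are purely illustrative, since the actual MCF formula is not quoted in this thread.

```java
public class RecrawlBackoff {
    // Illustrative only: unchanged documents back off multiplicatively,
    // capped at maxMs; a changed document resets to the base interval.
    static long nextInterval(long baseMs, long currentMs, boolean changed, long maxMs) {
        if (changed)
            return baseMs;                       // change seen: start over
        return Math.min(currentMs * 2, maxMs);   // unchanged: wait longer next time
    }

    public static void main(String[] args) {
        long interval = 60_000L; // hypothetical 1-minute base recrawl interval
        for (int i = 0; i < 4; i++) {
            interval = nextInterval(60_000L, interval, false, 86_400_000L);
            System.out.println(interval); // 120000, 240000, 480000, 960000
        }
    }
}
```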
>>>>>> >>> >> >> >>> >> >
>>>>>> >>> >> >> >>> >> > Thanks,
>>>>>> >>> >> >> >>> >> > Karl
>>>>>> >>> >> >> >>> >> >
>>>>>> >>> >> >> >>> >> >
>>>>>> >>> >> >> >>> >> >
>>>>>> >>> >> >> >>> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <
>>>>>> >>> >> >> >>> >> > schmeddi@informatik.uni-freiburg.de> wrote:
>>>>>> >>> >> >> >>> >> >
>>>>>> >>> >> >> >>> >> >> Hello,
>>>>>> >>> >> >> >>> >> >>
>>>>>> >>> >> >> >>> >> >> the parameters reseed interval and recrawl interval
>>>>>> >>> >> >> >>> >> >> of a continuous crawling job are not quite clear to
>>>>>> >>> >> >> >>> >> >> me. The documentation says that the reseed interval
>>>>>> >>> >> >> >>> >> >> is the time after which the seeds are checked again,
>>>>>> >>> >> >> >>> >> >> and the recrawl interval is the time after which a
>>>>>> >>> >> >> >>> >> >> document is checked for changes.
>>>>>> >>> >> >> >>> >> >>
>>>>>> >>> >> >> >>> >> >> However, we observed that the recrawl interval for a
>>>>>> >>> >> >> >>> >> >> document increases after each check. On the other
>>>>>> >>> >> >> >>> >> >> hand, the reseed interval seems to be set up
>>>>>> >>> >> >> >>> >> >> correctly in the database metadata about the seed
>>>>>> >>> >> >> >>> >> >> documents. Yet the web server does not receive
>>>>>> >>> >> >> >>> >> >> requests each time the interval elapses but only
>>>>>> >>> >> >> >>> >> >> after several intervals have elapsed.
>>>>>> >>> >> >> >>> >> >>
>>>>>> >>> >> >> >>> >> >> We are using a web connector. The web server does
>>>>>> >>> >> >> >>> >> >> not tell the client to cache the documents. Any help
>>>>>> >>> >> >> >>> >> >> would be appreciated.
>>>>>> >>> >> >> >>> >> >>
>>>>>> >>> >> >> >>> >> >> Best regards,
>>>>>> >>> >> >> >>> >> >> Florian
>>>>>> >>> >> >> >>> >> >>
>>>>>> >>> >> >> >>> >> >>
>>>>>> >>> >> >> >>> >> >>
>>>>>> >>> >> >> >>> >> >>
>>>>>> >>> >> >> >>> >> >
>>>>>> >>> >> >> >>> >>
>>>>>> >>> >> >> >>> >>
>>>>>> >>> >> >> >>> >>
>>>>>> >>> >> >> >>> >
>>>>>> >>> >> >> >>>
>>>>>> >>> >> >> >>>
>>>>>> >>> >> >> >>>
>>>>>> >>> >> >> >>
>>>>>> >>> >> >> >
>>>>>> >>> >> >>
>>>>>> >>> >> >>
>>>>>> >>> >> >>
>>>>>> >>> >> >
>>>>>> >>> >>
>>>>>> >>> >>
>>>>>> >>> >>
>>>>>> >>> >
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
