manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Continuous crawling
Date Wed, 05 Feb 2014 15:11:28 GMT
Hi Florian,

That's the whole point; the exception is taking place but is not being
properly logged, due to a bug.  That's why it has been so confusing.
CONNECTORS-880 should at least fix the logging bug, though not the
underlying exception that triggers it.


Karl
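
For illustration, a minimal Java sketch of the failure mode being discussed: an
exception thrown while removing documents during cleanup is swallowed without
being recorded in the job's error field, so the job record never shows why the
run did not complete. All names here are hypothetical; this is not the actual
ManifoldCF code.

    // Hypothetical sketch of the failure mode described above -- NOT the
    // actual ManifoldCF code. An exception from the output connector during
    // the cleanup phase is caught, but the job's error field is never
    // populated, so there is no visible reason for the missed state change.
    public class CleanupPhaseSketch {

      interface OutputConnector {
        void removeDocument(String documentUri) throws Exception;
      }

      interface JobRecord {
        void setErrorText(String error);   // what the fix should ensure gets called
        void markCompleted();              // advances the "last checked" time
      }

      static void cleanUpUnreachableDocuments(OutputConnector connector,
                                              JobRecord job,
                                              Iterable<String> unreachableDocs) {
        try {
          for (String uri : unreachableDocs) {
            connector.removeDocument(uri);
          }
          job.markCompleted();
        } catch (Exception e) {
          // Buggy behavior: swallowing the exception leaves the job record
          // untouched -- no error text, no completion -- so the scheduler
          // later sees a stale "last checked" time and restarts the job.
          // Conceptually, the fix records the error before bailing out:
          job.setErrorText("Cleanup failed: " + e.getMessage());
        }
      }
    }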



On Wed, Feb 5, 2014 at 10:07 AM, Florian Schmedding <
schmeddi@informatik.uni-freiburg.de> wrote:

> Hi Karl,
>
> thanks for the fix. However, it is a bit difficult to try because I do not
> have a test system with the same setup. Before trying it I'm going to log
> all output from ManifoldCF to check whether any error is visible when a job
> completes and restarts unexpectedly.
>
> Best,
> Florian
>
>
> > Any luck with this?
> > Karl
> >
> >
> > On Tue, Feb 4, 2014 at 4:15 PM, Karl Wright <daddywri@gmail.com> wrote:
> >
> >> I've created a branch at:
> >> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-880 .
> >> This contains my proposed fix; please try it out.  If you would like, I
> >> can also attach a patch, although I'm not certain it would apply cleanly
> >> to MCF 1.4.1 sources.
> >>
> >> Karl
> >>
> >>
> >>
> >> On Tue, Feb 4, 2014 at 2:37 PM, Karl Wright <daddywri@gmail.com> wrote:
> >>
> >>> Hi Florian,
> >>>
> >>> I'm pretty sure now that what is happening is that your output connector
> >>> is throwing some kind of exception when it is asked to remove documents
> >>> during the cleanup phase of the crawl.  The state transitions in the
> >>> framework seem to be incorrect under these conditions, and the error is
> >>> likely not logged into the job's error field.  The ticket I've created
> >>> to address this is CONNECTORS-880.
> >>>
> >>> Karl
> >>>
> >>>
> >>>
> >>> On Tue, Feb 4, 2014 at 2:14 PM, Karl Wright <daddywri@gmail.com> wrote:
> >>>
> >>>> The code path for an abort sequence looks pretty iron-clad.  The fact
> >>>> that the bad-case output:
> >>>>
> >>>> >>>>>>
> >>>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job 1385573203052 for shutdown
> >>>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job 1385573203052 in need of notification
> >>>> <<<<<<
> >>>>
> >>>> does not include:
> >>>>
> >>>> >>>>>>
> >>>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now completed
> >>>> <<<<<<
> >>>>
> >>>> is very significant, because that is the step in which the last-check
> >>>> time would normally be updated, in the method JobManager.finishJob().
> >>>> If an abort took place, it would have started BEFORE all this; once the
> >>>> job state gets set to STATUS_SHUTTINGDOWN, there is no way that the job
> >>>> can be aborted, either manually or by repository-connector related
> >>>> activity.  At that time the job is cleaning up documents that are no
> >>>> longer reachable.  I will check to see what happens if the output
> >>>> connector throws an exception during this phase; it's the only thing I
> >>>> can think of that might derail the job from finishing.
> >>>>
> >>>> Karl
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Feb 4, 2014 at 1:29 PM, Karl Wright <daddywri@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Florian,
> >>>>>
> >>>>> The only way this can happen is if the proper job termination state
> >>>>> sequence does not take place.  When MCF checks to see if a job should
> >>>>> be started, if it determines that the answer is "no" it updates the job
> >>>>> record immediately with a new "last checked" value.  But if it starts
> >>>>> the job, it waits for the job completion to take place before updating
> >>>>> the job's "last checked" time.  When a job aborts, at first glance it
> >>>>> looks like it also does the right thing, but clearly that's not true,
> >>>>> and there must be a bug somewhere in how this condition is handled.
> >>>>>
> >>>>> I'll create a ticket to research this.  In the interim, I suggest you
> >>>>> figure out why your job is aborting in the first place.
> >>>>>
> >>>>> Thanks,
> >>>>> Karl
> >>>>>
> >>>>>
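
A rough sketch of the bookkeeping described in the quoted paragraph above,
using hypothetical names; the real logic lives in the ManifoldCF job manager
(e.g. JobManager.finishJob(), mentioned elsewhere in this thread).

    // Minimal sketch of the "last checked" bookkeeping described above.
    // Names are hypothetical; this is not the actual ManifoldCF code.
    public class JobStartCheckSketch {

      static final class Job {
        long lastCheckTime;   // last time the scheduler evaluated this job
        boolean withinRunWindow(long now) { /* schedule evaluation */ return true; }
      }

      /** Called periodically by the job-start thread. */
      static void checkJob(Job job, long now) {
        if (!job.withinRunWindow(now)) {
          // Answer is "no": the job record is updated immediately,
          // so the next check starts from 'now'.
          job.lastCheckTime = now;
        } else {
          startJob(job);
          // Answer is "yes": lastCheckTime is deliberately NOT advanced here;
          // it is only advanced when the job completes normally.  If the
          // completion step never runs, the stale value keeps matching the
          // schedule window and the job is started again and again.
        }
      }

      static void startJob(Job job) { /* signal the startup thread */ }
    }
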
> >>>>> On Tue, Feb 4, 2014 at 11:49 AM, Karl Wright <daddywri@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Florian,
> >>>>>>
> >>>>>> I do not expect errors to appear in the Tomcat log.
> >>>>>>
> >>>>>> But this is interesting:
> >>>>>>
> >>>>>> Good:
> >>>>>>
> >>>>>> >>>>>>
> >>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391439592120, and now it is 1391439602151
> >>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) -  Time match FOUND within interval 1391439592120 to 1391439602151
> >>>>>>  ...
> >>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391440412615, and now it is 1391440427102
> >>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) -  No time match found within interval 1391440412615 to 1391440427102
> >>>>>> <<<<<<
> >>>>>> "last checked" time for job is updated.
> >>>>>>
> >>>>>> Bad:
> >>>>>>
> >>>>>> >>>>>>
> >>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391446804106
> >>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) -  Time match FOUND within interval 1391446794075 to 1391446804106
> >>>>>>  ...
> >>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391447647733
> >>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) -  Time match FOUND within interval 1391446794075 to 1391447647733
> >>>>>> <<<<<<
> >>>>>> Note that the "last checked" time is NOT updated.
> >>>>>>
> >>>>>> I don't understand why the "last checked" time is updated for the job
> >>>>>> in one case but not in the other.  I will look to see whether there is
> >>>>>> any way in the code that this can happen.
> >>>>>>
> >>>>>> Karl
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Feb 4, 2014 at 10:45 AM, Florian Schmedding <
> >>>>>> schmeddi@informatik.uni-freiburg.de> wrote:
> >>>>>>
> >>>>>>> Hi Karl,
> >>>>>>>
> >>>>>>> there are no errors in the Tomcat logs. Currently, the ManifoldCF log
> >>>>>>> contains only the job log messages (<property
> >>>>>>> name="org.apache.manifoldcf.jobs" value="ALL"/>). I include two log
> >>>>>>> snippets, one from a normal run and one where the job was repeated
> >>>>>>> twice. I noticed the thread sequence "Finisher - Job reset - Job
> >>>>>>> notification" when the job finally terminates, and the sequence
> >>>>>>> "Finisher - Job notification" when the job is restarted again instead
> >>>>>>> of terminating.
> >>>>>>>
> >>>>>>>
> >>>>>>> DEBUG 2014-02-03 15:59:52,130 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391439582108, and now it is 1391439592119
> >>>>>>> DEBUG 2014-02-03 15:59:52,131 (Job start thread) -  No time match found within interval 1391439582108 to 1391439592119
> >>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391439592120, and now it is 1391439602151
> >>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) -  Time match FOUND within interval 1391439592120 to 1391439602151
> >>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Job '1385573203052' is within run window at 1391439602151 ms. (which starts at 1391439600000 ms.)
> >>>>>>> DEBUG 2014-02-03 16:00:02,288 (Job start thread) - Signalled for job start for job 1385573203052
> >>>>>>> DEBUG 2014-02-03 16:00:11,319 (Startup thread) - Marked job 1385573203052 for startup
> >>>>>>> DEBUG 2014-02-03 16:00:12,719 (Startup thread) - Job 1385573203052 is now started
> >>>>>>> DEBUG 2014-02-03 16:13:30,234 (Finisher thread) - Marked job 1385573203052 for shutdown
> >>>>>>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now completed
> >>>>>>> DEBUG 2014-02-03 16:13:37,541 (Job notification thread) - Found job 1385573203052 in need of notification
> >>>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391440412615, and now it is 1391440427102
> >>>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) -  No time match found within interval 1391440412615 to 1391440427102
> >>>>>>>
> >>>>>>>
> >>>>>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446784053, and now it is 1391446794074
> >>>>>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) -  No time match found within interval 1391446784053 to 1391446794074
> >>>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391446804106
> >>>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) -  Time match FOUND within interval 1391446794075 to 1391446804106
> >>>>>>> DEBUG 2014-02-03 18:00:04,110 (Job start thread) - Job '1385573203052' is within run window at 1391446804106 ms. (which starts at 1391446800000 ms.)
> >>>>>>> DEBUG 2014-02-03 18:00:04,178 (Job start thread) - Signalled for job start for job 1385573203052
> >>>>>>> DEBUG 2014-02-03 18:00:11,710 (Startup thread) - Marked job 1385573203052 for startup
> >>>>>>> DEBUG 2014-02-03 18:00:13,408 (Startup thread) - Job 1385573203052 is now started
> >>>>>>> DEBUG 2014-02-03 18:14:04,286 (Finisher thread) - Marked job 1385573203052 for shutdown
> >>>>>>> DEBUG 2014-02-03 18:14:06,777 (Job notification thread) - Found job 1385573203052 in need of notification
> >>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391447647733
> >>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) -  Time match FOUND within interval 1391446794075 to 1391447647733
> >>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Job '1385573203052' is within run window at 1391447647733 ms. (which starts at 1391446800000 ms.)
> >>>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391447657740
> >>>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) -  Time match FOUND within interval 1391446794075 to 1391447657740
> >>>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Job '1385573203052' is within run window at 1391447657740 ms. (which starts at 1391446800000 ms.)
> >>>>>>> DEBUG 2014-02-03 18:14:17,899 (Job start thread) - Signalled for job start for job 1385573203052
> >>>>>>> DEBUG 2014-02-03 18:14:26,787 (Startup thread) - Marked job 1385573203052 for startup
> >>>>>>> DEBUG 2014-02-03 18:14:28,636 (Startup thread) - Job 1385573203052 is now started
> >>>>>>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job 1385573203052 for shutdown
> >>>>>>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job 1385573203052 in need of notification
> >>>>>>> DEBUG 2014-02-03 18:27:59,356 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391448479353
> >>>>>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) -  Time match FOUND within interval 1391446794075 to 1391448479353
> >>>>>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) - Job '1385573203052' is within run window at 1391448479353 ms. (which starts at 1391446800000 ms.)
> >>>>>>> DEBUG 2014-02-03 18:27:59,430 (Job start thread) - Signalled for job start for job 1385573203052
> >>>>>>> DEBUG 2014-02-03 18:28:09,309 (Startup thread) - Marked job 1385573203052 for startup
> >>>>>>> DEBUG 2014-02-03 18:28:10,727 (Startup thread) - Job 1385573203052 is now started
> >>>>>>> DEBUG 2014-02-03 18:41:18,202 (Finisher thread) - Marked job 1385573203052 for shutdown
> >>>>>>> DEBUG 2014-02-03 18:41:23,636 (Job reset thread) - Job 1385573203052 now completed
> >>>>>>> DEBUG 2014-02-03 18:41:25,368 (Job notification thread) - Found job 1385573203052 in need of notification
> >>>>>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391449283114, and now it is 1391449292400
> >>>>>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) -  No time match found within interval 1391449283114 to 1391449292400
> >>>>>>>
> >>>>>>>
> >>>>>>> Do you need any other log output?
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Florian
> >>>>>>>
> >>>>>>> > Also, what does the log have to say?  If there is an error aborting
> >>>>>>> > the job, there should be some record of it in the manifoldcf.log.
> >>>>>>> >
> >>>>>>> > Thanks,
> >>>>>>> > Karl
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > On Tue, Feb 4, 2014 at 6:16 AM, Karl Wright <daddywri@gmail.com> wrote:
> >>>>>>> >
> >>>>>>> >> Hi Florian,
> >>>>>>> >>
> >>>>>>> >> Please run the job manually, either outside the scheduling window
> >>>>>>> >> or with the scheduling off.  What is the reason for the job abort?
> >>>>>>> >>
> >>>>>>> >> Karl
> >>>>>>> >>
> >>>>>>> >>
> >>>>>>> >>
> >>>>>>> >> On Tue, Feb 4, 2014 at 3:30 AM, Florian Schmedding <
> >>>>>>> >> schmeddi@informatik.uni-freiburg.de> wrote:
> >>>>>>> >>
> >>>>>>> >>> Hi Karl,
> >>>>>>> >>>
> >>>>>>> >>> yes, I happened to see "Aborted" in the end time column when I
> >>>>>>> >>> refreshed the job status just after the number of active documents
> >>>>>>> >>> reached zero. At the next refresh the job was starting up. After
> >>>>>>> >>> looking in the history I found out that it even started a third
> >>>>>>> >>> time. You can see the history of a single day below (job continue,
> >>>>>>> >>> end, start, stop, unwait, wait). The start method is "Start at
> >>>>>>> >>> beginning of schedule window". Job invocation is "complete". Hop
> >>>>>>> >>> count mode is "Delete unreachable documents".
> >>>>>>> >>>
> >>>>>>> >>> 02.03.2014 18:41        job end
> >>>>>>> >>> 02.03.2014 18:28        job start
> >>>>>>> >>> 02.03.2014 18:14        job start
> >>>>>>> >>> 02.03.2014 18:00        job start
> >>>>>>> >>> 02.03.2014 17:49        job end
> >>>>>>> >>> 02.03.2014 17:27        job end
> >>>>>>> >>> 02.03.2014 17:13        job start
> >>>>>>> >>> 02.03.2014 17:00        job start
> >>>>>>> >>> 02.03.2014 16:13        job end
> >>>>>>> >>> 02.03.2014 16:00        job start
> >>>>>>> >>> 02.03.2014 15:41        job end
> >>>>>>> >>> 02.03.2014 15:27        job start
> >>>>>>> >>> 02.03.2014 15:14        job start
> >>>>>>> >>> 02.03.2014 15:00        job start
> >>>>>>> >>> 02.03.2014 14:13        job end
> >>>>>>> >>> 02.03.2014 14:00        job start
> >>>>>>> >>> 02.03.2014 13:13        job end
> >>>>>>> >>> 02.03.2014 13:00        job start
> >>>>>>> >>> 02.03.2014 12:27        job end
> >>>>>>> >>> 02.03.2014 12:14        job start
> >>>>>>> >>> 02.03.2014 12:00        job start
> >>>>>>> >>> 02.03.2014 11:13        job end
> >>>>>>> >>> 02.03.2014 11:00        job start
> >>>>>>> >>> 02.03.2014 10:13        job end
> >>>>>>> >>> 02.03.2014 10:00        job start
> >>>>>>> >>> 02.03.2014 09:29        job end
> >>>>>>> >>> 02.03.2014 09:14        job start
> >>>>>>> >>> 02.03.2014 09:00        job start
> >>>>>>> >>>
> >>>>>>> >>> Best,
> >>>>>>> >>> Florian
> >>>>>>> >>>
> >>>>>>> >>>
> >>>>>>> >>> > Hi Florian,
> >>>>>>> >>> >
> >>>>>>> >>> > Jobs don't just abort randomly.  Are you sure that the job
> >>>>>>> >>> > aborted?  Or did it just restart?
> >>>>>>> >>> >
> >>>>>>> >>> > As for "is this normal", it depends on how you have created your
> >>>>>>> >>> > job.  If you selected the "Start within schedule window" option,
> >>>>>>> >>> > MCF will restart the job whenever it finishes and run it until
> >>>>>>> >>> > the end of the scheduling window.
> >>>>>>> >>> >
> >>>>>>> >>> > Karl
> >>>>>>> >>> >
> >>>>>>> >>> >
> >>>>>>> >>> >
> >>>>>>> >>> > On Mon, Feb 3, 2014 at 12:24 PM, Florian Schmedding <
> >>>>>>> >>> > schmeddi@informatik.uni-freiburg.de> wrote:
> >>>>>>> >>> >
> >>>>>>> >>> >> Hi Karl,
> >>>>>>> >>> >>
> >>>>>>> >>> >> I've just observed that the job was started according to its
> >>>>>>> >>> >> schedule and crawled all documents correctly (I had chosen to
> >>>>>>> >>> >> re-ingest all documents before the run). However, after
> >>>>>>> >>> >> finishing the last document (zero active documents) it was
> >>>>>>> >>> >> somehow aborted and restarted immediately. Is this expected
> >>>>>>> >>> >> behavior?
> >>>>>>> >>> >>
> >>>>>>> >>> >> Best,
> >>>>>>> >>> >> Florian
> >>>>>>> >>> >>
> >>>>>>> >>> >>
> >>>>>>> >>> >> > Hi Florian,
> >>>>>>> >>> >> >
> >>>>>>> >>> >> > Based on this schedule, your crawls will be able to start
> >>>>>>> >>> >> > whenever the hour turns.  So they can start every hour, on the
> >>>>>>> >>> >> > hour.  If the last crawl crossed an hour boundary, the next
> >>>>>>> >>> >> > crawl will start immediately, I believe.
> >>>>>>> >>> >> >
> >>>>>>> >>> >> > Karl
> >>>>>>> >>> >> >
> >>>>>>> >>> >> >
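
To make the hour-boundary point above concrete, a small self-contained Java
sketch (a hypothetical helper, not the actual ManifoldCF scheduling code)
showing why a crawl that runs past the top of the hour is started again right
away, using timestamps from the "bad" log excerpt earlier in the thread.

    // Rough illustration of why a crawl that crosses an hour boundary
    // restarts immediately: with an "any day, every hour" schedule a new run
    // window begins at the top of each hour, and a match is declared when a
    // window start falls inside (lastCheckedTime, now].  Hypothetical helper;
    // not the actual ManifoldCF code.
    public class HourlyWindowSketch {

      static final long HOUR_MS = 60L * 60L * 1000L;

      /** True if some top-of-the-hour window start lies in (lastChecked, now]. */
      static boolean timeMatchFound(long lastCheckedMs, long nowMs) {
        long firstWindowAfterLastCheck = ((lastCheckedMs / HOUR_MS) + 1) * HOUR_MS;
        return firstWindowAfterLastCheck <= nowMs;
      }

      public static void main(String[] args) {
        // Values taken from the "bad" log excerpt: last checked at
        // 1391446794075 (about 17:59:54), now 1391447647733 (18:14:07); the
        // 18:00 window start, 1391446800000, falls in between, so the job is
        // started again.
        System.out.println(timeMatchFound(1391446794075L, 1391447647733L)); // true
        // Had the last-checked time been advanced to 18:14:07 (as happens
        // after a normal completion), no match would be found until the
        // next hour.
        System.out.println(timeMatchFound(1391447647733L, 1391447657740L)); // false
      }
    }
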
> >>>>>>> >>> >> >
> >>>>>>> >>> >> > On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <
> >>>>>>> >>> >> > schmeddi@informatik.uni-freiburg.de> wrote:
> >>>>>>> >>> >> >
> >>>>>>> >>> >> >> Hi Karl,
> >>>>>>> >>> >> >>
> >>>>>>> >>> >> >> these are the values:
> >>>>>>> >>> >> >> Priority: 5
> >>>>>>> >>> >> >> Start method: Start at beginning of schedule window
> >>>>>>> >>> >> >> Schedule type: Scan every document once
> >>>>>>> >>> >> >> Minimum recrawl interval: Not applicable
> >>>>>>> >>> >> >> Expiration interval: Not applicable
> >>>>>>> >>> >> >> Reseed interval: Not applicable
> >>>>>>> >>> >> >> Scheduled time: Any day of week at 12 am 1 am 2 am 3 am 4 am
> >>>>>>> >>> >> >> 5 am 6 am 7 am 8 am 9 am 10 am 11 am 12 pm 1 pm 2 pm 3 pm
> >>>>>>> >>> >> >> 4 pm 5 pm 6 pm 7 pm 8 pm 9 pm 10 pm 11 pm
> >>>>>>> >>> >> >> Maximum run time: No limit
> >>>>>>> >>> >> >> Job invocation: Complete
> >>>>>>> >>> >> >>
> >>>>>>> >>> >> >> Maybe it is because I've changed the job from continuous
> >>>>>>> >>> >> >> crawling to this schedule. I started it a few times manually,
> >>>>>>> >>> >> >> too. I didn't notice anything strange in the job setup or in
> >>>>>>> >>> >> >> the respective entries in the database.
> >>>>>>> >>> >> >>
> >>>>>>> >>> >> >> Regards,
> >>>>>>> >>> >> >> Florian
> >>>>>>> >>> >> >>
> >>>>>>> >>> >> >> > Hi Florian,
> >>>>>>> >>> >> >> >
> >>>>>>> >>> >> >> > I was unable to reproduce the behavior you described.
> >>>>>>> >>> >> >> >
> >>>>>>> >>> >> >> > Could you view your job and post a screenshot of that
> >>>>>>> >>> >> >> > page?  I want to see what your schedule record(s) look like.
> >>>>>>> >>> >> >> >
> >>>>>>> >>> >> >> > Thanks,
> >>>>>>> >>> >> >> > Karl
> >>>>>>> >>> >> >> >
> >>>>>>> >>> >> >> >
> >>>>>>> >>> >> >> >
> >>>>>>> >>> >> >> > On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright
> >>>>>>> >>> >> >> > <daddywri@gmail.com> wrote:
> >>>>>>> >>> >> >> >
> >>>>>>> >>> >> >> >> Hi Florian,
> >>>>>>> >>> >> >> >>
> >>>>>>> >>> >> >> >> I've never noticed this behavior before.  I'll see if I
> >>>>>>> >>> >> >> >> can reproduce it here.
> >>>>>>> >>> >> >> >>
> >>>>>>> >>> >> >> >> Karl
> >>>>>>> >>> >> >> >>
> >>>>>>> >>> >> >> >>
> >>>>>>> >>> >> >> >>
> >>>>>>> >>> >> >> >> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <
> >>>>>>> >>> >> >> >> schmeddi@informatik.uni-freiburg.de> wrote:
> >>>>>>> >>> >> >> >>
> >>>>>>> >>> >> >> >>> Hi Karl,
> >>>>>>> >>> >> >> >>>
> >>>>>>> >>> >> >> >>> the scheduled job seems to work as expected. However, it
> >>>>>>> >>> >> >> >>> runs twice: it starts at the beginning of the scheduled
> >>>>>>> >>> >> >> >>> time, finishes, and immediately starts again. After
> >>>>>>> >>> >> >> >>> finishing the second run it waits for the next scheduled
> >>>>>>> >>> >> >> >>> time. Why does it run twice? The start method is "Start
> >>>>>>> >>> >> >> >>> at beginning of schedule window".
> >>>>>>> >>> >> >> >>>
> >>>>>>> >>> >> >> >>> Yes, you're right about the checking guarantee.
> >>>>>>> >>> >> >> >>> Currently, our interval is long enough for a complete
> >>>>>>> >>> >> >> >>> crawler run.
> >>>>>>> >>> >> >> >>>
> >>>>>>> >>> >> >> >>> Best,
> >>>>>>> >>> >> >> >>> Florian
> >>>>>>> >>> >> >> >>>
> >>>>>>> >>> >> >> >>>
> >>>>>>> >>> >> >> >>> > Hi Florian,
> >>>>>>> >>> >> >> >>> >
> >>>>>>> >>> >> >> >>> > It is impossible to *guarantee* that a document will be
> >>>>>>> >>> >> >> >>> > checked, because if load on the crawler is high enough,
> >>>>>>> >>> >> >> >>> > it will fall behind.  But I will look into adding the
> >>>>>>> >>> >> >> >>> > feature you request.
> >>>>>>> >>> >> >> >>> >
> >>>>>>> >>> >> >> >>> > Karl
> >>>>>>> >>> >> >> >>> >
> >>>>>>> >>> >> >> >>> >
> >>>>>>> >>> >> >> >>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <
> >>>>>>> >>> >> >> >>> > schmeddi@informatik.uni-freiburg.de> wrote:
> >>>>>>> >>> >> >> >>> >
> >>>>>>> >>> >> >> >>> >> Hi Karl,
> >>>>>>> >>> >> >> >>> >>
> >>>>>>> >>> >> >> >>> >> yes, in our case it is necessary to make sure that new
> >>>>>>> >>> >> >> >>> >> documents are discovered and indexed within a certain
> >>>>>>> >>> >> >> >>> >> interval. I have created a feature request for that. In
> >>>>>>> >>> >> >> >>> >> the meantime we will try to use a scheduled job instead.
> >>>>>>> >>> >> >> >>> >>
> >>>>>>> >>> >> >> >>> >> Thanks for your help,
> >>>>>>> >>> >> >> >>> >> Florian
> >>>>>>> >>> >> >> >>> >>
> >>>>>>> >>> >> >> >>> >>
> >>>>>>> >>> >> >> >>> >> > Hi Florian,
> >>>>>>> >>> >> >> >>> >> >
> >>>>>>> >>> >> >> >>> >> > What you are seeing is "dynamic crawling" behavior.
> >>>>>>> >>> >> >> >>> >> > The time between refetches of a document is based on
> >>>>>>> >>> >> >> >>> >> > the history of fetches of that document.  The recrawl
> >>>>>>> >>> >> >> >>> >> > interval is the initial time between document fetches,
> >>>>>>> >>> >> >> >>> >> > but if a document does not change, the interval for
> >>>>>>> >>> >> >> >>> >> > the document increases according to a formula.
> >>>>>>> >>> >> >> >>> >> >
> >>>>>>> >>> >> >> >>> >> > I would need to look at the code to be able to give
> >>>>>>> >>> >> >> >>> >> > you the precise formula, but if you need a limit on
> >>>>>>> >>> >> >> >>> >> > the amount of time between document fetch attempts, I
> >>>>>>> >>> >> >> >>> >> > suggest you create a ticket and I will look into
> >>>>>>> >>> >> >> >>> >> > adding that as a feature.
> >>>>>>> >>> >> >> >>> >> >
> >>>>>>> >>> >> >> >>> >> > Thanks,
> >>>>>>> >>> >> >> >>> >> > Karl
> >>>>>>> >>> >> >> >>> >> >
> >>>>>>> >>> >> >> >>> >> >
> >>>>>>> >>> >> >> >>> >> >
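
For illustration only, a small Java sketch of an adaptive recrawl interval of
the kind described above, together with the kind of cap Florian is asking for.
The doubling and the ceiling are assumptions for the example; they are not the
actual ManifoldCF formula.

    // Illustrative sketch of a "dynamic" recrawl interval: start at the
    // configured recrawl interval and grow it while the document keeps coming
    // back unchanged.  The doubling and the cap are assumptions, not the
    // actual ManifoldCF formula.
    public class DynamicRecrawlSketch {

      /**
       * @param configuredIntervalMs the job's recrawl interval (initial spacing)
       * @param unchangedFetches     consecutive fetches where the document was unchanged
       * @param maxIntervalMs        ceiling (the feature requested in this thread)
       */
      static long nextRecrawlIntervalMs(long configuredIntervalMs,
                                        int unchangedFetches,
                                        long maxIntervalMs) {
        long interval = configuredIntervalMs << Math.min(unchangedFetches, 16);
        return Math.min(interval, maxIntervalMs);
      }

      public static void main(String[] args) {
        long base = 60L * 60L * 1000L;        // 1 hour configured recrawl interval
        long cap = 24L * 60L * 60L * 1000L;   // requested "maximum recrawl interval"
        for (int i = 0; i <= 5; i++) {
          System.out.println(i + " unchanged fetches -> "
              + nextRecrawlIntervalMs(base, i, cap) / 60000 + " minutes");
        }
      }
    }
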
> >>>>>>> >>> >> >> >>> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <
> >>>>>>> >>> >> >> >>> >> > schmeddi@informatik.uni-freiburg.de> wrote:
> >>>>>>> >>> >> >> >>> >> >
> >>>>>>> >>> >> >> >>> >> >> Hello,
> >>>>>>> >>> >> >> >>> >> >>
> >>>>>>> >>> >> >> >>> >> >> the parameters reseed interval and recrawl interval
> >>>>>>> >>> >> >> >>> >> >> of a continuous crawling job are not quite clear to
> >>>>>>> >>> >> >> >>> >> >> me. The documentation says that the reseed interval
> >>>>>>> >>> >> >> >>> >> >> is the time after which the seeds are checked again,
> >>>>>>> >>> >> >> >>> >> >> and the recrawl interval is the time after which a
> >>>>>>> >>> >> >> >>> >> >> document is checked for changes.
> >>>>>>> >>> >> >> >>> >> >>
> >>>>>>> >>> >> >> >>> >> >> However, we observed that the recrawl interval for a
> >>>>>>> >>> >> >> >>> >> >> document increases after each check. On the other
> >>>>>>> >>> >> >> >>> >> >> hand, the reseed interval seems to be set up correctly
> >>>>>>> >>> >> >> >>> >> >> in the database metadata about the seed documents.
> >>>>>>> >>> >> >> >>> >> >> Yet the web server does not receive a request each
> >>>>>>> >>> >> >> >>> >> >> time the interval elapses, but only after several
> >>>>>>> >>> >> >> >>> >> >> intervals have elapsed.
> >>>>>>> >>> >> >> >>> >> >>
> >>>>>>> >>> >> >> >>> >> >> We are using a web connector. The web server does
> >>>>>>> >>> >> >> >>> >> >> not tell the client to cache the documents. Any help
> >>>>>>> >>> >> >> >>> >> >> would be appreciated.
> >>>>>>> >>> >> >> >>> >> >>
> >>>>>>> >>> >> >> >>> >> >> Best regards,
> >>>>>>> >>> >> >> >>> >> >> Florian
> >>>>>>> >>> >> >> >>> >> >>
> >>>>>>> >>> >> >> >>> >> >>
> >>>>>>> >>> >> >> >>> >> >>
> >>>>>>> >>> >> >> >>> >> >>
> >>>>>>> >>> >> >> >>> >> >
> >>>>>>> >>> >> >> >>> >>
> >>>>>>> >>> >> >> >>> >>
> >>>>>>> >>> >> >> >>> >>
> >>>>>>> >>> >> >> >>> >
> >>>>>>> >>> >> >> >>>
> >>>>>>> >>> >> >> >>>
> >>>>>>> >>> >> >> >>>
> >>>>>>> >>> >> >> >>
> >>>>>>> >>> >> >> >
> >>>>>>> >>> >> >>
> >>>>>>> >>> >> >>
> >>>>>>> >>> >> >>
> >>>>>>> >>> >> >
> >>>>>>> >>> >>
> >>>>>>> >>> >>
> >>>>>>> >>> >>
> >>>>>>> >>> >
> >>>>>>> >>>
> >>>>>>> >>>
> >>>>>>> >>>
> >>>>>>> >>
> >>>>>>> >
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
>
>
