manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Continuous crawling
Date Wed, 15 Jan 2014 19:01:21 GMT
Hi Florian,

Based on this schedule, your crawls are eligible to start every hour, on
the hour.  If the last crawl crossed an hour boundary, the next crawl
will start immediately, I believe.
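[Editor's note: a minimal sketch of the window logic described above. The
function names and the start-eligibility rule are illustrative assumptions,
not the actual ManifoldCF code.]

```python
from datetime import datetime

# With every hour enabled, a new schedule window opens at the top of
# each hour; a job that finished before the current window opened is
# eligible to start again immediately.
ALLOWED_START_HOURS = set(range(24))  # "any day at 12 am ... 11 pm"

def window_start(now: datetime) -> datetime:
    """The current window opens at the top of the hour."""
    return now.replace(minute=0, second=0, microsecond=0)

def should_start(now: datetime, last_finished: datetime) -> bool:
    # At most one start per window: if the last run crossed an hour
    # boundary, last_finished falls after an earlier window's start,
    # so the job becomes eligible as soon as it finishes.
    if now.hour not in ALLOWED_START_HOURS:
        return False
    return last_finished < window_start(now)
```

For example, a crawl finishing at 12:45 makes the job eligible again once the 1 pm window opens, which is why a run crossing an hour boundary restarts right away.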

Karl



On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <
schmeddi@informatik.uni-freiburg.de> wrote:

> Hi Karl,
>
> these are the values:
> Priority:                  5
> Start method:              Start at beginning of schedule window
> Schedule type:             Scan every document once
> Minimum recrawl interval:  Not applicable
> Expiration interval:       Not applicable
> Reseed interval:           Not applicable
> Scheduled time:            Any day of week at 12 am, 1 am, 2 am, 3 am,
>                            4 am, 5 am, 6 am, 7 am, 8 am, 9 am, 10 am,
>                            11 am, 12 pm, 1 pm, 2 pm, 3 pm, 4 pm, 5 pm,
>                            6 pm, 7 pm, 8 pm, 9 pm, 10 pm, 11 pm
> Maximum run time:          No limit
> Job invocation:            Complete
>
> Maybe it is because I changed the job from continuous crawling to this
> schedule. I also started it a few times manually. I didn't notice
> anything strange in the job setup or in the respective entries in the
> database.
>
> Regards,
> Florian
>
> > Hi Florian,
> >
> > I was unable to reproduce the behavior you described.
> >
> > Could you view your job, and post a screen shot of that page?  I want to
> > see what your schedule record(s) look like.
> >
> > Thanks,
> > Karl
> >
> >
> >
> > On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright <daddywri@gmail.com> wrote:
> >
> >> Hi Florian,
> >>
> >> I've never noted this behavior before.  I'll see if I can reproduce it
> >> here.
> >>
> >> Karl
> >>
> >>
> >>
> >> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <
> >> schmeddi@informatik.uni-freiburg.de> wrote:
> >>
> >>> Hi Karl,
> >>>
> >>> the scheduled job seems to work as expected. However, it runs two
> >>> times: it starts at the beginning of the scheduled time, finishes,
> >>> and immediately starts again. After finishing the second run it
> >>> waits for the next scheduled time. Why does it run twice? The start
> >>> method is "Start at beginning of schedule window".
> >>>
> >>> Yes, you're right about the checking guarantee. Currently, our interval
> >>> is
> >>> long enough for a complete crawler run.
> >>>
> >>> Best,
> >>> Florian
> >>>
> >>>
> >>> > Hi Florian,
> >>> >
> >>> > It is impossible to *guarantee* that a document will be checked,
> >>> > because if load on the crawler is high enough, it will fall
> >>> > behind.  But I will look into adding the feature you request.
> >>> >
> >>> > Karl
> >>> >
> >>> >
> >>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <
> >>> > schmeddi@informatik.uni-freiburg.de> wrote:
> >>> >
> >>> >> Hi Karl,
> >>> >>
> >>> >> yes, in our case it is necessary to make sure that new documents
> >>> >> are discovered and indexed within a certain interval. I have
> >>> >> created a feature request on that. In the meantime we will try to
> >>> >> use a scheduled job instead.
> >>> >>
> >>> >> Thanks for your help,
> >>> >> Florian
> >>> >>
> >>> >>
> >>> >> > Hi Florian,
> >>> >> >
> >>> >> > What you are seeing is "dynamic crawling" behavior.  The time
> >>> >> > between refetches of a document is based on the history of
> >>> >> > fetches of that document.  The recrawl interval is the initial
> >>> >> > time between document fetches, but if a document does not
> >>> >> > change, the interval for the document increases according to a
> >>> >> > formula.
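[Editor's note: as an illustration of the dynamic backoff described above,
a sketch might look like the following. The growth factor, the cap, and
the function name are assumptions for illustration; the real formula is
in the ManifoldCF source.]

```python
def next_recrawl_interval(current_interval_ms: int,
                          document_changed: bool,
                          base_interval_ms: int = 3_600_000,    # 1 hour
                          growth_factor: float = 2.0,
                          max_interval_ms: int = 86_400_000) -> int:
    """Illustrative dynamic-recrawl backoff.

    If the document changed since the last fetch, fall back to the base
    recrawl interval; otherwise grow the interval multiplicatively, up
    to a cap.  All parameter values here are hypothetical.
    """
    if document_changed:
        return base_interval_ms
    return min(int(current_interval_ms * growth_factor), max_interval_ms)
```

Under this sketch an unchanging document is fetched at 1 h, 2 h, 4 h, ... intervals, which matches the observation that the recrawl interval for a document increases after each check.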
> >>> >> >
> >>> >> > I would need to look at the code to be able to give you the
> >>> >> > precise formula, but if you need a limit on the amount of time
> >>> >> > between document fetch attempts, I suggest you create a ticket
> >>> >> > and I will look into adding that as a feature.
> >>> >> >
> >>> >> > Thanks,
> >>> >> > Karl
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <
> >>> >> > schmeddi@informatik.uni-freiburg.de> wrote:
> >>> >> >
> >>> >> >> Hello,
> >>> >> >>
> >>> >> >> the parameters reseed interval and recrawl interval of a
> >>> >> >> continuous crawling job are not quite clear to me. The
> >>> >> >> documentation says that the reseed interval is the time after
> >>> >> >> which the seeds are checked again, and the recrawl interval is
> >>> >> >> the time after which a document is checked for changes.
> >>> >> >>
> >>> >> >> However, we observed that the recrawl interval for a document
> >>> >> >> increases after each check. On the other hand, the reseed
> >>> >> >> interval seems to be set up correctly in the database metadata
> >>> >> >> about the seed documents. Yet the web server does not receive
> >>> >> >> requests each time the interval elapses but only after several
> >>> >> >> intervals have elapsed.
> >>> >> >>
> >>> >> >> We are using a web connector. The web server does not tell the
> >>> >> >> client to cache the documents. Any help would be appreciated.
> >>> >> >>
> >>> >> >> Best regards,
> >>> >> >> Florian
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >
> >>> >>
> >>> >>
> >>> >>
> >>> >
> >>>
> >>>
> >>>
> >>
> >
>
>
>
