manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: [jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling
Date Thu, 13 Feb 2014 12:47:34 GMT
Here's the algorithm that MCF uses to calculate when to refetch a document
in dynamic crawling.

First, it keeps track, over all time, of the first time the document was
fetched, the last time it was fetched, and the number of changes that took
place in between, and uses these to estimate the average time between
changes.  When the document changes, this estimate is of course affected,
but not necessarily by much if the document had a long period of
stability.  (If you want to make this history go away for a document, you
can click the "reindex all documents" link on the output connection's view
page.  That causes MCF to forget everything about what's been indexed
before.)
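
To illustrate why one new change barely moves the estimate after a long
stable period, here is a minimal sketch.  It is not MCF's actual code: the
class name, the method name, and the simple "elapsed time divided by number
of changes" formula are all assumptions made for illustration, since only
the inputs are described above.

```java
// Hypothetical sketch; NOT ManifoldCF's actual implementation.
// The formula (elapsed time / number of changes) is an assumption.
final class ChangeIntervalSketch
{
  /**
   * Estimates the average time between changes, in milliseconds.
   * Returns null when no change has been observed yet (an assumption
   * of this sketch; the real behavior may differ).
   */
  static Long estimateAverageChangeInterval(long firstFetchTime, long lastFetchTime, int changeCount)
  {
    if (changeCount <= 0 || lastFetchTime <= firstFetchTime)
      return null;
    return Long.valueOf((lastFetchTime - firstFetchTime) / changeCount);
  }

  public static void main(String[] args)
  {
    long day = 24L * 60L * 60L * 1000L;
    // 100 days of history with 2 changes: prints 50
    System.out.println(estimateAverageChangeInterval(0L, 100L * day, 2).longValue() / day);
    // One more change a day later: 101 days / 3 changes, prints 33
    System.out.println(estimateAverageChangeInterval(0L, 101L * day, 3).longValue() / day);
  }
}
```

Under these assumptions, a single extra change after 100 days of history
only lowers the estimate from 50 days to about 33 days, rather than
dropping it to one day.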

The actual time determined for the next fetch is calculated here:

{code}
    public Long calculateDocumentRescheduleTime(long currentTime, long timeAmt, String localIdentifier)
    {
      Long recrawlTime = null;
      Long recrawlInterval = job.getInterval();
      if (recrawlInterval != null)
      {
        Long maxInterval = job.getMaxInterval();
        long actualInterval = recrawlInterval.longValue() + timeAmt;
        if (maxInterval != null && actualInterval > maxInterval.longValue())
          actualInterval = maxInterval.longValue();
        recrawlTime = new Long(currentTime + actualInterval);
      }
      if (Logging.scheduling.isDebugEnabled())
        Logging.scheduling.debug("Default rescan time for document '"+localIdentifier+"' is "+((recrawlTime==null)?"NEVER":recrawlTime.toString()));
      Long lowerBound = getDocumentRescheduleLowerBoundTime(localIdentifier);
      if (lowerBound != null)
      {
        if (recrawlTime == null || recrawlTime.longValue() < lowerBound.longValue())
        {
          recrawlTime = lowerBound;
          if (Logging.scheduling.isDebugEnabled())
            Logging.scheduling.debug(" Rescan time overridden for document '"+localIdentifier+"' due to lower bound; new value is "+recrawlTime.toString());
        }
      }
      Long upperBound = getDocumentRescheduleUpperBoundTime(localIdentifier);
      if (upperBound != null)
      {
        if (recrawlTime == null || recrawlTime.longValue() > upperBound.longValue())
        {
          recrawlTime = upperBound;
          if (Logging.scheduling.isDebugEnabled())
            Logging.scheduling.debug(" Rescan time overridden for document '"+localIdentifier+"' due to upper bound; new value is "+recrawlTime.toString());
        }
      }
      return recrawlTime;
    }

{code}

As you can see, both the estimated average time between changes (timeAmt)
and the time bounds set by the connector go into the calculation.  The
job's minimum recrawl interval (job.getInterval()) and maximum recrawl
interval (job.getMaxInterval()) matter as well.  The key part of the
calculation is as follows:

{code}
        Long maxInterval = job.getMaxInterval();
        long actualInterval = recrawlInterval.longValue() + timeAmt;
        if (maxInterval != null && actualInterval > maxInterval.longValue())
          actualInterval = maxInterval.longValue();
        recrawlTime = new Long(currentTime + actualInterval);
{code}

The actual interval chosen is the job's minimum recrawl interval, plus the
average time between changes for the document, capped by the job's maximum
recrawl interval.
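
To make the arithmetic concrete, here is a small standalone sketch of the
same "minimum plus average, capped by maximum" rule.  The class and method
names are hypothetical, and the hour values are made-up inputs; only the
calculation itself mirrors the snippet above.

```java
// Hypothetical standalone sketch of the interval rule quoted above.
// Names and sample values are illustrative, not part of ManifoldCF.
final class RecrawlIntervalSketch
{
  /**
   * Returns minInterval + averageChangeInterval, capped at maxInterval
   * when a maximum is configured (maxInterval may be null).
   */
  static long actualInterval(long minInterval, long averageChangeInterval, Long maxInterval)
  {
    long interval = minInterval + averageChangeInterval;
    if (maxInterval != null && interval > maxInterval.longValue())
      interval = maxInterval.longValue();
    return interval;
  }

  public static void main(String[] args)
  {
    long hour = 60L * 60L * 1000L;
    // 1h minimum + 5h average, capped at 4h: prints 4
    System.out.println(actualInterval(hour, 5L * hour, Long.valueOf(4L * hour)) / hour);
    // Same inputs with no maximum configured: prints 6
    System.out.println(actualInterval(hour, 5L * hour, null) / hour);
    // 1h minimum + 2h average stays under the 4h cap: prints 3
    System.out.println(actualInterval(hour, 2L * hour, Long.valueOf(4L * hour)) / hour);
  }
}
```

The result is then added to the current time, and the connector's lower and
upper bounds (if any) can still override it, as in the full method above.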

Hope that clarifies things.



On Thu, Feb 13, 2014 at 7:16 AM, Karl Wright (JIRA) <jira@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900271#comment-13900271]
>
> Karl Wright commented on CONNECTORS-850:
> ----------------------------------------
>
> Anything that you change for the job that affects what is indexed.  For
> example, forced metadata, Solr variable mapping, etc. all will cause a
> reindex to take place.
>
> If you want to get specific, there are two version strings, one for the
> repository connection, and another for the output connection.  It's up to
> the connector what to put in them.  For the web connector, the document's
> metadata (from ALL header fields), URL mappings (if any), and a checksum of
> the content goes into it.  For the solr output connector, metadata and some
> kinds of configuration information go into it.
>
> If this isn't making any sense in your case, I suppose you can debug it --
> or look in the database at some version fields to see what is changing.
>
> > Maximum interval in dynamic crawling
> > ------------------------------------
> >
> >                 Key: CONNECTORS-850
> >                 URL: https://issues.apache.org/jira/browse/CONNECTORS-850
> >             Project: ManifoldCF
> >          Issue Type: New Feature
> >          Components: Framework crawler agent
> >    Affects Versions: ManifoldCF 1.4.1
> >            Reporter: Florian Schmedding
> >            Assignee: Karl Wright
> >            Priority: Minor
> >              Labels: features
> >             Fix For: ManifoldCF 1.5
> >
> >
> > Currently, the dynamic crawling method used for a continuous job extends
> > the reseed and recrawl intervals when no changes are found in a checked
> > document. However, it should be possible to restrict this extension to a
> > maximum value in order to make sure that new documents are discovered
> > within a certain interval.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.1.5#6160)
>
