manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: How to add task to queue dynamically (WebCrawler)
Date Tue, 05 Apr 2011 15:53:26 GMT
Hi Fuad,

Ok, so this is for politeness?
I am sure you've looked at what the RSS and Web connectors do to
enforce politeness constraints.  As you probably know, the framework
has the ability to throttle all connections using AVERAGE fetch rate
throttling (see the "Throttling" tab for the connection).  But if you
need to make sure you do not exceed a MAXIMUM rate, the standard approach is to
adopt logic similar to that used by the RSS and Web connectors, which
limit connection count as well as maximum fetch rate by way of
connector-based throttling.

I suppose that you may not like the Thread.sleep() you see in the
throttling code in the RSS and Web connectors.  Since these connectors
are throttling max connections as well as maximum fetch rate, it was
not possible in all cases to avoid Thread.sleep().  But I can see a
case for trying to control scheduling of documents for the purposes of
enforcing a maximum fetch rate alone.
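
To make that concrete, here is a rough sketch of the kind of per-host
maximum-rate throttle I mean (illustrative only; the class and method names
are invented, and this is not the actual framework or connector code):

import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only: not ManifoldCF framework or connector code.
*   Enforces a minimum interval between fetches against the same host,
*   which is equivalent to capping the maximum fetch rate. */
public class MaxFetchRateThrottle
{
  private final long minIntervalMs;
  private final Map<String,Long> lastFetchTime = new HashMap<String,Long>();

  public MaxFetchRateThrottle(long minIntervalMs)
  {
    this.minIntervalMs = minIntervalMs;
  }

  /** Block the calling worker thread until a fetch against the host is allowed. */
  public void waitForFetchSlot(String host)
    throws InterruptedException
  {
    while (true)
    {
      long sleepMs;
      synchronized (this)
      {
        Long last = lastFetchTime.get(host);
        long now = System.currentTimeMillis();
        if (last == null || now - last.longValue() >= minIntervalMs)
        {
          // Record this fetch and proceed immediately.
          lastFetchTime.put(host,new Long(now));
          return;
        }
        sleepMs = last.longValue() + minIntervalMs - now;
      }
      // This is the Thread.sleep() mentioned above; it ties up the worker
      // thread for the duration of the wait.
      Thread.sleep(sleepMs);
    }
  }
}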

In order for that to work, you'd need connector control over the
schedule for every way a document can be added to the job queue.  The
addDocumentReference() method is only one such case; you'd also want
similar functionality for addSeedDocuments().  I'd suggest creating a
ticket for this change to the API.  FWIW, I don't think this is a big
win for either Web or RSS crawling, since all that the Thread.sleep()
does is reduce (slightly) the number of available threads, so I'd
prioritize it accordingly.
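
As a sketch of what such a ticket might propose (hypothetical only; these
overloads do not exist in the current IProcessActivity or ISeedingActivity
interfaces, and the interface name below is made up), the extended methods
could look something like this:

import org.apache.manifoldcf.core.interfaces.ManifoldCFException;

/** Hypothetical sketch only: the scheduledTime parameter (milliseconds
*   since epoch) illustrates the kind of addition a ticket might propose. */
public interface IScheduledQueueingActivity
{
  /** Queue a discovered document, requesting that it not be processed
  * before the given time. */
  public void addDocumentReference(String documentIdentifier,
    String parentIdentifier, String relationshipType, long scheduledTime)
    throws ManifoldCFException;

  /** Seed a document with the same earliest-processing-time hint. */
  public void addSeedDocument(String documentIdentifier, long scheduledTime)
    throws ManifoldCFException;
}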

Karl


On Tue, Apr 5, 2011 at 11:23 AM, Fuad Efendi <fuad@efendi.ca> wrote:
> Hi Karl,
>
> I need to crawl a sequence of (different) URLs from the same host, and each
> URL defines the next one to be crawled; I can crawl the next URL only after a
> specified amount of time. The URLs are different... of course I can call
> Thread.sleep() before calling
> activities.addDocumentReference(newUrl), but that seems too naïve...
> And this use case is very similar to a generic Web crawl (where we need to be
> polite: a 2-3 second delay before fetching again from the same domain).
>
>
> -----Original Message-----
> From: Karl Wright [mailto:daddywri@gmail.com]
> Sent: April-05-11 11:06 AM
> To: connectors-user@incubator.apache.org
> Subject: Re: How to add task to queue dynamically (WebCrawler)
>
> If you are trying to control the schedule for the FIRST time a document is
> fetched, the IProcessActivity API doesn't permit that at this time.  You
> would need to add a new version of
> addDocumentReference() to the IProcessActivity interface, which would allow you
> to set the scheduled processing time in addition to everything else.  The
> internals for such a change should be straightforward since all the moving
> parts are already there.
>
> I'm curious, however, about your use case.  It is currently unheard of for
> connectors to try to control the scheduling of all documents being fetched -
> this would interfere with ManifoldCF's scheduling algorithms, which are
> designed for maximum throughput.  I'd like to be sure your design makes
> sense before I agree that this is a reasonable addition to the API.  Can you
> explain the connector and its design so that I can see what you are trying
> to accomplish?
>
> Thanks!
> Karl
>
> On Tue, Apr 5, 2011 at 10:51 AM, Fuad Efendi <fuad@efendi.ca> wrote:
>>
>> Hi Karl,
>>
>> So this is "retry"... can we schedule document retrieval? I retrieve
>> XML, generate a new URL, and I want to schedule this new document to be
>> retrieved at a specific time.
>> -Fuad
>>
>>
>
>
