Mailing-List: contact nutch-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: nutch-dev@lucene.apache.org
Message-ID: <20219282.1160794196409.JavaMail.jira@brutus>
Date: Fri, 13 Oct 2006 19:49:56 -0700 (PDT)
From: "Sami Siren (JIRA)" <jira@apache.org>
To: nutch-dev@lucene.apache.org
Subject: [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher
 improvements
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

    [ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12442195 ] 
            
Sami Siren commented on NUTCH-339:
----------------------------------


   [[ Old comment, sent by email on Sun, 06 Aug 2006 08:06:13 +0300 ]]

The original Fetcher is no longer being polite?

Other than that both seem to be working ok based on a very
small crawl I did.

Some thoughts about the design (or perhaps more about how I did it :)

-The FetchQueue implementation could be in its own class (file).

-I also moved the class that handles robots.txt parsing to core.

-I used the existing FibonacciHeap.java (in org.apache.nutch.util) to back 
the fetching queue. The priority I used was the time (in seconds) at which 
one can fetch from that particular site again. You can then use queue.peek 
to see the highest-priority site (the one that should be fetched next), 
check its time and, if needed, read more records from the recordreader.

-I created a new object, Site, that I queued. Each Site contained a list 
of urls to be fetched from that site and a real time (in seconds) when 
one can fetch from it again.

-The queue hid the recordreader, so fetcher threads only had to deal 
with this queue.

-I didn't add any special method for robots.rules to the Protocol interface 
(it's just like any other resource that's going to be fetched). Instead, 
when an http url was read from the recordreader for a site that had not 
been seen earlier, robots.txt was put as a normal resource for that 
site (Site), to be fetched first. When that resource was fetched it 
was advertised to FetchingQueue, which then parsed it and stored it in the 
FetchSite object.

- Also, by using this FetchSite object I could easily implement some 
useful methods, like blocking all urls from a site (for example when 
the hostname cannot be resolved, or connections constantly time out, etc.).
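
The points above can be sketched roughly as follows. This is only an 
illustration under stated assumptions, not the code from the attached 
patches: it uses java.util.PriorityQueue instead of Nutch's FibonacciHeap 
so that it stays self-contained, it leaves out the recordreader and the 
actual robots.txt parsing, and the names FetchQueueSketch, nextFetchTime, 
add, poll and block are made up for the example. It also counts the 
politeness delay from the moment a url is handed out, which is a 
simplification.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical per-site state: pending urls and the earliest next fetch time. */
class FetchSite {
    final String host;
    final Deque<String> urls = new ArrayDeque<>();
    long nextFetchTime;   // seconds; when this site may be fetched again
    boolean blocked;      // set when e.g. the hostname cannot be resolved

    FetchSite(String host, long nextFetchTime) {
        this.host = host;
        this.nextFetchTime = nextFetchTime;
    }
}

/** Hypothetical queue; fetcher threads would only talk to this class. */
class FetchQueueSketch {
    private final long delaySeconds;
    private final Map<String, FetchSite> sites = new HashMap<>();
    // Sites that still have urls, ordered by the time they may be fetched next.
    private final PriorityQueue<FetchSite> ready =
        new PriorityQueue<>(Comparator.comparingLong((FetchSite s) -> s.nextFetchTime));

    FetchQueueSketch(long delaySeconds) {
        this.delaySeconds = delaySeconds;
    }

    /** Queue a url; a site seen for the first time gets its robots.txt queued first. */
    synchronized void add(String host, String url, long now) {
        FetchSite site = sites.get(host);
        if (site == null) {
            site = new FetchSite(host, now);
            sites.put(host, site);
            site.urls.add("http://" + host + "/robots.txt");
            ready.add(site);
        }
        if (site.blocked) {
            return;                       // drop urls for blocked sites
        }
        boolean wasIdle = site.urls.isEmpty();
        site.urls.add(url);
        if (wasIdle) {
            ready.add(site);              // site had left the heap; put it back
        }
    }

    /** Next url that may be fetched at time 'now', or null if no site is due yet. */
    synchronized String poll(long now) {
        FetchSite site = ready.peek();
        if (site == null || site.nextFetchTime > now) {
            return null;
        }
        ready.poll();                     // remove before changing its priority
        String url = site.urls.poll();
        site.nextFetchTime = now + delaySeconds;
        if (!site.urls.isEmpty()) {
            ready.add(site);
        }
        return url;
    }

    /** Block a known site: forget its pending urls and ignore future ones. */
    synchronized void block(String host) {
        FetchSite site = sites.get(host);
        if (site != null) {
            site.blocked = true;
            site.urls.clear();
            ready.remove(site);
        }
    }
}
```

A fetcher thread would call poll(now) in a loop and sleep (or ask for 
more records to be read) whenever it returns null.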

Attached you can find a simple drawing I did earlier about the new 
fetcher I had in mind - just for a reference if my words are confusing :)

--
  Sami Siren


[demime 1.01d removed an attachment of type image/png which had a name of fetcher.png]


> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
>                 Key: NUTCH-339
>                 URL: http://issues.apache.org/jira/browse/NUTCH-339
>             Project: Nutch
>          Issue Type: Task
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: n/a
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>             Fix For: 0.9.0
>
>         Attachments: patch.txt, patch2.txt, patch3.txt
>
>
> As I (and Stefan?) see it there are two major areas where the current fetcher
> could be improved (in terms of speed)
> 1. Politeness code and how it is implemented is the biggest
> problem of current fetcher(together with robots.txt handling).
> Simple code changes, like replacing it with a PriorityQueue
> based solution, showed very promising results in increased IO.
> 2. Changing fetcher to use non blocking io (this requires great amount
> of work as we need to implement the protocols from scratch again).
> I would like to start with working towards #1 by first refactoring
> the current code (plugins actually) in following way:
> 1. Move robots.txt handling away from (lib-http)plugin.
> Even if this is related only to http, leaving it to lib-http
> does not allow other kinds of scheduling strategies to be implemented
> (it is hardcoded to fetch robots.txt from the same thread when requesting
> a page from a site from which it hasn't tried to load robots.txt)
> 2. Move code for politeness away from (lib-http)plugin
> It is really usable outside http and also the current design limits
> changing of the implementation (to queue based)
> Where to move these, well my suggestion is the nutch core, does anybody
> see problems with this?
> These refactoring activities are to be done in a way that none
> of the current functionality is (at least deliberately) changed,
> thus leaving room and the possibility to build the next generation
> fetcher(s) without destroying the old one at the same time.
