Message-ID: <20219282.1160794196409.JavaMail.jira@brutus>
Date: Fri, 13 Oct 2006 19:49:56 -0700 (PDT)
From: "Sami Siren (JIRA)"
To: nutch-dev@lucene.apache.org
Subject: [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

    [ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12442195 ]

Sami Siren commented on NUTCH-339:
----------------------------------

[[ Old comment, sent by email on Sun, 06 Aug 2006 08:06:13 +0300 ]]

The original Fetcher is no longer being
polite? Other than that, both seem to be working OK based on a very small crawl I did.

Some thoughts about the design (or perhaps more about how I did it :)

-the FetchQueue implementation could be in its own class (file).
-I also moved the class that handles robots parsing to core.
-I used the existing FibonacciHeap.java (in org.apache.nutch.util) to back the fetching queue. The priority I used was the time (in seconds) at which one can again fetch from that particular site. You can then use queue.peek to see the highest priority site (the one that should be fetched next), check its time and, if needed, read more records from the recordreader.
-I created a new object, Site, that I queued. Those objects contained a list of urls to be fetched from that site and a real time (in seconds) when one can fetch from it again.
-The queue hid the recordreader, so fetcher threads only had to deal with this queue.
-I didn't add any special method for robots.rules to the Protocol interface; robots.txt is just like any other resource that's going to be fetched. Instead, when an http url was read from the recordreader for a site whose robots.txt had not been seen yet, robots.txt was put in as a normal resource for that site (Site), to be fetched before the other urls. When that resource was fetched it was announced to the FetchingQueue, which then parsed it and stored it in the FetchSite object.
-Also, by using this FetchSite object I could easily implement some useful methods, like blocking all urls from a site (for example when the hostname cannot be resolved, or connections constantly time out, etc.).
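The queue described in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patches: java.util.PriorityQueue stands in for FibonacciHeap, the names FetchQueueSketch, add, next and block are hypothetical, the delay is a single fixed value, and parsing of the fetched robots.txt is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical sketch of a per-site politeness queue (not the patch code). */
public class FetchQueueSketch {

    /** A host, its pending urls and the earliest time (seconds) it may be fetched again. */
    static final class Site implements Comparable<Site> {
        final String host;
        final Deque<String> urls = new ArrayDeque<>();
        long nextFetchTime;   // seconds; 0 means "may be fetched right away"
        boolean queued;       // true while this site sits in the priority queue
        boolean blocked;      // e.g. hostname cannot be resolved, constant timeouts

        Site(String host) { this.host = host; }

        @Override
        public int compareTo(Site other) {
            return Long.compare(nextFetchTime, other.nextFetchTime);
        }
    }

    private final PriorityQueue<Site> queue = new PriorityQueue<>();
    private final Map<String, Site> sites = new HashMap<>();
    private final long delaySeconds;   // assumed fixed delay between fetches per site

    public FetchQueueSketch(long delaySeconds) { this.delaySeconds = delaySeconds; }

    /** Adds a url; the first url seen for a host is preceded by that host's robots.txt. */
    public void add(String host, String url) {
        Site site = sites.get(host);
        if (site == null) {
            site = new Site(host);
            sites.put(host, site);
            site.urls.add("http://" + host + "/robots.txt");
        }
        if (site.blocked) return;
        site.urls.add(url);
        if (!site.queued) {
            queue.add(site);
            site.queued = true;
        }
    }

    /** Returns the next url that may be fetched at time now, or null if no site is due. */
    public String next(long now) {
        Site site = queue.peek();
        if (site == null || site.nextFetchTime > now) return null;
        queue.poll();   // remove before changing the priority, then re-insert
        String url = site.urls.poll();
        site.nextFetchTime = now + delaySeconds;
        if (site.urls.isEmpty()) {
            site.queued = false;
        } else {
            queue.add(site);
        }
        return url;
    }

    /** Drops all pending urls of a host and ignores any urls added for it later. */
    public void block(String host) {
        Site site = sites.get(host);
        if (site == null) return;
        site.blocked = true;
        site.urls.clear();
        queue.remove(site);
        site.queued = false;
    }
}
```

A fetcher thread would call next(now) in a loop and sleep or read more records when it returns null. Because a site is always polled before its nextFetchTime changes and re-added afterwards, the heap order stays valid.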
Attached you can find a simple drawing I did earlier of the new fetcher I had in mind, just for reference if my words are confusing :)

--
 Sami Siren

[demime 1.01d removed an attachment of type image/png which had a name of fetcher.png]

> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
>                 Key: NUTCH-339
>                 URL: http://issues.apache.org/jira/browse/NUTCH-339
>             Project: Nutch
>          Issue Type: Task
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: n/a
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>             Fix For: 0.9.0
>
>         Attachments: patch.txt, patch2.txt, patch3.txt
>
>
> As I (and Stefan?) see it there are two major areas where the current fetcher could be
> improved (as in speed)
> 1. Politeness code and how it is implemented is the biggest
> problem of the current fetcher (together with robots.txt handling).
> Simple code changes like replacing it with a PriorityQueue
> based solution showed very promising results in increased IO.
> 2. Changing the fetcher to use non blocking io (this requires a great amount
> of work as we need to implement the protocols from scratch again).
> I would like to start by working towards #1, first refactoring
> the current code (plugins actually) in the following way:
> 1. Move robots.txt handling away from the (lib-http) plugin.
> Even if this is related only to http, leaving it in lib-http
> does not allow other kinds of scheduling strategies to be implemented
> (it is hardcoded to fetch robots.txt from the same thread when requesting
> a page from a site from which it hasn't tried to load robots.txt)
> 2. Move code for politeness away from the (lib-http) plugin.
> It is really usable outside http and also the current design limits
> changing of the implementation (to queue based)
> Where to move these, well my suggestion is the nutch core, does anybody
> see problems with this?
> These code refactoring activities are to be done in a way that none
> of the current functionality is (at least deliberately) changed,
> thus leaving room and possibility to build
> the next generation fetcher(s) without destroying the old one at the same time.