From: massimo miccoli
To: nutch-dev@lucene.apache.org
Subject: IlTrovatore check: e' SPAM? Re: [Fwd: Fetch list priority]
Date: Mon, 3 Oct 2005 20:10:07 +0200
Message-Id: <6BDAE100-DDD7-4622-AB5F-537A7EAFF61D@iltrovatore.it>
In-Reply-To: <433C5EDE.8000500@nutch.org>

+1

I have read the paper about OPIC and it seems very good. I think it is a
must for Nutch to have a good (and fast) webgraph-based ranking
algorithm. I have fetched about 250 million pages, and what I see is
that the inlink count alone is not good enough for big crawls and
quality results.

Thanks,
Massimo

On 29 Sep 2005, at 23:38, Doug Cutting wrote:

> Here's some interesting stuff about OPIC, an easy-to-calculate
> link-based measure of page quality. I'm going to read the papers, and
> if it is as good as it sounds, perhaps implement this in the mapred
> branch. Does anyone have experience with OPIC?
>
> -------- Original Message --------
> Subject: Fetch list priority
> Date: Thu, 29 Sep 2005 10:57:31 +0200
> From: Carlos Alberto-Alejandro CASTILLO-Ocaranza
> Organization: Universitat Pompeu Fabra
>
> Hi Doug, I'm ChaTo, developer of the WIRE crawler; we met in Compiegne
> during the OSWIR workshop.
>
> I told you I would contact you about the priorities of the crawler, and
> that there were better strategies than using log(indegree). I suggested
> using OPIC (online page importance computation).
>
> OPIC is described here by Abiteboul et al.:
>
> http://www.citeulike.org/user/ChaTo/article/240858
>
> We did experiments with OPIC in two collections of 2 million pages
> each, and we verified that these collections have the same power-law
> exponents as the full web [I'm attaching a graph of Pagerank vs pages
> downloaded]. Ordering pages by indegree is as bad as random:
>
> http://www.citeulike.org/user/ChaTo/article/240824
>
> http://www.citeulike.org/user/ChaTo/article/240898
>
> Why? Because the crawler tends to focus on a few Web sites. See for
> instance Boldi et al. "Do your worst to make the best":
>
> http://www.citeulike.org/user/ChaTo/article/240866
>
> ======================================================================
>
> Here is the general idea of OPIC: at the beginning, each page has the
> same score. Let's call it 'opic':
>
>     for all initial pages i:
>         opic[i] = 1;
>
> Whenever you find a link:
>
>     opic[destination] += opic[source] / outdegree[source];
>
> That's it. Abiteboul's paper proves that this converges even in a
> changing graph, and that it is a good estimator of quality. He also
> suggests using the history of a page to keep its opic across crawls,
> but even without the history we have seen that it works quite well.
>
> In your case, what you do in org.apache.nutch.tools.FetchListTool is:
>
>     ...
>     String[] anchors = dbAnchors.getAnchors(page.getURL());
>     curScore.set(scoreByLinkCount ?
>         (float)Math.log(anchors.length+1) : page.getScore());
>     ...
>
> You need something different, because you will have to read the scores
> of the pages that are pointing to your page.
> You can do it by (a) keeping or reading the scores of the inlinks to
> each page, or (b) running this cycle over the source pages in the
> other order:
>
>     for each page P in the webdb:
>         for each outlink in page P:
>             opic[destination] += opic[P] / outdegree[P];
>
> Note that to make this more effective you must also update the 'opic'
> of the pages you already crawled, and that I think you should avoid
> self-links.
>
> The 'opic' scores will also be statistically distributed according to
> a power law, so it's sensible to use log(opic) when combining this with
> other scores with a different distribution, such as text similarity.
>
> ======================================================================
>
> I hope this is useful for you.
>
> All the best,
>
> --
> ChaTo = Carlos Alberto-Alejandro CASTILLO-Ocaranza, PhD
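
For illustration, variant (b) from the quoted message could be sketched in Java roughly as follows. This is only a minimal sketch of the simplified update rule given in the message, not Nutch code and not the full algorithm from Abiteboul's paper. The class name OpicSketch, the methods opicPass and logScore, and the in-memory Map standing in for the webdb are all hypothetical. It assumes that self-links are dropped before the outdegree is counted, and that pages are visited in the map's iteration order, which influences the result of a single in-place pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OpicSketch {

    /**
     * One in-place pass of the simplified rule
     *   opic[destination] += opic[P] / outdegree[P]
     * over a link graph. The map (source URL -> outlink URLs) is a
     * hypothetical stand-in for the webdb.
     */
    static Map<String, Double> opicPass(Map<String, List<String>> outlinks) {
        Map<String, Double> opic = new LinkedHashMap<>();

        // Every known page, source or destination, starts with score 1.
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            opic.putIfAbsent(e.getKey(), 1.0);
            for (String dest : e.getValue()) {
                opic.putIfAbsent(dest, 1.0);
            }
        }

        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            String source = e.getKey();

            // Assumption: self-links are ignored, and the outdegree is
            // counted after removing them.
            List<String> targets = new ArrayList<>();
            for (String dest : e.getValue()) {
                if (!dest.equals(source)) {
                    targets.add(dest);
                }
            }
            if (targets.isEmpty()) {
                continue; // nothing to distribute
            }

            // Uses the source's current score, which may already include
            // contributions from pages visited earlier in this pass.
            double share = opic.get(source) / targets.size();
            for (String dest : targets) {
                opic.merge(dest, share, Double::sum);
            }
        }
        return opic;
    }

    /** log(opic), for combining with scores that follow other distributions. */
    static double logScore(double opic) {
        return Math.log(opic);
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new LinkedHashMap<>();
        graph.put("a", Arrays.asList("b", "c"));
        graph.put("b", Arrays.asList("c"));
        graph.put("c", Arrays.asList("a", "c")); // contains a self-link
        System.out.println(opicPass(graph));
    }
}
```

With the three-page graph in main, a single pass gives a=4.0, b=1.5, c=3.0: page a hands 0.5 each to b and c, page b then hands its 1.5 to c, and page c hands its 3.0 to a once its self-link is dropped.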