nutch-dev mailing list archives

From massimo miccoli <mmicc...@iltrovatore.it>
Subject IlTrovatore check: e' SPAM? Re: [Fwd: Fetch list priority]
Date Mon, 03 Oct 2005 18:10:07 GMT
+1

I have read the paper about OPIC and it seems very good. I think it is a
must for Nutch to have a good (and fast) webgraph-based ranking algorithm.
I have fetched about 250 million pages, and what I see is that inlink
count alone is not good enough for big crawls and quality results.

Thanks,

Massimo

On 29 Sep 2005, at 23:38, Doug Cutting wrote:

> Here's some interesting stuff about OPIC, an easy-to-calculate
> link-based measure of page quality.  I'm going to read the papers, and
> if it is as good as it sounds, perhaps implement this in the mapred
> branch.  Does anyone have experience with OPIC?
>
> -------- Original Message --------
> Subject: Fetch list priority
> Date: Thu, 29 Sep 2005 10:57:31 +0200
> From: Carlos Alberto-Alejandro CASTILLO-Ocaranza
> Organization: Universitat Pompeu Fabra
>
> Hi Doug, I'm ChaTo, developer of the WIRE crawler; we met in Compiegne
> during the OSWIR workshop.
>
> I told you I would contact you about the priorities of the crawler, and
> that there were better strategies than using log(indegree). I suggested
> using OPIC (online page importance computation).
>
> OPIC is described here by Abiteboul et al.:
>
> http://www.citeulike.org/user/ChaTo/article/240858
>
> We did experiments with OPIC in two collections of 2 million pages each,
> and we verified that these collections have the same power-law exponents
> as the full web [I'm attaching a graph of PageRank vs. pages
> downloaded]. Ordering pages by indegree is as bad as random:
>
> http://www.citeulike.org/user/ChaTo/article/240824
>
> http://www.citeulike.org/user/ChaTo/article/240898
>
> Why? Because the crawler tends to focus on a few Web sites. See for
> instance Boldi et al.  "Do your worst to make the best":
>
> http://www.citeulike.org/user/ChaTo/article/240866
>
> =======================================================================
>
> Here is the general idea of OPIC: at the beginning, each page has the
> same score. Let's call it 'opic':
>
>   for all initial pages i:
>      opic[i] = 1;
>
> Whenever you find a link:
>
>   opic[destination] += opic[source] / outdegree[source];
>
> This is it. Abiteboul's paper proves that this converges even in a
> changing graph, and that it is a good estimator of quality. He also
> suggests using the history of a page to keep its opic across crawls,
> but even without the history we have seen that it works quite well.
>
> In your case, what you do in org.apache.nutch.tools.FetchListTool is:
>     ...
>     String[] anchors = dbAnchors.getAnchors(page.getURL());
>     curScore.set(scoreByLinkCount ?
>       (float)Math.log(anchors.length+1) : page.getScore());
>     ...
>
> You need something different, because you will have to read the scores
> of the pages that are pointing to your page. You can do it by (a)
> keeping or reading the scores of the inlinks to each page or (b) doing
> this cycle over the source pages, in the other order:
>
>    for each page P in the webdb:
>      for each outlink of page P, pointing to destination:
>        opic[destination] += opic[P] / outdegree[P];
>
> Note that to make this more effective you must also update the 'opic' of
> the pages you have already crawled, and that I think you should avoid
> self-links.
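
A minimal Java sketch of loop (b) above, offered only as an illustration; it is not Nutch code and not part of the original message. The class name OpicPassSketch, the method onePass, and the in-memory Map standing in for the webdb are all hypothetical. Two assumptions go beyond the pseudocode: self-links are dropped before the outdegree is counted, and source scores are read from a snapshot taken at the start of the pass so that the result does not depend on iteration order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OpicPassSketch {

    /**
     * One pass over a link graph given as page -> list of outlinked pages.
     * Every page starts with opic = 1, then each page P adds
     * opic[P] / outdegree[P] to each page it links to.
     */
    public static Map<String, Double> onePass(Map<String, List<String>> outlinks) {
        // Every known page, whether source or destination, starts with the same score.
        Map<String, Double> opic = new HashMap<>();
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            opic.putIfAbsent(e.getKey(), 1.0);
            for (String dest : e.getValue()) {
                opic.putIfAbsent(dest, 1.0);
            }
        }

        // Assumption: read source scores from a snapshot, write into 'opic'.
        Map<String, Double> before = new HashMap<>(opic);

        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            String source = e.getKey();

            // Assumption: self-links are ignored and do not count toward outdegree.
            List<String> targets = new ArrayList<>();
            for (String dest : e.getValue()) {
                if (!dest.equals(source)) {
                    targets.add(dest);
                }
            }
            if (targets.isEmpty()) {
                continue; // nothing to distribute
            }

            double share = before.get(source) / targets.size();
            for (String dest : targets) {
                opic.merge(dest, share, Double::sum);
            }
        }
        return opic;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("a", List.of("b", "c", "a")); // the a -> a self-link is skipped
        graph.put("b", List.of("c"));
        graph.put("c", List.of());
        // After one pass: a = 1.0, b = 1.5, c = 2.5
        System.out.println(onePass(graph));
    }
}
```

A real implementation would stream pages from the webdb rather than hold the graph in memory, and would carry scores over between passes instead of resetting them to 1.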
>
> The 'opic' scores will also be statistically distributed according to a
> power law, so it's sensible to use log(opic) when combining this with
> other scores with a different distribution, such as text similarity.
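
One way the log damping suggested above might look in Java, again only as a hedged sketch outside the original message. The class ScoreCombinerSketch, the method combine, the linkWeight parameter, and the linear mix are hypothetical choices, not something the message or Nutch prescribes. The sketch also uses log(1 + opic) rather than log(opic), mirroring the log(n + 1) form in the FetchListTool excerpt so that a score of 0 stays finite.

```java
public class ScoreCombinerSketch {

    /**
     * Mixes a power-law distributed link score with a text similarity score.
     *
     * @param opic           raw OPIC score, assumed to be >= 0
     * @param textSimilarity text score, assumed to be in a bounded range such as [0, 1]
     * @param linkWeight     hypothetical mixing weight in [0, 1]
     */
    public static double combine(double opic, double textSimilarity, double linkWeight) {
        // log1p(x) = log(1 + x): compresses the long tail of the opic distribution.
        double linkScore = Math.log1p(opic);
        // Assumption: a simple linear mix of the two scores.
        return linkWeight * linkScore + (1.0 - linkWeight) * textSimilarity;
    }

    public static void main(String[] args) {
        // A 100x difference in raw opic becomes a much smaller gap after the log.
        System.out.println(combine(10.0, 0.5, 0.5));
        System.out.println(combine(1000.0, 0.5, 0.5));
    }
}
```

Without the logarithm, a handful of very high opic values would swamp the text score for every query.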
>
> ========================================================================
>
> I hope this is useful for you.
>
> All the best,
>
> -- 
> ChaTo    = Carlos Alberto-Alejandro CASTILLO-Ocaranza, PhD
>
>

