nutch-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: modifying inbound link text calc
Date Mon, 15 May 2006 15:08:29 GMT
Insurance Squared Inc. wrote:
> I'm trying to get rid of some spammy sites in our index.
> First, I wonder if anyone has any suggestions on changes to the 
> default install config of Nutch that will help drive better sites to 
> the top and spammier sites down.

What is a "better" site? The more precisely you can define this, the 
clearer the indications you will get of how to improve the quality of 
your results.

>
> Secondly, I boosted the inbound anchor text config - but if anything 
> that made things worse.  A lot of the spammier sites heavily use 
> search terms in their internal anchors.  So I'm wondering - is there 
> any easy way to distinguish between anchor text from within the same 
> domain vs. anchor text from external domains, and give them different 
> weightings?  I expect this isn't the case currently - anyone have any 
> opinions on how difficult this would be to change?

The scoring API  (just committed) gives you this option. Please see 
ScoringFilter's method indexerScore.
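
As a rough illustration of the idea (not the ScoringFilter API itself), here is a standalone sketch of the weighting logic that a custom scoring plugin might call from its indexerScore implementation. The class AnchorWeighting, the method anchorBoost and the two weight constants are hypothetical names and values chosen for this example. It also assumes that comparing host names is a good enough approximation of "same domain", so www.example.com and example.com would count as different sites.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: weights each inbound link by
// whether the linking page is on the same host as the target page.
final class AnchorWeighting {
    // Illustrative weights (assumptions; tune them for your own index).
    static final float INTERNAL_WEIGHT = 0.1f;
    static final float EXTERNAL_WEIGHT = 1.0f;

    /** Returns the lower-cased host of a URL, or "" if it cannot be parsed. */
    static String host(String url) {
        try {
            String h = URI.create(url).getHost();
            return h == null ? "" : h.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return "";
        }
    }

    /** Sums per-link weights, discounting links that come from the target's own host. */
    static float anchorBoost(String targetUrl, List<String> inlinkSourceUrls) {
        String targetHost = host(targetUrl);
        float boost = 0f;
        for (String from : inlinkSourceUrls) {
            boolean internal = !targetHost.isEmpty() && targetHost.equals(host(from));
            boost += internal ? INTERNAL_WEIGHT : EXTERNAL_WEIGHT;
        }
        return boost;
    }

    public static void main(String[] args) {
        List<String> inlinks = Arrays.asList(
                "http://example.com/about",   // same host as the target: internal
                "http://other.org/links");    // different host: external
        System.out.println(anchorBoost("http://example.com/page", inlinks));
    }
}
```

With weights like these, a page that links to itself a hundred times with keyword-heavy anchors gains less than a page with a dozen links from other hosts, which is the distinction asked about above.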

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
