lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <cutt...@apache.org>
Subject Re: reoot site query results
Date Mon, 06 Dec 2004 17:10:04 GMT
In web search, link information helps greatly.  (This was Google's big 
discovery.)  There are lots more links that point to 
http://www.slashdot.org/ than to http://www.slashdot.org/xxx/yyy, and 
many (if not most) of these links have the term "slashdot", while links 
to http://www.slashdot.org/xxx/yyy are somewhat less likely to contain 
the term "slashdot".

As Erik hinted, Nutch uses this information.  It keeps has a database of 
links that point to each page, indexes their anchor text along with the 
page, and boosts highly linked pages more than lesser linked pages.

Doug

Chris Fraschetti wrote:
> My lucene implementation works great, its basically an index of many
> web crawls. The main thing my users complain about is say a search for
> "slashdot" will return the
> http://www.slashdot.org/soem_dir/somepage.asp as the top result
> because the factors i have scoring it determine it as so... but
> obviously in true search engine fashion.. i would like
> http://www.slashdot.org/ to be the very top result... i've added a
> boost to queries that match the hostname field, which helped a little,
> but obviously not a proper solution. Does anyone out there in the
> search engine world have a good schema for determining root websites
> and applying a huge boost to them in one fashion or another? mainly so
> it appears before any sub pages? (assuming the query is in reference
> to that site) ...
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message