lucene-java-user mailing list archives

From Erik Hatcher <>
Subject Re: reoot site query results
Date Mon, 06 Dec 2004 13:22:34 GMT
Perhaps look at Nutch to see whether (and if so, how) it deals with 
this situation.

Determining the root seems to be a pretty tricky endeavor.  Each of 
these could be a root:

And certainly lots of other combinations like you've described.
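
To make the URL-based check discussed in this thread concrete, here is a minimal sketch of one possible heuristic. It is not something either poster wrote: the class name, the list of default-document names, and the boost value of 1.5 are all assumptions chosen for illustration. It uses java.net.URI rather than the URL class mentioned below, only because URI parses without network access or checked protocol handlers. The heuristic treats a page as a likely root when the URL has a host, no query string, and a path that is empty, "/", or a single well-known index file name. It deliberately does not try to recognize other kinds of roots.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class RootPageHeuristic {

    // Hypothetical list of default-document names; adjust for the crawl at hand.
    private static final Set<String> INDEX_NAMES = new HashSet<>(Arrays.asList(
            "index.html", "index.htm", "index.php", "default.htm", "default.asp"));

    // Assumed, deliberately modest boost so that a clearly more relevant
    // non-root page can still outrank a root page.
    private static final float ROOT_BOOST = 1.5f;

    /** Returns true if the URL looks like the root page of its host. */
    public static boolean isLikelyRoot(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return false; // unparseable URLs are never treated as roots
        }
        if (uri.getHost() == null || uri.getQuery() != null) {
            return false; // no host, or a query string: not a plain root
        }
        String path = uri.getPath();
        if (path == null || path.isEmpty() || path.equals("/")) {
            return true; // http://host or http://host/
        }
        // Accept a single index file directly under the host, e.g. /index.html
        String file = path.substring(1).toLowerCase(Locale.ROOT);
        return INDEX_NAMES.contains(file);
    }

    /** Boost factor for a page: ROOT_BOOST for likely roots, 1.0 otherwise. */
    public static float boostFor(String url) {
        return isLikelyRoot(url) ? ROOT_BOOST : 1.0f;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.example.com",
            "http://www.example.com/index.html",
            "http://www.example.com/articles/index.html",
            "http://www.example.com/?page=2"
        };
        for (String s : samples) {
            System.out.println(s + " -> root=" + isLikelyRoot(s) + ", boost=" + boostFor(s));
        }
    }
}
```

The value returned by boostFor could be applied to the whole document at index time (in Lucene releases of this era, via Document.setBoost), which keeps the decision out of the query path. The right size for the boost depends on the other scoring factors in play and would need tuning against real queries.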


On Dec 6, 2004, at 5:54 AM, Chris Fraschetti wrote:

> I do this to some extent... currently I apply a boost if, as best I
> can tell, it's a root page. But I am asking more about how to determine
> root pages... content obviously isn't easy to use... the URL is the main
> key... but that can be tricky as well... Basically the pages are from
> a crawl, so their URLs show how they were originally linked to, i.e.
> may have been visited via an outgoing link of
> another page as  or some
> variant like that. the page is still the root, but now contains a
> page. Further into that I can simply check the hostname of the URL
> using Java's URL class, as well as the path that the URL class gives
> me... but how much of a boost would be appropriate? Too much of a
> boost might make it rank higher than a non-root page which
> is more relevant.
> On Mon, 6 Dec 2004 05:12:27 -0500, Erik Hatcher
> <> wrote:
>> On Dec 6, 2004, at 4:53 AM, Chris Fraschetti wrote:
>>> My Lucene implementation works great; it's basically an index of many
>>> web crawls. The main thing my users complain about is, say, a search for
>>> "slashdot" will return the
>>> as the top result
>>> because the factors i have scoring it determine it as so... but
>>> obviously in true search engine fashion.. i would like
>>> to be the very top result... i've added a
>>> boost to queries that match the hostname field, which helped a little,
>>> but is obviously not a proper solution. Does anyone out there in the
>>> search engine world have a good scheme for determining root websites
>>> and applying a huge boost to them in one fashion or another? Mainly so
>>> that they appear before any sub pages (assuming the query is in reference
>>> to that site)...
>> Consider applying the boost to the Document, rather than the field, at
>> index time.  I assume each document in your index represents one page.
>> At indexing time you know whether it is a root page or not, right?
>>        Erik
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> -- 
> ___________________________________________________
> Chris Fraschetti, Student CompSci System Admin
> University of San Francisco
