lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: reoot site query results
Date Mon, 06 Dec 2004 13:22:34 GMT
Perhaps look at Nutch to see whether (and if so, how) it deals with 
this situation.

Determining the root seems to be a pretty tricky endeavor.  Each of 
these could be a root:

	http://www.ehatchersolutions.com/JavaDevWithAnt
	http://www.example.com/~username

And certainly lots of other combinations like you've described.
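
To illustrate why this is tricky, here is a minimal sketch of a
URL-only heuristic. The class name RootPageHeuristic, the method
isLikelyRoot, and the list of default-document names are assumptions
made up for this example; none of this is taken from Nutch or Lucene.
It treats an empty path, "/", or a single default-document name as a
root, and ignores the query string.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class RootPageHeuristic {
    // Hypothetical list of default-document names; adjust for your crawl.
    private static final Set<String> INDEX_NAMES = new HashSet<>(Arrays.asList(
            "index.html", "index.htm", "index.asp", "index.php", "default.asp"));

    /** Returns true if the URL looks like a site root under this heuristic. */
    static boolean isLikelyRoot(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return false; // unparseable URLs are never treated as roots
        }
        if (uri.getHost() == null) {
            return false; // no hostname, so nothing to be the root of
        }
        // getPath() excludes the query string, so "?title=true" is ignored.
        String path = uri.getPath();
        if (path == null || path.isEmpty() || path.equals("/")) {
            return true;
        }
        // A single path segment that is a known default document also counts.
        String rest = path.substring(1).toLowerCase(Locale.ROOT);
        return INDEX_NAMES.contains(rest);
    }
}
```

Under this sketch http://www.microsoft.com/index.asp?title=true counts
as a root, but neither of the two example URLs above does, which is
exactly the gap a path-based check leaves open.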

	Erik


On Dec 6, 2004, at 5:54 AM, Chris Fraschetti wrote:

> I do this to some extent... currently I apply a boost if, as best I
> can tell, it's a root page. But I am really asking how to determine
> root pages... content obviously isn't easy to use... the URL is the
> main key... but that can be tricky as well. Basically the pages are
> from a crawl, so their URLs show how they were originally linked to,
> i.e. http://www.microsoft.com may have been visited via an outgoing
> link of another page as http://www.microsoft.com/index.asp?title=true
> or some variant like that. The page is still the root, but the URL now
> contains a page name. Beyond that, I can simply check the hostname of
> the URL using Java's URL class, as well as the path that the URL class
> gives me... but how much of a boost would be appropriate? Too much of
> a boost might make it rank higher than a non-root page which is more
> relevant.
>
>
> On Mon, 6 Dec 2004 05:12:27 -0500, Erik Hatcher
> <erik@ehatchersolutions.com> wrote:
>>
>>
>>
>> On Dec 6, 2004, at 4:53 AM, Chris Fraschetti wrote:
>>> My Lucene implementation works great; it's basically an index of many
>>> web crawls. The main thing my users complain about is that, say, a
>>> search for "slashdot" will return
>>> http://www.slashdot.org/soem_dir/somepage.asp as the top result
>>> because the scoring factors I have rank it that way... but
>>> obviously, in true search engine fashion, I would like
>>> http://www.slashdot.org/ to be the very top result. I've added a
>>> boost to queries that match the hostname field, which helped a
>>> little, but is obviously not a proper solution. Does anyone out there
>>> in the search engine world have a good scheme for determining root
>>> websites and applying a huge boost to them in one fashion or another,
>>> mainly so they appear before any sub pages (assuming the query is in
>>> reference to that site)?
>>
>> Consider applying the boost to the Document, rather than the field, at
>> index time.  I assume each document in your index represents one page.
>> At indexing time you know whether it is a root page or not, right?
>>
>>        Erik
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>
>
>
> -- 
> ___________________________________________________
> Chris Fraschetti, Student CompSci System Admin
> University of San Francisco
> e fraschetti@gmail.com | http://meteora.cs.usfca.edu
>



