lucene-java-user mailing list archives

From mpolzin <mikepol...@yahoo.com>
Subject Re: Limiting search result for web search engine
Date Thu, 04 Feb 2010 05:23:09 GMT

I changed one line below: I realized I had left out the ! (NOT). It is now
corrected in the original reply.


                if ((hq.Size() < numHits || score >= minScore) &&
                    !collectedBaseURLArray.Contains(doc.BaseURL))
                {

mpolzin wrote:
> 
> 
>             if (score > 0.0f) 
>             { 
> 
>                 // Do something here to get the document base URL (doc.BaseURL)
> 
>                 if ((hq.Size() < numHits || score >= minScore) &&
>                     !collectedBaseURLArray.Contains(doc.BaseURL))
>                 { 
>                     collectedBaseURLArray.Add(doc.BaseURL); 
>                     totalHits++; 
>                     hq.Insert(new ScoreDoc(doc, score)); 
>                     minScore = ((ScoreDoc) hq.Top()).score; // maintain minScore
>                 } 
>             } 
> 
> Does this make sense? 
> 
> How could I tell the search to use my extended version of the
> TopDocCollector class? Also, how would I pull the URL from the document
> inside of the loop above? I didn't see any good documentation anywhere on
> how to do that. There seems to be little information out there on how to
> build your own custom collector. 
> 
> Thanks again, 
> Mike 
> 
> 
> Anshum-2 wrote:
>> 
>> Hi Mike,
>> Not really through queries, but you may do this by writing a custom
>> collector. You'd need some supporting data structure to mark/hash the
>> occurrence of a domain in your result set.
>> 
>> --
>> Anshum Gupta
>> Naukri Labs!
>> http://ai-cafe.blogspot.com
>> 
>> The facts expressed here belong to everybody, the opinions to me. The
>> distinction is yours to draw............
>> 
>> 
>> On Wed, Feb 3, 2010 at 6:56 AM, Mike Polzin <mikepolzin@yahoo.com> wrote:
>> 
>>> I am working on building a web search engine and I would like to build a
>>> results page similar to what Google does. The functionality I am looking
>>> to include is what I refer to as "rolling up" sites, meaning that even if
>>> a particular site (defined by its base URL) has many relevant hits on
>>> various pages for the search keywords, that site is only shown once in
>>> the results listing, with a link to the most relevant hit on that site.
>>> What I do not want is to have one site dominate a search results page.
>>>
>>> Does it make sense to just do the search, get the hits list and then
>>> programmatically remove the results which, although they meet the search
>>> criteria, are not as relevant? Is there a way to do this through
>>> queries?
>>>
>>> Thanks in advance!
>>>
>>> Mike
>>>
>>>
>>>
>> 
>> 
> 
> 
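
For what it's worth, here is a minimal sketch of the roll-up done as a
post-processing step, in plain Java with no Lucene dependency. The Hit class,
baseUrl() and rollUp() are hypothetical names of my own, and treating the
lower-cased URL host as the "base URL" is an assumption. The sketch keeps the
best hit per site in a map instead of skipping any site already seen because,
as far as I know, a collector is handed documents in index order rather than
score order, so the first hit seen for a site is not necessarily its most
relevant one.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SiteRollup {

    /** Hypothetical hit record: a page URL and its relevance score. */
    public static final class Hit {
        public final String url;
        public final float score;

        public Hit(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /** Assumption: the "base URL" of a page is the lower-cased host of its URL. */
    public static String baseUrl(String url) {
        String host = URI.create(url).getHost();
        // Fall back to the raw string when no host can be parsed.
        return host == null ? url : host.toLowerCase();
    }

    /** Keeps the best-scoring hit per site, then returns at most numHits, best first. */
    public static List<Hit> rollUp(List<Hit> hits, int numHits) {
        Map<String, Hit> bestPerSite = new HashMap<>();
        for (Hit h : hits) {
            if (h.score <= 0.0f) {
                continue; // same guard as the quoted collector code
            }
            String site = baseUrl(h.url);
            Hit current = bestPerSite.get(site);
            // Replace the stored hit only when this one scores higher.
            if (current == null || h.score > current.score) {
                bestPerSite.put(site, h);
            }
        }
        List<Hit> result = new ArrayList<>(bestPerSite.values());
        result.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return result.size() > numHits
                ? new ArrayList<>(result.subList(0, numHits))
                : result;
    }
}
```

To use something like this you would gather (URL, score) pairs for more hits
than you intend to display and pass them through rollUp(). Note that
URI.create() throws IllegalArgumentException on a malformed URL, so real
crawl data would probably need a more forgiving parser.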
