lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ian Lea <ian....@gmail.com>
Subject Re: Limiting search result for web search engine
Date Thu, 04 Feb 2010 11:14:21 GMT
Writing a custom collector is pretty straightforward.  There is an
example in the javadocs for Collector.  Use it via
Searcher.search(query, collector) or search(query, filter, collector).

The docid is passed to the collect() method and you can use that to
get at the document and thus the URL, via your searcher or index
reader.  But there are performance implications to doing it this way -
you'll be looking at the URL for all hits, not just the top n that I
imagine you will be displaying.  If the index is big and you'll be
getting lots of hits that is likely to be a problem.  FieldCache might
help.

I think that I'd move your deduping logic to after the search and set
a limit on the number of hits that you check.  That way you'd also get
the best hit first.


--
Ian.


On Thu, Feb 4, 2010 at 5:23 AM, mpolzin <mikepolzin@yahoo.com> wrote:
>
> I changed one line below... realized I missed the ! (NOT).. corrected in
> original reply.
>
>
>  if ((hq.Size() < numHits || score >= minScore)  &&
> !collectedBaseURLArray.Contains(doc.BaseURL))
>                {
>
> mpolzin wrote:
>>
>>
>>             if (score > 0.0f)
>>             {
>>
>>                 // Do something here to get the document base URL
>> (doc.BaseURL)
>>
>>                 if ((hq.Size() < numHits || score >= minScore)  &&
>> !collectedBaseURLArray.Contains(doc.BaseURL))
>>                 {
>>                     collectedBaseURLArray.Add(doc.BaseURL);
>>                     totalHits++;
>>                     hq.Insert(new ScoreDoc(doc, score));
>>                     minScore = ((ScoreDoc) hq.Top()).score; // maintain
>> minScore
>>                 }
>>             }
>>
>> Does this make sense?
>>
>> How could I tell the search to use my extended version of the
>> TopDocCollector class? Also, how would I pull the URL from the document
>> inside of the loop above? I didn't see any good documentation anywhere on
>> how to do that. There seems to be little information out there on how to
>> build your own custom collector.
>>
>> Thanks again,
>> Mike
>>
>>
>> Anshum-2 wrote:
>>>
>>> Hi Mike,
>>> Not really through queries, but you may do this by writing a custom
>>> collector. You'd need some supporting data structure to mark/hash the
>>> occurrence of a domain in your result set.
>>>
>>> --
>>> Anshum Gupta
>>> Naukri Labs!
>>> http://ai-cafe.blogspot.com
>>>
>>> The facts expressed here belong to everybody, the opinions to me. The
>>> distinction is yours to draw............
>>>
>>>
>>> On Wed, Feb 3, 2010 at 6:56 AM, Mike Polzin <mikepolzin@yahoo.com> wrote:
>>>
>>>> I am working on building a web search engine and I would like to build a
>>>> reults page similar to what Google does. The functionality I am looking
>>>> to
>>>> include is what I refer to a "rolling up" sites, meaning that even if a
>>>> particular site (defined by its base URL) has many relevent hits on
>>>> various
>>>> pages for the searches keywords, that site is only shown once in the
>>>> results
>>>> listing with a link to the most relevent hit on that site. What I do not
>>>> want is to have one site dominate a search results page.
>>>>
>>>> Does it make sense to just do the search, get the hits list and then
>>>> programatically remove the results which, although they meet the search
>>>> criteria, are not as relevent? Is there a way to do this through
>>>> queries?
>>>>
>>>> Thanks in advance!
>>>>
>>>> Mike
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
> --
> View this message in context: http://old.nabble.com/Limiting-search-result-for-web-search-engine-tp27430155p27447903.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message