lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ian Lea <ian....@gmail.com>
Subject Re: Limiting search result for web search engine
Date Thu, 04 Feb 2010 20:10:39 GMT
Mike


Documents do not get passed to Collectors in order of highest score.
It is the job of the collector to gather the top scoring docs, as is
typically required, and implemented by TopScoreDocCollector for the
most commonly used search method calls (according to the javadocs -
read the javadocs!). Other collectors can do whatever they want with
matching documents.

On the performance issue of reading document fields for each hit,
there is a specific warning against it in the javadocs for Collector.


--
Ian.


On Thu, Feb 4, 2010 at 5:23 PM, mpolzin <mikepolzin@yahoo.com> wrote:
>
> Ian,
>
> Yes, this makes sense, my guess is that by creating a custom collector and
> in my overridden Collect method looking up each document by the docid to get
> the base URL is going to create a fairly significant performance hit.  And
> from the sounds of your response there is no guarantee that the documents
> are being passed to the collector in order of highest score? Is that true?
>
> If they were passed in order of highest score I could implement some code to
> only perform that logic for the first x documents depending on what is being
> viewed (1 - 10; 11 - 20; etc). In most cases people never search past the
> first few pages anyhow.
>
> Of course this sort of logic could, as you say, easily be done once post
> search, which is also the simplest approach. But then again, sometimes less
> is more.
>
> Mike
>
>
> Ian Lea wrote:
>>
>> Writing a custom collector is pretty straightforward.  There is an
>> example in the javadocs for Collector.  Use it via
>> Searcher.search(query, collector) or search(query, filter, collector).
>>
>> The docid is passed to the collect() method and you can use that to
>> get at the document and thus the URL, via your searcher or index
>> reader.  But there are performance implications to doing it this way -
>> you'll be looking at the URL for all hits, not just the top n that I
>> imagine you will be displaying.  If the index is big and you'll be
>> getting lots of hits that is likely to be a problem.  FieldCache might
>> help.
>>
>> I think that I'd move your deduping logic to after the search and set
>> a limit on the number of hits that you check.  That way you'd also get
>> the best hit first.
>>
>>
>> --
>> Ian.
>>
>>
>> On Thu, Feb 4, 2010 at 5:23 AM, mpolzin <mikepolzin@yahoo.com> wrote:
>>>
>>> I changed one line below... realized I missed the ! (NOT).. corrected in
>>> original reply.
>>>
>>>
>>>  if ((hq.Size() < numHits || score >= minScore)  &&
>>> !collectedBaseURLArray.Contains(doc.BaseURL))
>>>                {
>>>
>>> mpolzin wrote:
>>>>
>>>>
>>>>             if (score > 0.0f)
>>>>             {
>>>>
>>>>                 // Do something here to get the document base URL
>>>> (doc.BaseURL)
>>>>
>>>>                 if ((hq.Size() < numHits || score >= minScore)
 &&
>>>> !collectedBaseURLArray.Contains(doc.BaseURL))
>>>>                 {
>>>>                     collectedBaseURLArray.Add(doc.BaseURL);
>>>>                     totalHits++;
>>>>                     hq.Insert(new ScoreDoc(doc, score));
>>>>                     minScore = ((ScoreDoc) hq.Top()).score; //
maintain
>>>> minScore
>>>>                 }
>>>>             }
>>>>
>>>> Does this make sense?
>>>>
>>>> How could I tell the search to use my extended version of the
>>>> TopDocCollector class? Also, how would I pull the URL from the document
>>>> inside of the loop above? I didn't see any good documentation anywhere
>>>> on
>>>> how to do that. There seems to be little information out there on how to
>>>> build your own custom collector.
>>>>
>>>> Thanks again,
>>>> Mike
>>>>
>>>>
>>>> Anshum-2 wrote:
>>>>>
>>>>> Hi Mike,
>>>>> Not really through queries, but you may do this by writing a custom
>>>>> collector. You'd need some supporting data structure to mark/hash the
>>>>> occurrence of a domain in your result set.
>>>>>
>>>>> --
>>>>> Anshum Gupta
>>>>> Naukri Labs!
>>>>> http://ai-cafe.blogspot.com
>>>>>
>>>>> The facts expressed here belong to everybody, the opinions to me. The
>>>>> distinction is yours to draw............
>>>>>
>>>>>
>>>>> On Wed, Feb 3, 2010 at 6:56 AM, Mike Polzin <mikepolzin@yahoo.com>
>>>>> wrote:
>>>>>
>>>>>> I am working on building a web search engine and I would like to
build
>>>>>> a
>>>>>> reults page similar to what Google does. The functionality I am
>>>>>> looking
>>>>>> to
>>>>>> include is what I refer to a "rolling up" sites, meaning that even
if
>>>>>> a
>>>>>> particular site (defined by its base URL) has many relevent hits
on
>>>>>> various
>>>>>> pages for the searches keywords, that site is only shown once in
the
>>>>>> results
>>>>>> listing with a link to the most relevent hit on that site. What I
do
>>>>>> not
>>>>>> want is to have one site dominate a search results page.
>>>>>>
>>>>>> Does it make sense to just do the search, get the hits list and then
>>>>>> programatically remove the results which, although they meet the
>>>>>> search
>>>>>> criteria, are not as relevent? Is there a way to do this through
>>>>>> queries?
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Limiting-search-result-for-web-search-engine-tp27430155p27447903.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>
> --
> View this message in context: http://old.nabble.com/Limiting-search-result-for-web-search-engine-tp27430155p27456444.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message