From: mpolzin <mikepolzin@yahoo.com>
To: java-user@lucene.apache.org
Date: Thu, 4 Feb 2010 09:23:43 -0800 (PST)
Subject: Re: Limiting search result for web search engine

Ian,

Yes, this makes sense. My guess is that creating a custom collector and, in my
overridden collect() method, looking up each document by its docid to get the
base URL is going to be a fairly significant performance hit. And from the
sound of your response, there is no guarantee that documents are passed to the
collector in order of highest score - is that true? If they were passed in
score order I could perform that logic only for the first x documents,
depending on which page is being viewed (1-10, 11-20, etc.). In most cases
people never search past the first few pages anyhow. I've put a rough,
untested sketch of what I mean below your first paragraph.

Of course this sort of logic could, as you say, easily be done after the
search, which is also the simplest approach. But then again, sometimes less is
more.

Mike


Ian Lea wrote:
>
> Writing a custom collector is pretty straightforward. There is an
> example in the javadocs for Collector. Use it via
> Searcher.search(query, collector) or search(query, filter, collector).
>
> The docid is passed to the collect() method and you can use that to
> get at the document, and thus the URL, via your searcher or index
> reader. But there are performance implications to doing it this way -
> you'll be looking at the URL for all hits, not just the top n that I
> imagine you will be displaying.
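Here is a rough, untested sketch of the kind of collector I have in mind,
written against the plain Collector API rather than by extending
TopDocCollector. The "baseurl" stored field and the class/method names are
just my own naming for illustration, and the document load inside collect()
is exactly the per-hit cost you describe:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Scorer;

/**
 * Keeps only the highest-scoring hit per base URL. Hits arrive in index
 * order, not score order, so the best hit per URL is tracked as we go and
 * sorted at the end. The "baseurl" stored-field name is an assumption.
 */
public class DedupingCollector extends Collector {

    private final Map<String, ScoreDoc> bestPerUrl = new HashMap<String, ScoreDoc>();

    private Scorer scorer;
    private IndexReader reader;
    private int docBase;

    @Override
    public void setScorer(Scorer scorer) {
        this.scorer = scorer;
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) {
        this.reader = reader;   // per-segment reader
        this.docBase = docBase; // offset to turn segment docids into global ones
    }

    @Override
    public void collect(int doc) throws IOException {
        // This per-hit document load is the expensive part; FieldCache
        // (e.g. FieldCache.DEFAULT.getStrings(reader, "baseurl")) is the
        // usual way to make it cheap if the field is indexed un-tokenized.
        String baseUrl = reader.document(doc).get("baseurl");
        if (baseUrl == null) {
            return;
        }
        float score = scorer.score();
        ScoreDoc best = bestPerUrl.get(baseUrl);
        if (best == null || score > best.score) {
            bestPerUrl.put(baseUrl, new ScoreDoc(docBase + doc, score));
        }
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true; // nothing here depends on the order hits arrive in
    }

    /** One hit per base URL, best score first. */
    public List<ScoreDoc> getDedupedHits() {
        List<ScoreDoc> hits = new ArrayList<ScoreDoc>(bestPerUrl.values());
        Collections.sort(hits, new Comparator<ScoreDoc>() {
            public int compare(ScoreDoc a, ScoreDoc b) {
                return Float.compare(b.score, a.score);
            }
        });
        return hits;
    }
}

Calling it would look something like:

DedupingCollector collector = new DedupingCollector();
searcher.search(query, collector);
List<ScoreDoc> hits = collector.getDedupedHits(); // one hit per base URL, best first

and documents would then only need to be loaded again for the ten hits
actually being displayed on the current page.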
> If the index is big and you'll be
> getting lots of hits that is likely to be a problem. FieldCache might
> help.
>
> I think that I'd move your deduping logic to after the search and set
> a limit on the number of hits that you check. That way you'd also get
> the best hit first.
>
>
> --
> Ian.
>
>
> On Thu, Feb 4, 2010 at 5:23 AM, mpolzin wrote:
>>
>> I changed one line below... I realized I missed the ! (NOT); it is
>> corrected in the original reply.
>>
>>
>>     if ((hq.Size() < numHits || score >= minScore) &&
>>         !collectedBaseURLArray.Contains(doc.BaseURL))
>>     {
>>
>> mpolzin wrote:
>>>
>>>
>>>     if (score > 0.0f)
>>>     {
>>>         // Do something here to get the document base URL (doc.BaseURL)
>>>
>>>         if ((hq.Size() < numHits || score >= minScore) &&
>>>             !collectedBaseURLArray.Contains(doc.BaseURL))
>>>         {
>>>             collectedBaseURLArray.Add(doc.BaseURL);
>>>             totalHits++;
>>>             hq.Insert(new ScoreDoc(doc, score));
>>>             minScore = ((ScoreDoc) hq.Top()).score; // maintain minScore
>>>         }
>>>     }
>>>
>>> Does this make sense?
>>>
>>> How could I tell the search to use my extended version of the
>>> TopDocCollector class? Also, how would I pull the URL from the document
>>> inside of the loop above? I didn't see any good documentation anywhere
>>> on how to do that. There seems to be little information out there on
>>> how to build your own custom collector.
>>>
>>> Thanks again,
>>> Mike
>>>
>>>
>>> Anshum-2 wrote:
>>>>
>>>> Hi Mike,
>>>> Not really through queries, but you may do this by writing a custom
>>>> collector. You'd need some supporting data structure to mark/hash the
>>>> occurrence of a domain in your result set.
>>>>
>>>> --
>>>> Anshum Gupta
>>>> Naukri Labs!
>>>> http://ai-cafe.blogspot.com
>>>>
>>>> The facts expressed here belong to everybody, the opinions to me. The
>>>> distinction is yours to draw............
>>>>
>>>>
>>>> On Wed, Feb 3, 2010 at 6:56 AM, Mike Polzin wrote:
>>>>
>>>>> I am working on building a web search engine and I would like to
>>>>> build a results page similar to what Google does. The functionality I
>>>>> am looking to include is what I refer to as "rolling up" sites,
>>>>> meaning that even if a particular site (defined by its base URL) has
>>>>> many relevant hits on various pages for the search keywords, that
>>>>> site is only shown once in the results listing, with a link to the
>>>>> most relevant hit on that site. What I do not want is to have one
>>>>> site dominate a search results page.
>>>>>
>>>>> Does it make sense to just do the search, get the hits list and then
>>>>> programmatically remove the results which, although they meet the
>>>>> search criteria, are not as relevant? Is there a way to do this
>>>>> through queries?
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> Mike
>>>>>

--
View this message in context:
http://old.nabble.com/Limiting-search-result-for-web-search-engine-tp27430155p27456444.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org