From: mpolzin <mikepolzin@yahoo.com>
To: java-user@lucene.apache.org
Date: Thu, 4 Feb 2010 09:23:43 -0800 (PST)
Subject: Re: Limiting search result for web search engine

Ian,

Yes, this makes sense. My guess is that creating a custom collector and, in my
overridden collect() method, looking up each document by its docid to get the
base URL is going to be a fairly significant performance hit. And from the
sound of your response, there is no guarantee that documents are passed to the
collector in order of highest score - is that true? If they were passed in
score order I could perform that logic only for the first x documents,
depending on which page is being viewed (1-10, 11-20, etc.). In most cases
people never search past the first few pages anyhow. I've put a rough,
untested sketch of what I mean below your first paragraph.

Of course this sort of logic could, as you say, easily be done after the
search, which is also the simplest approach. But then again, sometimes less is
more.

Mike


Ian Lea wrote:
>
> Writing a custom collector is pretty straightforward. There is an
> example in the javadocs for Collector. Use it via
> Searcher.search(query, collector) or search(query, filter, collector).
>
> The docid is passed to the collect() method and you can use that to
> get at the document, and thus the URL, via your searcher or index
> reader. But there are performance implications to doing it this way -
> you'll be looking at the URL for all hits, not just the top n that I
> imagine you will be displaying.
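Here is a rough, untested sketch of the kind of collector I have in mind,
written against the plain Collector API rather than by extending
TopDocCollector. The "baseurl" stored field and the class/method names are
just my own naming for illustration, and the document load inside collect()
is exactly the per-hit cost you describe:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Scorer;

/**
 * Keeps only the highest-scoring hit per base URL. Hits arrive in index
 * order, not score order, so the best hit per URL is tracked as we go and
 * sorted at the end. The "baseurl" stored-field name is an assumption.
 */
public class DedupingCollector extends Collector {

    private final Map<String, ScoreDoc> bestPerUrl = new HashMap<String, ScoreDoc>();

    private Scorer scorer;
    private IndexReader reader;
    private int docBase;

    @Override
    public void setScorer(Scorer scorer) {
        this.scorer = scorer;
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) {
        this.reader = reader;   // per-segment reader
        this.docBase = docBase; // offset to turn segment docids into global ones
    }

    @Override
    public void collect(int doc) throws IOException {
        // This per-hit document load is the expensive part; FieldCache
        // (e.g. FieldCache.DEFAULT.getStrings(reader, "baseurl")) is the
        // usual way to make it cheap if the field is indexed un-tokenized.
        String baseUrl = reader.document(doc).get("baseurl");
        if (baseUrl == null) {
            return;
        }
        float score = scorer.score();
        ScoreDoc best = bestPerUrl.get(baseUrl);
        if (best == null || score > best.score) {
            bestPerUrl.put(baseUrl, new ScoreDoc(docBase + doc, score));
        }
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true; // nothing here depends on the order hits arrive in
    }

    /** One hit per base URL, best score first. */
    public List<ScoreDoc> getDedupedHits() {
        List<ScoreDoc> hits = new ArrayList<ScoreDoc>(bestPerUrl.values());
        Collections.sort(hits, new Comparator<ScoreDoc>() {
            public int compare(ScoreDoc a, ScoreDoc b) {
                return Float.compare(b.score, a.score);
            }
        });
        return hits;
    }
}

Calling it would look something like:

DedupingCollector collector = new DedupingCollector();
searcher.search(query, collector);
List<ScoreDoc> hits = collector.getDedupedHits(); // one hit per base URL, best first

and documents would then only need to be loaded again for the ten hits
actually being displayed on the current page.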
> If the index is big and you'll be
> getting lots of hits that is likely to be a problem. FieldCache might
> help.
>
> I think that I'd move your deduping logic to after the search and set
> a limit on the number of hits that you check. That way you'd also get
> the best hit first.
>
>
> --
> Ian.
>
>
> On Thu, Feb 4, 2010 at 5:23 AM, mpolzin wrote:
>>
>> I changed one line below... I realized I missed the ! (NOT); it is
>> corrected in the original reply.
>>
>>
>>     if ((hq.Size() < numHits || score >= minScore) &&
>>         !collectedBaseURLArray.Contains(doc.BaseURL))
>>     {
>>
>> mpolzin wrote:
>>>
>>>
>>>     if (score > 0.0f)
>>>     {
>>>         // Do something here to get the document base URL (doc.BaseURL)
>>>
>>>         if ((hq.Size() < numHits || score >= minScore) &&
>>>             !collectedBaseURLArray.Contains(doc.BaseURL))
>>>         {
>>>             collectedBaseURLArray.Add(doc.BaseURL);
>>>             totalHits++;
>>>             hq.Insert(new ScoreDoc(doc, score));
>>>             minScore = ((ScoreDoc) hq.Top()).score; // maintain minScore
>>>         }
>>>     }
>>>
>>> Does this make sense?
>>>
>>> How could I tell the search to use my extended version of the
>>> TopDocCollector class? Also, how would I pull the URL from the document
>>> inside of the loop above? I didn't see any good documentation anywhere
>>> on how to do that. There seems to be little information out there on
>>> how to build your own custom collector.
>>>
>>> Thanks again,
>>> Mike
>>>
>>>
>>> Anshum-2 wrote:
>>>>
>>>> Hi Mike,
>>>> Not really through queries, but you may do this by writing a custom
>>>> collector. You'd need some supporting data structure to mark/hash the
>>>> occurrence of a domain in your result set.
>>>>
>>>> --
>>>> Anshum Gupta
>>>> Naukri Labs!
>>>> http://ai-cafe.blogspot.com
>>>>
>>>> The facts expressed here belong to everybody, the opinions to me. The
>>>> distinction is yours to draw............
>>>>
>>>>
>>>> On Wed, Feb 3, 2010 at 6:56 AM, Mike Polzin wrote:
>>>>
>>>>> I am working on building a web search engine and I would like to
>>>>> build a results page similar to what Google does. The functionality I
>>>>> am looking to include is what I refer to as "rolling up" sites,
>>>>> meaning that even if a particular site (defined by its base URL) has
>>>>> many relevant hits on various pages for the search keywords, that
>>>>> site is only shown once in the results listing, with a link to the
>>>>> most relevant hit on that site. What I do not want is to have one
>>>>> site dominate a search results page.
>>>>>
>>>>> Does it make sense to just do the search, get the hits list and then
>>>>> programmatically remove the results which, although they meet the
>>>>> search criteria, are not as relevant? Is there a way to do this
>>>>> through queries?
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> Mike
>>>>>

--
View this message in context:
http://old.nabble.com/Limiting-search-result-for-web-search-engine-tp27430155p27456444.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org