From: kevin@ckhill.com
To: Lucene Users List
Date: Sun, 11 Apr 2004 13:03:01 -0400
Subject: Re: clustering results
Message-ID: <1081702980.40797a45018bb@webmail.ckhill.com>
In-Reply-To: <013501c41f78$e1982e40$0300a8c0@attbi.com>
List-Id: "Lucene Users List"

I got all excited reading the subject line "clustering results" but this isn't really clustering is
it? This is more like sorting. Does anyone know of any work within Lucene (or another indexer) on actual subject clustering (like Vivisimo at http://vivisimo.com/ or Kartoo at http://www.kartoo.com/)? It would be pretty awesome if Lucene had that ability. I know there aren't many clustering options, and the commercial products are very expensive. Anyhow, just curious.

A brief definition of clustering: automatically organizing search or database query results into meaningful hierarchical folders ... transforming long lists of search results into categorized information without any clumsy pre-processing of the source documents.

I'm not sure how it would be done. Based on the top term frequencies for a document?

-K

Quoting "Michael A. Schoen":

> So as Venu pointed out, sorting doesn't seem to help the problem. If we have
> to walk the result set, access docs and dedupe using brute force, we're
> better off w/ the standard order by relevance.
>
> If you've got an example of this type of clustering done in a more efficient
> way, that'd be great.
>
> Any other ideas?
>
>
> ----- Original Message -----
> From: "Erik Hatcher"
> To: "Lucene Users List"
> Sent: Saturday, April 10, 2004 12:35 AM
> Subject: Re: clustering results
>
>
> > On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote:
> > > I have an index of urls, and need to display the top 10 results for a
> > > given query, but want to display only 1 result per domain. It seems
> > > that using either Hits or a HitCollector, I'll need to access the doc,
> > > grab the domain field (I'll have it parsed ahead of time) and only
> > > take/display documents that are unique.
> > >
> > > A significant percentage of the time I expect I may have to access
> > > thousands of results before I find 10 in unique domains. Is there a
> > > faster approach that won't require accessing thousands of documents?
> >
> > I have examples of this that I can post when I have more time, but a
> > quick pointer...
> > check out the overloaded IndexSearcher.search()
> > methods which accept a Sort. You can do really really interesting
> > slicing and dicing, I think, using it. Try this one on for size:
> >
> >     example.displayHits(allBooks,
> >         new Sort(new SortField[]{
> >             new SortField("category"),
> >             SortField.FIELD_SCORE,
> >             new SortField("pubmonth", SortField.INT, true)
> >         }));
> >
> > Be clever indexing the piece you want to group on - I think you may
> > find this the solution you're looking for.
> >
> >     Erik
>
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
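
For reference, the brute-force walk described in the quoted thread (take hits in relevance order, keep the first one seen for each domain, stop once 10 are collected) might look roughly like the sketch below. It is plain Java with no Lucene calls. The Hit class, its fields, and topPerDomain are hypothetical names made up for illustration, and the domain is assumed to have been parsed and stored ahead of time, as Michael mentions. It does not remove the cost he is worried about: in the worst case the loop still visits every hit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OnePerDomain {

    /** Hypothetical stand-in for a search hit; not a Lucene class. */
    static final class Hit {
        final String url;
        final String domain; // assumed to be parsed at index time
        final float score;

        Hit(String url, String domain, float score) {
            this.url = url;
            this.domain = domain;
            this.score = score;
        }
    }

    /**
     * Walks hits in the order given (assumed: best score first), keeps the
     * first hit for each domain, and stops as soon as `limit` hits are kept.
     */
    static List<Hit> topPerDomain(List<Hit> rankedHits, int limit) {
        Set<String> seenDomains = new HashSet<>();
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : rankedHits) {
            if (kept.size() >= limit) {
                break; // enough unique domains found; no need to look further
            }
            // Set.add returns false when the domain was already seen.
            if (seenDomains.add(hit.domain)) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample data, already ordered by descending score.
        List<Hit> ranked = Arrays.asList(
                new Hit("http://a.example/1", "a.example", 0.9f),
                new Hit("http://a.example/2", "a.example", 0.8f),
                new Hit("http://b.example/1", "b.example", 0.7f),
                new Hit("http://a.example/3", "a.example", 0.6f),
                new Hit("http://c.example/1", "c.example", 0.5f));

        for (Hit hit : topPerDomain(ranked, 10)) {
            System.out.println(hit.score + "  " + hit.url);
        }
    }
}
```

The early exit is the only saving here: when the top hits are spread across many domains the loop ends quickly, and when one domain dominates the ranking it degrades to scanning thousands of hits, which is the case the thread is asking about.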