Date: Tue, 11 Nov 2003 11:31:10 -0800
Subject: Re: Document Clustering
From: "Joshua O'Madadhain"
To: "Lucene Users List"

On Tuesday, Nov 11, 2003, at 11:05 US/Pacific, Marcel Stor wrote:

> Stefan Groschupf wrote:
>> Hi,
>>> How is document clustering different/related to text categorization?
>>
>> Clustering: try to find own categories and put documents that match
>> in it. You group all documents with minimal distance together.
>
> Would I be correct to say that you have to define a "distance
> threshold" parameter in order to define when to build a new category
> for a certain group?

Depends on the type of clustering algorithm. Some clustering
algorithms take the number of clusters as a parameter (in this case the
algorithm may be run several times with different values, to determine
the best value). Other types of algorithms, such as hierarchical
agglomerative clustering algorithms, work more as you suggest.

Regards,

Joshua O'Madadhain

jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
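
P.S. To make the "distance threshold" behaviour of hierarchical agglomerative clustering concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the one-dimensional integer "documents", and the choice of single-link distance are all hypothetical and not taken from any particular library or from Lucene. Real document clustering would use a distance between term vectors instead.

```python
from itertools import combinations


def single_link_distance(a, b):
    """Smallest absolute difference between any member of a and any member of b."""
    return min(abs(x - y) for x in a for y in b)


def agglomerative(points, threshold):
    """Repeatedly merge the two closest clusters.

    Stops when the closest remaining pair is farther apart than
    `threshold`, so the threshold (not a preset cluster count)
    decides how many clusters come out.
    """
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j is guaranteed by combinations).
        (i, j), dist = min(
            (((i, j), single_link_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break  # every remaining pair is farther apart than the threshold
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    data = [1, 2, 10, 11, 30]
    print(agglomerative(data, 3))   # [[1, 2], [10, 11], [30]]
    print(agglomerative(data, 10))  # [[1, 2, 10, 11], [30]]
```

Raising the threshold yields fewer, larger clusters. An algorithm that takes the number of clusters as a parameter would instead fix that count up front and be rerun with different values.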