From: carp@alias-i.com
Date: Tue, 02 May 2006 16:55:06 -0400
To: java-dev@lucene.apache.org
Subject: Re: Returning a minimum number of clusters

Marvin Humphrey wrote:

>> BTW, clustering in Information Retrieval usually implies grouping by
>> vector distance using statistical methods:
>>
>> http://en.wikipedia.org/wiki/Data_clustering

In general, all you need is objects with a pairwise similarity
(dissimilarity) measure. With (term) vectors, that's usually one of
the multitude of TF/IDF cosine measures, whereas in other machine
learning apps it's typically Euclidean distance (often z-score
normalized to scale the dimensions).

For the more sophisticated clustering algorithms, like EM
(soft/model-based) clustering, you can use similarities between
clusters (instead of deriving these from similarities between items).

> Exactly. I'd scanned this, but I haven't yet familiarized myself with
> the different models.
>
> It may be possible for both keyword fields e.g. "host" and non-keyword
> fields e.g. "content" to be clustered using the same algorithm and an
> interface like Hits.cluster(String fieldname, int docsPerCluster).
> Retrieve each hit's vector for the specified field, and map the docs
> into a unified term space, then cluster. For "host" or any other
> keyword field, the boundaries will be stark and the cost of calculation
> negligible. For "content", a more sophisticated model will be required
> to group the docs and the cost will be greater.

This is an issue of scaling the different dimensions. You can "boost"
the dimensions any way you want just like other vector-based search
operations.

> It is more expensive to calculate similarity based on the entire
> document's contents rather than just a snippet chosen by the
> Highlighter. However, it's presumably more accurate, and having the
> Term Vectors pre-built at index time should help quite a bit.

This varies, actually, depending on the document.
If you grab HTML from a portal, and use it all, pages from that portal
will tend to cluster together. If you just use snippets of text around
document passages that match your query, you can actually get more
accurate clustering relative to your query. It really depends on
whether the documents are single-topic and coherent. If so, use them
all; if not, use snippets. [You can see this problem leading the
Google news classifier astray on occasion.]

A typical way to approximate is by only taking high TF/IDF terms.
Principal component methods are also popular (e.g. latent semantic
indexing) to reduce dimensionality (usually with a least-squares fit
criterion).

A more extreme way to approximate is with signature files (e.g. to do
web-scale "more documents like this"), but Lucene's not going to help
you there. Check out "Managing Gigabytes" for more on this approach.

- Bob Carpenter
Alias-i
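
To make the TF/IDF cosine measure and the "only take high TF/IDF terms" approximation discussed above concrete, here is a minimal Java sketch. Everything in it is an assumption made for illustration: the class and method names (TfIdfCosine, docFreqs, tfIdf, topTerms, cosine) are hypothetical and are not Lucene APIs, documents are represented as plain term-to-count maps rather than stored term vectors, and tf * log(N/df) is just one of the many weighting variants.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; not part of Lucene.
class TfIdfCosine {

    // Document frequency: the number of documents that contain each term.
    static Map<String, Integer> docFreqs(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (String term : doc.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Weight each term by tf * log(N / df). Terms missing from df are
    // treated as occurring in one document (an assumption of this sketch).
    static Map<String, Double> tfIdf(Map<String, Integer> termFreqs,
                                     Map<String, Integer> df, int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int docFreq = df.getOrDefault(e.getKey(), 1);
            double idf = Math.log((double) numDocs / docFreq);
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    // Approximation step: keep only the k highest-weighted terms.
    static Map<String, Double> topTerms(Map<String, Double> weights, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(weights.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        Map<String, Double> kept = new HashMap<>();
        for (Map.Entry<String, Double> e : entries.subList(0, Math.min(k, entries.size()))) {
            kept.put(e.getKey(), e.getValue());
        }
        return kept;
    }

    // Cosine of the angle between two sparse weight vectors;
    // returns 0.0 if either vector has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }
}
```

A caller would compute docFreqs once over the hit list, build a tfIdf vector per document (from the whole document or from query-matching snippets, per the trade-off above), optionally truncate each with topTerms, and feed the pairwise cosine values to whatever clustering algorithm is in use. "Boosting" a dimension would amount to multiplying selected weights before the cosine step.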