From: Shashikant Kore
Date: Tue, 11 Aug 2009 23:24:20 +0530
Subject: Re: Methods for Naming Clusters
To: mahout-user@lucene.apache.org

On Tue, Aug 11, 2009 at 8:57 PM, Ted Dunning wrote:
> If you expand the LLR equation and look at which terms are big, you will see
> k_11 * log(mumble) as an important term for many words.  Usually, this is
> about the same as tf.idf since mumble is about the same as the term
> frequency.  For a single document, tf.idf is a very close approximation of
> LLR.  With many documents, the situation can change dramatically, however,
> and you can get cancellation effects that eliminate documents that would
> otherwise have high tf.idf.  These are generally the terms that lead to
> over-fitting with methods like naive bayes and are often not such great
> cluster descriptors.
>

Let me restate what I understood. If a phrase is identified as a prominent
phrase by LLR and it also happens to be the top-weighted feature in the
centroid vector, it is not a good candidate for a cluster label. Is this
correct?

--shashi
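
For reference, here is a minimal sketch of the "k_11 * log(mumble)" term from the quoted message, assuming the usual 2x2 contingency-table form of the log-likelihood ratio (G^2). The function names, the reading of the four counts (term vs. other terms, inside vs. outside the cluster), and the example counts are illustrative assumptions; none of them come from the thread or from Mahout's code.

```python
import math


def llr_terms(k11, k12, k21, k22):
    """Per-cell contributions 2 * k_ij * ln(k_ij * N / (row_i * col_j)).

    Assumed meaning of the counts (hypothetical, for illustration only):
      k11: occurrences of the term inside the cluster
      k12: occurrences of the term outside the cluster
      k21: occurrences of all other terms inside the cluster
      k22: occurrences of all other terms outside the cluster
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)   # row sums: term, other terms
    cols = (k11 + k21, k12 + k22)   # column sums: in cluster, outside
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    terms = []
    for k, i, j in cells:
        if k == 0:
            # By convention 0 * ln(0) is taken as 0.
            terms.append(0.0)
        else:
            # k * n / (rows[i] * cols[j]) is observed / expected under independence.
            terms.append(2.0 * k * math.log(k * n / (rows[i] * cols[j])))
    return terms


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 table: sum of the cell terms."""
    return sum(llr_terms(k11, k12, k21, k22))


if __name__ == "__main__":
    # Made-up counts, purely to show the call pattern.
    counts = (30, 70, 970, 98930)
    print("cell terms:", llr_terms(*counts))
    print("LLR:", llr(*counts))
```

The first element returned by llr_terms is the k_11 contribution. Comparing it with the total shows how much the other three cells add to it or cancel it for a given term.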