Subject: Re: Kmeans clusterdump Interpretation
From: Ankit Goel
To: user@mahout.apache.org
Date: Tue, 21 Jul 2015 07:15:30 +0530

That kind of puts me in a tough position. I was planning to use kmeans to aggregate similar articles from multiple news sources, and then pick a representative article from each group. By "similar" I mean articles that come from different news sources but are about the exact same thing. Intuitively it seems that these articles would get grouped together. Any suggestions on how I should go about that? So far I'm using nutch to crawl, solr to index, and now I'm here on mahout.

On Tue, Jul 21, 2015 at 7:10 AM, Ted Dunning wrote:

> The most central point in a cluster is often referred to as a medoid
> (similar to median, but multi-dimensional).
>
> The Mahout code does not compute medoids. In general, they are difficult
> to compute and implementing a full k-medoid clustering algorithm even more
> so.
>
> On Mon, Jul 20, 2015 at 6:25 PM, Ankit Goel wrote:
>
> > Oh, I thought kmeans gave me a point vector as a centroid, not a
> > calculated point central to a cluster. I guess in this case I would be
> > looking for the most central point vector (from the index) that I can
> > use as a representative of the cluster.
> >
> > On Tue, Jul 21, 2015 at 6:41 AM, Andrew Musselman <
> > andrew.musselman@gmail.com> wrote:
> >
> > > I'm not sure centroid id is even a defined thing, especially since the
> > > centroid, in my understanding, is just a point in space, not
> > > necessarily a point in your data.
> > >
> > > Are you trying to find the most-central point in a given cluster?
> > >
> > > On Mon, Jul 20, 2015 at 5:18 PM, Ankit Goel wrote:
> > >
> > > > Hi,
> > > > I've been messing with mahout 0.10 and kmeans clustering with a solr
> > > > 4.6.1 index. The data is news articles. The --field option for kmeans
> > > > is set to "content". The idField is set to "title" (just so I can
> > > > analyse it faster). The clusterdump of the kmeans result gives me a
> > > > proper output, but I can't figure out the id of the vector chosen as
> > > > the center. There are only 14-15 articles, so I am not hung up about
> > > > the cluster performance at this time.
> > > >
> > > > I used random seeds for the kmeans commandline.
> > > > For reference, this is the commandline cluster dump I am executing:
> > > >
> > > > bin/mahout clusterdump -i $MAHOUT_HOME/testCluster/clusters-3-final
> > > > -p $MAHOUT_HOME/testCluster/clusteredPoints -d $MAHOUT_HOME/dict.txt -b 5
> > > >
> > > > The output I get is of the form
> > > >
> > > > :{"r":
> > > >
> > > > top terms
> > > >
> > > > xxxxx==>xxxxx
> > > >
> > > > Weight : [props - optional]: Point:
> > > >
> > > > 1.0 : [distance=0.0]: [{"account":0.026}.......other features]
> > > >
> > > > 1.0 : [distance=0.3963903651622338]: [....]
> > > >
> > > > So how exactly do I get the centroid id? I have even tried accessing
> > > > it with java
> > > >
> > > > ClusterWritable value.getValue().getCenter()
> > > >
> > > > but this just gives me the features and values of the centroid.
> > > >
> > > > Also, please do explain the meaning of "account":0.026 (just making
> > > > sure I know it right). I used tfidf.
> > > >
> > > > --
> > > > Regards,
> > > > Ankit Goel
> > > > http://about.me/ankitgoel
> >
> > --
> > Regards,
> > Ankit Goel
> > http://about.me/ankitgoel

--
Regards,
Ankit Goel
http://about.me/ankitgoel
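
For what it's worth, one cheap stand-in for the "most central point" discussed in this thread is to take the cluster member that lies closest to the computed centroid. This is not a true medoid (a medoid minimises the total distance to all other members), and it is not something Mahout provides as an API. The sketch below is plain Java with no Mahout dependency. The class name NearestToCentroid, the term-to-weight maps, the Euclidean distance, and the sample titles and weights are all assumptions made up for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class NearestToCentroid {

    /** Euclidean distance between two sparse vectors stored as term -> weight maps. */
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        // Terms present in a (b contributes 0.0 when the term is missing).
        for (Map.Entry<String, Double> e : a.entrySet()) {
            double diff = e.getValue() - b.getOrDefault(e.getKey(), 0.0);
            sum += diff * diff;
        }
        // Terms present only in b.
        for (Map.Entry<String, Double> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                sum += e.getValue() * e.getValue();
            }
        }
        return Math.sqrt(sum);
    }

    /** Returns the id of the point closest to the centroid, or null if there are no points. */
    static String nearestId(Map<String, Double> centroid,
                            Map<String, Map<String, Double>> pointsById) {
        String bestId = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> p : pointsById.entrySet()) {
            double d = distance(centroid, p.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                bestId = p.getKey();
            }
        }
        return bestId;
    }

    public static void main(String[] args) {
        // Hypothetical centroid and member vectors (made-up tf-idf style weights).
        Map<String, Double> centroid = new HashMap<>();
        centroid.put("account", 0.5);
        centroid.put("bank", 0.5);

        Map<String, Map<String, Double>> points = new LinkedHashMap<>();
        Map<String, Double> a = new HashMap<>();
        a.put("account", 0.6);
        a.put("bank", 0.4);
        points.put("Article A", a);
        Map<String, Double> b = new HashMap<>();
        b.put("account", 1.0);
        points.put("Article B", b);

        System.out.println(nearestId(centroid, points));
    }
}
```

With the idField set to "title" as in the original question, the id returned here would correspond to the article title. The per-point distance value that clusterdump prints appears to carry the same information, so sorting a cluster's points by that value may be enough without any extra code.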