From: Ted Dunning
Date: Mon, 11 Jul 2011 21:36:32 -0700
Subject: Re: Clustering with id
To: user@mahout.apache.org

Can you give specific examples?

The process should be relatively straightforward. The rule that rows take
their labels from the left operand of a product, and columns take theirs
from the right operand, should be sufficient. Sums should have the same row
and column labels as their operands, if any. From these constraints
everything else should follow.

On Mon, Jul 11, 2011 at 8:59 PM, Lance Norskog wrote:

> I mean, walking through the algorithms and tracking what vector name
> becomes what matrix row/column label.
>
> On Mon, Jul 11, 2011 at 8:58 PM, Lance Norskog wrote:
> > I'm finding it hard to maintain these labels across vector and matrix
> > factorizations & direct operations.
> >
> > On Mon, Jul 11, 2011 at 1:10 AM, Gabor Makrai wrote:
> >> Thank you very much! NamedVector should solve my problem!
> >> Anyway, I'm always amazed at how fast answers come on the Hadoop lists!
> >>
> >> Thank you,
> >> Gabor
> >>
> >> On Mon, Jul 11, 2011 at 3:51 AM, Lance Norskog wrote:
> >>
> >>> The NamedVector class adds a string to any vector, forwarding all
> >>> methods to the wrapped vector. You can cluster these, and then pull
> >>> the strings. The clustering algorithm operates on the wrapped vector.
> >>>
> >>> Lance
> >>>
> >>> On Sun, Jul 10, 2011 at 4:18 PM, Gabor Makrai wrote:
> >>> > Hi,
> >>> >
> >>> > I'm a little bit confused about Mahout's clustering algorithms. I'd
> >>> > like to cluster data that has an id column. How can I do that?
> >>> > For example, I'd like to run K-Means clustering on the Iris data set
> >>> > (http://archive.ics.uci.edu/ml/datasets/Iris), where I have four
> >>> > numerical columns. I generated an id column to identify the records,
> >>> > and when the clustering is done, I'd like to see the results.
> >>> > When I examined the code, I realized that I can create DenseVector
> >>> > instances (with the four numerical columns, without the id) and write
> >>> > them in VectorWritable format. These were my input data. After I
> >>> > managed to run K-Means, I got IntWritable/WeightedVectorWritable
> >>> > key/value pairs, where the keys tell me the clusterID. Is it possible
> >>> > to handle the ID attribute somehow? Maybe the order of the output
> >>> > data is the same as the input data? Can anyone confirm this?
> >>> >
> >>> > Thank you very much,
> >>> > Gabor Makrai
> >>> >
> >>>
> >>> --
> >>> Lance Norskog
> >>> goksron@gmail.com
> >>
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
>
> --
> Lance Norskog
> goksron@gmail.com
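
The labeling rule from the reply at the top of this thread (a product takes its row labels from the left operand and its column labels from the right operand; a sum keeps the labels of its operands) can be sketched in plain Java. This is only an illustration under stated assumptions: `LabeledMatrix`, `times`, `plus`, and `merge` are hypothetical names and are not part of Mahout, matrices are assumed non-empty and rectangular, and letting an unlabeled operand of a sum adopt the other operand's labels is one possible reading of "if any".

```java
import java.util.Arrays;

/** Hypothetical helper (not a Mahout class): a dense matrix with optional labels. */
final class LabeledMatrix {
    final double[][] values;   // assumed non-empty and rectangular
    final String[] rowLabels;  // null means "no row labels"
    final String[] colLabels;  // null means "no column labels"

    LabeledMatrix(double[][] values, String[] rowLabels, String[] colLabels) {
        if (values.length == 0 || values[0].length == 0) {
            throw new IllegalArgumentException("matrix must be non-empty");
        }
        if (rowLabels != null && rowLabels.length != values.length) {
            throw new IllegalArgumentException("one row label per row expected");
        }
        if (colLabels != null && colLabels.length != values[0].length) {
            throw new IllegalArgumentException("one column label per column expected");
        }
        this.values = values;
        this.rowLabels = rowLabels;
        this.colLabels = colLabels;
    }

    /** Product: row labels come from the left operand, column labels from the right. */
    LabeledMatrix times(LabeledMatrix right) {
        int n = values.length;
        int k = right.values.length;
        int m = right.values[0].length;
        if (values[0].length != k) {
            throw new IllegalArgumentException("inner dimensions differ");
        }
        double[][] out = new double[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                for (int t = 0; t < k; t++) {
                    out[i][j] += values[i][t] * right.values[t][j];
                }
            }
        }
        return new LabeledMatrix(out, rowLabels, right.colLabels);
    }

    /** Sum: the result keeps the operands' labels; conflicting labels are rejected. */
    LabeledMatrix plus(LabeledMatrix other) {
        if (values.length != other.values.length
                || values[0].length != other.values[0].length) {
            throw new IllegalArgumentException("shapes differ");
        }
        double[][] out = new double[values.length][values[0].length];
        for (int i = 0; i < out.length; i++) {
            for (int j = 0; j < out[i].length; j++) {
                out[i][j] = values[i][j] + other.values[i][j];
            }
        }
        return new LabeledMatrix(out,
                merge(rowLabels, other.rowLabels),
                merge(colLabels, other.colLabels));
    }

    /** Assumption: an unlabeled operand adopts the labels of the labeled one. */
    private static String[] merge(String[] a, String[] b) {
        if (a == null) return b;
        if (b == null) return a;
        if (!Arrays.equals(a, b)) {
            throw new IllegalArgumentException("label mismatch in sum");
        }
        return a;
    }

    public static void main(String[] args) {
        // Two records with made-up ids as row labels and two features as columns.
        LabeledMatrix data = new LabeledMatrix(
                new double[][] {{1, 2}, {3, 4}},
                new String[] {"rec-1", "rec-2"},
                new String[] {"x", "y"});
        // A made-up 2x3 factor whose columns are named f1..f3.
        LabeledMatrix factor = new LabeledMatrix(
                new double[][] {{1, 0, 2}, {0, 1, 3}},
                new String[] {"x", "y"},
                new String[] {"f1", "f2", "f3"});
        LabeledMatrix product = data.times(factor);
        System.out.println(Arrays.toString(product.rowLabels));
        System.out.println(Arrays.toString(product.colLabels));
    }
}
```

In the demo, the record ids stay attached to the rows of the product while the factor names become its column labels, which is the bookkeeping Lance was asking about. Mahout's NamedVector, as described in the thread, attaches a name to a single vector; this sketch only shows how such names could be tracked once vectors are stacked into matrices.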