Mailing-List: contact user-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@mahout.apache.org
Received-SPF: pass (nike.apache.org: domain of ted.dunning@gmail.com
 designates 209.85.216.42 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAK9GPcRnMgpUi1arb446FAq2OYimpp3oKtqbuHuqyaLHCXo_yw@mail.gmail.com>
References: 
 <CAMHOx9atCtVextvQGKF3U-09tQbj9ya+P8g8+UrAuYRt8ps1ow@mail.gmail.com>
 <CAK9GPcRWZtX5xWQHs9q7nGpFjiYCOpu11C+xc783BGM6gpF6FQ@mail.gmail.com>
 <CAMHOx9Z11V57jOu1A-jUkZ25aUhe3SnSooPfMreFFyri=HU1xw@mail.gmail.com>
 <CAK9GPcQhjAoHm2YYPXsioMLYUqDLNs0Pgs=YYRw3GM=rx9i1QQ@mail.gmail.com>
 <CAK9GPcRnMgpUi1arb446FAq2OYimpp3oKtqbuHuqyaLHCXo_yw@mail.gmail.com>
From: Ted Dunning <ted.dunning@gmail.com>
Date: Mon, 11 Jul 2011 21:36:32 -0700
Message-ID: 
 <CAJwFCa2SpKjtSw06WB7zeOm+GQuHh1+1mQxEd8=haq=5MgNeEw@mail.gmail.com>
Subject: Re: Clustering with id
To: user@mahout.apache.org
Content-Type: multipart/alternative; boundary=0016360e3b6249f38704a7d7d92b

--0016360e3b6249f38704a7d7d92b
Content-Type: text/plain; charset=UTF-8

Can you give specific examples?  The process should be relatively
straightforward, and two rules should be sufficient.  In a product, the
rows take their labels from the left operand and the columns take their
labels from the right operand.  In a sum, both operands should have the
same row and column labels, if any.  Everything else should follow from
these constraints.
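
To make those two rules concrete, here is a minimal sketch in plain Java.
LabeledMatrix is a hypothetical helper written only for this illustration;
it is not a Mahout class, and it assumes that labels are always present
(the "if any" case of unlabeled operands is not handled).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Mahout): a dense matrix with row and column labels. */
final class LabeledMatrix {
    final List<String> rowLabels;
    final List<String> colLabels;
    final double[][] values;

    LabeledMatrix(List<String> rowLabels, List<String> colLabels, double[][] values) {
        if (values.length != rowLabels.size()) {
            throw new IllegalArgumentException("need one row label per row");
        }
        for (double[] row : values) {
            if (row.length != colLabels.size()) {
                throw new IllegalArgumentException("need one column label per column");
            }
        }
        this.rowLabels = rowLabels;
        this.colLabels = colLabels;
        this.values = values;
    }

    /** Product: row labels come from the left operand, column labels from the right. */
    LabeledMatrix times(LabeledMatrix right) {
        if (colLabels.size() != right.rowLabels.size()) {
            throw new IllegalArgumentException("inner dimensions differ");
        }
        int n = rowLabels.size();
        int k = colLabels.size();
        int m = right.colLabels.size();
        double[][] out = new double[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                for (int p = 0; p < k; p++) {
                    out[i][j] += values[i][p] * right.values[p][j];
                }
            }
        }
        return new LabeledMatrix(rowLabels, right.colLabels, out);
    }

    /** Sum: both operands must carry the same labels, and the result keeps them. */
    LabeledMatrix plus(LabeledMatrix other) {
        if (!rowLabels.equals(other.rowLabels) || !colLabels.equals(other.colLabels)) {
            throw new IllegalArgumentException("labels differ");
        }
        double[][] out = new double[rowLabels.size()][colLabels.size()];
        for (int i = 0; i < out.length; i++) {
            for (int j = 0; j < out[i].length; j++) {
                out[i][j] = values[i][j] + other.values[i][j];
            }
        }
        return new LabeledMatrix(rowLabels, colLabels, out);
    }
}

class LabeledMatrixDemo {
    public static void main(String[] args) {
        // Illustrative labels and values only.
        LabeledMatrix a = new LabeledMatrix(
            Arrays.asList("doc1", "doc2"), Arrays.asList("t1", "t2", "t3"),
            new double[][] {{1, 0, 2}, {0, 3, 1}});
        LabeledMatrix b = new LabeledMatrix(
            Arrays.asList("t1", "t2", "t3"), Arrays.asList("c1", "c2"),
            new double[][] {{1, 0}, {0, 1}, {1, 1}});
        LabeledMatrix p = a.times(b);
        System.out.println(p.rowLabels + " x " + p.colLabels);
    }
}
```

In this sketch the inner labels (t1..t3) disappear from the product, and
a sum of operands with different labels is rejected rather than guessed
at.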

On Mon, Jul 11, 2011 at 8:59 PM, Lance Norskog <goksron@gmail.com> wrote:

> I mean, walking through the algorithms and tracking what vector name
> becomes what matrix row/column label.
>
> On Mon, Jul 11, 2011 at 8:58 PM, Lance Norskog <goksron@gmail.com> wrote:
> > I'm finding it hard to maintain these labels across vector and matrix
> > factorizations & direct operations.
> >
> > On Mon, Jul 11, 2011 at 1:10 AM, Gabor Makrai <makrai.list@gmail.com> wrote:
> >> Thank you very much! NamedVector should solve my problem!
> >> By the way, I'm always amazed at how fast answers come on the Hadoop lists!
> >>
> >> Thank you,
> >> Gabor
> >>
> >> On Mon, Jul 11, 2011 at 3:51 AM, Lance Norskog <goksron@gmail.com> wrote:
> >>
> >>> The NamedVector class adds a string to any vector, forwarding all
> >>> methods to the wrapped vector. You can cluster these, and then pull
> >>> the strings. The clustering algorithm operates on the wrapped vector.
> >>>
> >>> Lance
> >>>
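
The wrapping idea described above can be sketched in plain Java without
any Mahout dependency. NamedPoint is a hypothetical stand-in for
NamedVector, and the assignment step is a single nearest-centroid pass
over fixed, illustrative centroids, not Mahout's K-Means job. The point is
only that the distance computation reads the wrapped values while the
name rides along and can be read back afterwards.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for Mahout's NamedVector: a name plus the wrapped values. */
final class NamedPoint {
    final String name;
    final double[] values;

    NamedPoint(String name, double[] values) {
        this.name = name;
        this.values = values;
    }
}

final class NamedClusteringSketch {
    /** Index of the nearest centroid by squared Euclidean distance; the name is never read. */
    static int nearest(double[] values, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int i = 0; i < values.length; i++) {
                double diff = values[i] - centroids[c][i];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Assigns each point to a cluster and returns name -> cluster index in input order. */
    static Map<String, Integer> assign(NamedPoint[] points, double[][] centroids) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (NamedPoint p : points) {
            result.put(p.name, nearest(p.values, centroids));
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative four-column rows with generated ids.
        NamedPoint[] points = {
            new NamedPoint("row-1", new double[] {5.1, 3.5, 1.4, 0.2}),
            new NamedPoint("row-2", new double[] {6.7, 3.0, 5.2, 2.3}),
        };
        double[][] centroids = {{5.0, 3.4, 1.5, 0.2}, {6.5, 3.0, 5.5, 2.0}};
        System.out.println(assign(points, centroids));
    }
}
```

With Mahout itself, the corresponding step would presumably be to wrap
each input as new NamedVector(new DenseVector(values), id) before writing
it out, and to call getName() on the vectors found in the clustering
output.
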
> >>> On Sun, Jul 10, 2011 at 4:18 PM, Gabor Makrai <makrai.list@gmail.com>
> >>> wrote:
> >>> > Hi,
> >>> >
> >>> > I'm a little bit confused about Mahout's clustering algorithms. I
> >>> > would like to cluster data that has an id column. How can I do that?
> >>> > For example, I would like to run K-Means clustering on the Iris data
> >>> > set (http://archive.ics.uci.edu/ml/datasets/Iris), where I have four
> >>> > numerical columns. I generated an id column to identify the records,
> >>> > and when the clustering is done, I would like to see the results.
> >>> > When I examined the code, I realized that I can create DenseVector
> >>> > instances (with the four numerical columns, without the id) and write
> >>> > them in VectorWritable format. These were my input data. After I
> >>> > managed to run K-Means, I got IntWritable/WeightedVectorWritable
> >>> > key/value pairs, where the keys tell me the clusterID. Is it possible
> >>> > to carry the ID attribute through somehow? Or maybe the order of the
> >>> > output data is the same as the input data? Can anyone confirm this?
> >>> >
> >>> > Thank you very much,
> >>> > Gabor Makrai
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Lance Norskog
> >>> goksron@gmail.com
> >>>

--0016360e3b6249f38704a7d7d92b--