mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Preserve contents of keys after running k-means
Date Sat, 06 Jul 2013 16:53:32 GMT
OK, squeaky wheel alert...

When I use kmeans I'm interested primarily in the cluster membership, but almost as much in
the distance to the centroid for ordering purposes. I'd also like the cluster list to contain
any secondary vector ids that I've used for the vectors, like names. The pdf makes sense for fuzzy
clustering, where it takes the place of distance to centroid in ordering. If all these optional
values were treated as property lists attached to the named or id'd vector, then they
could be kept with the vectors and follow them through any further processing.
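
For illustration, here is a minimal Python sketch of the easy case, where the set of names is small enough to hold in memory. It is not Mahout code: the `NameIndex` class and its methods are hypothetical names, and the sample keys are borrowed from the original question at the bottom of the thread.

```python
class NameIndex:
    """Bidirectional in-memory map between string names and dense int ids.

    Hypothetical helper for illustration only; not part of Mahout.
    """

    def __init__(self):
        self._id_by_name = {}   # name -> int id
        self._name_by_id = []   # int id -> name (position is the id)

    def id_for(self, name):
        """Return the id for name, assigning the next free id on first sight."""
        if name not in self._id_by_name:
            self._id_by_name[name] = len(self._name_by_id)
            self._name_by_id.append(name)
        return self._id_by_name[name]

    def name_for(self, idx):
        """Return the name that was assigned the given id."""
        return self._name_by_id[idx]


index = NameIndex()
# Assign ids while writing the input vectors; repeated names reuse their id.
ids = [index.id_for(name) for name in ["12345", "98765", "12345"]]
```

After clustering, `index.name_for(i)` would turn an integer id back into the original name, provided the ids written with the vectors are the ones that come back out.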

On Jul 5, 2013, at 10:28 PM, Andrew Musselman <andrew.musselman@gmail.com> wrote:

I want to have the core feature of k-means which is to find out which vectors landed in what
cluster, and I'm open to discussion beyond that.

Best
Andrew

On Jul 5, 2013, at 5:43 PM, Pat Ferrel <pat@occamsmachete.com> wrote:

> I think https://issues.apache.org/jira/browse/MAHOUT-1030 may be the wrong issue #. 
> 
> The problem is that the Names from NamedVectorWritable are not used in the cluster map
> after kmeans. You need to maintain your own map of your vector name to internal Mahout id
> ints. NamedVectors work all the way through from vector creation out of raw docs, TFIDF weighting,
> etc. but the Names are not used in id-ing the list of vectors assigned to clusters.
> 
> It's been an issue on my wish list for Mahout: general, universal support for named
> vectors or, better yet, property vectors (where any number of properties can be attached to
> a vector). A truly scalable non-DB string<->int index creation and lookup (mapreduce
> version) is doable but not trivial. If you don't have too many names for an in-memory hashmap, you
> have a much easier time of it.
> 
> 
> On Jul 5, 2013, at 2:53 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> 
> Andrew,
> 
> I was being somewhat stupid.  You are talking about a parallel program.
> There is no single counter.
> 
> The row number is what I was referring to.  Each process will have
> consecutive row numbers starting at 0.  These rows will correspond to a
> sequence of rows in the original data.  If you can cause each process to
> record these id's as they go by, you have the thing you need.
> 
> I haven't looked at this code in several years, however, so my suggestions
> may well be quite far from reasonable.
> 
> 
> 
> On Fri, Jul 5, 2013 at 2:34 PM, Andrew Musselman <andrew.musselman@gmail.com> wrote:
> 
>> Ted, I'm having a tough time finding the "internal ids" you mentioned.
>> Where are they output?
>> 
>> Thanks
>> 
>> 
>> On Fri, Jul 5, 2013 at 2:10 PM, Andrew Musselman <andrew.musselman@gmail.com> wrote:
>> 
>>> :)
>>> 
>>> Aha, we were only looking in the points directory, not inside the
>>> clustered points directory.  So if I understand, you're suggesting that we
>>> use the key at the beginning of the clustered points as a one-to-one map.
>>> The number of unique keys in the output doesn't seem to line up with that
>>> in the input.
>>> 
>>> We may do our dumb idea for now until we get a better handle on how the
>>> output is written.
>>> 
>>> Thanks!
>>> 
>>> 
>>> On Fri, Jul 5, 2013 at 1:57 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>> 
>>>> Andrew,
>>>> 
>>>> That is a pretty clever solution.
>>>> 
>>>> I think that you can get by with a simpler solution by noting how the
>>>> internal id's are assigned (sequentially, I think).
>>>> 
>>>> 
>>>> 
>>>> On Fri, Jul 5, 2013 at 1:53 PM, Andrew Musselman <andrew.musselman@gmail.com> wrote:
>>>> 
>>>>> So how are people working around this without patching 0.7? Downgrading to
>>>>> 0.6?
>>>>> 
>>>>> We're on a cluster where we don't have admin rights to patch Mahout.
>>>>> 
>>>>> Our dumb idea now is to hash the concatenated values of each vector and
>>>>> pair that up with our original ids, then run another process on the points
>>>>> results to hash the results, then join up on hash value to pull id together
>>>>> with cluster #.
>>>>> 
>>>>> Anyone have a nicer solution to this at hand?
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Jul 5, 2013 at 1:02 PM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:
>>>>> 
>>>>>> Andrew,
>>>>>> 
>>>>>> This feature was available prior to Mahout 0.7 (clustering had support for
>>>>>> Named Vectors) and was broken later. While this may not be fixed in the
>>>>>> soon to be Mahout 0.8, there is a JIRA that's open for this -
>>>>>> https://issues.apache.org/jira/browse/MAHOUT-1030 that's been targeted
>>>>>> for 0.9. Please feel free to submit a patch if you would like to take a
>>>>>> shot at it.
>>>>>> 
>>>>>> Suneel
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ________________________________
>>>>>> From: Andrew Musselman <andrew.musselman@gmail.com>
>>>>>> To: user@mahout.apache.org
>>>>>> Sent: Friday, July 5, 2013 3:05 PM
>>>>>> Subject: Preserve contents of keys after running k-means
>>>>>> 
>>>>>> 
>>>>>> Hi list
>>>>>> 
>>>>>> We are trying to do some k-means clustering and are wondering if there's an
>>>>>> easy way to preserve the contents of the keys for the input records.
>>>>>> 
>>>>>> E.g.
>>>>>> 
>>>>>> 12345: (0,3,79,80)
>>>>>> 98765: (1,4,98,90)
>>>>>> 
>>>>>> where the vectors being clustered are the tuples and the keys are some id.
>>>>>> 
>>>>>> When we run clusterdump with pointsDir specified we have the vectors but
>>>>>> not the keys.  We're looking at NamedVector as a path to this solution, as
>>>>>> well as looking at a mapping file between ordered integers and the ids in
>>>>>> order.
>>>>>> 
>>>>>> Thanks for any advice.
>>>>>> 
>>>>>> Best
>>>>>> Andrew
> 
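
The hash-and-join workaround described in the thread (hash each input vector, pair the hash with the original id, hash the clustered points, then join on the hash) could look roughly like the Python sketch below. This is only an illustration under stated assumptions: the function names, the cluster numbers 7 and 2, and the already-parsed `(cluster_id, vector)` tuples are all made up, and reading Mahout's actual output files is not shown.

```python
import hashlib


def vector_hash(values):
    """Hash the concatenated values of a vector into a stable hex digest."""
    # Normalise to float so that 3 and 3.0 produce the same text.
    text = ",".join(repr(float(v)) for v in values)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def join_ids_to_clusters(keyed_vectors, clustered_points):
    """Join original ids to cluster ids on the hash of the vector contents.

    keyed_vectors:    iterable of (original_id, vector)   (the input records)
    clustered_points: iterable of (cluster_id, vector)    (assumed parsed output)
    Returns a dict mapping original_id -> cluster_id.
    """
    ids_by_hash = {}
    for original_id, vector in keyed_vectors:
        # Identical vectors share a hash, so keep a list of ids per hash.
        ids_by_hash.setdefault(vector_hash(vector), []).append(original_id)

    assignments = {}
    for cluster_id, vector in clustered_points:
        for original_id in ids_by_hash.get(vector_hash(vector), []):
            assignments[original_id] = cluster_id
    return assignments


# Input records from the original question; cluster ids are illustrative.
inputs = [("12345", (0, 3, 79, 80)), ("98765", (1, 4, 98, 90))]
clustered = [(7, (1.0, 4.0, 98.0, 90.0)), (2, (0.0, 3.0, 79.0, 80.0))]
assignments = join_ids_to_clusters(inputs, clustered)
```

Two caveats with this approach: records with identical vectors cannot be told apart, and the join only works if the vector values come back from clustering exactly as they went in.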

