mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: String clustering and other newbie questions
Date Fri, 28 Aug 2009 18:15:22 GMT
To cluster strings, you need a distance between "centroids" and strings.
The Dirichlet process (DP) clustering code could handle this, but the rest
of the clustering code could not.  In DP clustering, the clusters would be
parametrized models that describe the probability of generating a string,
instead of just being multi-dimensional points.  The similarity of a string
to a model is then interpreted as the probability of the string given the model.
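
A minimal sketch of that idea, in Java since that is what Mahout is written
in.  None of this is Mahout API: the class CharUnigramModel and its methods
are hypothetical names, and a smoothed character-unigram model is just an
assumed stand-in for whatever string model fits the data.  It only shows a
"centroid" that scores a string by log P(string | model) instead of by a
vector distance.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not Mahout API): a character-unigram model used as a "centroid". */
final class CharUnigramModel {
    // Assumption: smooth over every possible Java char so probabilities sum to 1.
    private static final int ALPHABET_SIZE = Character.MAX_VALUE + 1;

    private final Map<Character, Integer> counts = new HashMap<>();
    private int total = 0;

    /** Fold one string into the model's character counts. */
    void observe(String s) {
        for (char c : s.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
            total++;
        }
    }

    /** log P(s | model) with add-one smoothing, so unseen characters keep non-zero probability. */
    double logProbability(String s) {
        double logP = 0.0;
        for (char c : s.toCharArray()) {
            int n = counts.getOrDefault(c, 0);
            logP += Math.log((n + 1.0) / (total + ALPHABET_SIZE));
        }
        return logP;
    }

    /** Index of the model under which s is most probable (a hard assignment). */
    static int mostLikely(List<CharUnigramModel> models, String s) {
        int best = 0;
        for (int i = 1; i < models.size(); i++) {
            if (models.get(i).logProbability(s) > models.get(best).logProbability(s)) {
                best = i;
            }
        }
        return best;
    }
}
```

Note that mostLikely makes a hard assignment to the single best model, which
is a simplification for illustration.  A domain-specific string model would
replace the unigram counts, but the interface (observe strings, report the
probability of a string) would stay the same.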

On Fri, Aug 28, 2009 at 11:09 AM, Jeff Eastman
<jdog@windwardsolutions.com> wrote:

> Well, all of the clustering code is based upon clustering points in an
> n-dimensional vector space and all of the APIs operate upon Vectors. We do
> support the ability to attach a label binding Map to a Vector which can map
> Strings into integer index values. Once this has been done you can access
> the vector values symbolically. I'm not sure this will help with your
> problem and you may need to write your own Canopy.
>
> If you can post some examples of the values you wish to cluster and
> something of your distance measure then I will see if I can figure out a way
> to help you further.
>
> Jeff
>
>
>
> Juan Francisco Contreras Gaitan wrote:
>
>> Thank you so much for your quick reply.
>>
>> Unfortunately, I'm afraid that there is no way of massaging my strings
>> into doubles, because the distance measure would make no sense in terms of
>> doubles. Could you please give me some pointers on writing the code
>> required to get around this difficulty?
>>
>> Thank you very much again.
>>
>> Regards,
>> jfcg
>>
>>
>>
>>> Date: Fri, 28 Aug 2009 08:49:38 -0700
>>> From: jdog@windwardsolutions.com
>>> To: mahout-user@lucene.apache.org
>>> Subject: Re: String clustering and other newbie questions
>>>
>>> Juan Francisco Contreras Gaitan wrote:
>>>
>>>
>>>> Hello,
>>>>
>>>> I would like to do some clustering by using Hadoop and I found Mahout. I
>>>> am really impressed, but as a newbie I got stuck and I have several
>>>> questions. The idea is to do string clustering: I have property values
>>>> of some resources expressed as strings, and I would like to aggregate these
>>>> resources. I use Eclipse as IDE, and I have two working Mahout projects, one
>>>> with the release version (0.1) and the other with the SVN version. I am able to
>>>> compile the examples and run them on my own Hadoop cluster. I have focused on
>>>> the Synthetic Control Data example using the Canopy algorithm because of its
>>>> similarity to my problem.
>>>>
>>>> - on the release version with default parameter values I get all the items
>>>> in the same cluster (C1). Is that normal?
>>>>
>>>>
>>> Are you running the Synthetic Control example data here? That example - I
>>> just ran it on trunk - should produce 6 clusters in one file. It is binary
>>> encoded though, and difficult to interpret in textual representation. If you
>>> search for the string 'SparseVector' in the canopies/part-0000 file you
>>> should see six instances.
>>>
>>>
>>>> - on the SVN version I don't get readable output because there is no
>>>> implemented OutputDriver. If I use the one from the release version, I get
>>>> exceptions (I think that the format has changed between releases, for example
>>>> using the '{' symbol instead of '[')
>>>>
>>>>
>>> The output formats of all the clustering routines are now sequence files
>>> which are binary encoded. The old OutputDriver won't handle it.
>>>
>>>
>>>> - I use string values instead of double values. I have implemented my
>>>> own string distance that returns a double when the parameters are strings, but I
>>>> think that Mahout Vectors are implemented just to store double values. Is
>>>> there any chance to use string values?
>>>>
>>>>
>>> Vectors are double only and you will need to massage your data into
>>> numeric format to use out of the box clustering. Is there a way to convert
>>> your property values into doubles?
>>>
>>>
>>>> I would be very grateful if anyone could help me.
>>>>
>>>>
>>> I'm going to be working on converting clustering to Hadoop 0.20 in the
>>> next weeks. Let's continue our dialog.
>>>
>>>
>>>> Thank you very much in advance.
>>>>
>>>> Regards,
>>>> jfcg
>>>>
>>>>
>>>>
>>>
>>
>>
>
>


-- 
Ted Dunning, CTO
DeepDyve
