mahout-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Two quick Mahout questions
Date Fri, 22 Jul 2011 05:23:10 GMT
Copying dev list with Brett's permission.

On Thu, Jul 21, 2011 at 8:00 PM, Brett Wines <brett.wines@seravia.com> wrote:

>
> I'm writing in response to a question on a Mahout forum <http://search.lucidimagination.com/search/document/943a194edea159fc/string_clustering_and_other_newbie_questions>;
> I was wondering if you could answer a question or two for me?
>

Sure.


> First, do you know if there's a good way to plug in one's own
> centroid-computing function for Mahout algorithms like k-means or EM?
>

Actually, I am not entirely sure.  I think that there is.

It is definitely true that there is a good way to plug in a new distance
function for computing cluster membership.
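
As a rough illustration of what "plugging in a distance function" means, here
is a minimal sketch in plain Python. It is not Mahout code and does not use
Mahout's DistanceMeasure interface; the names euclidean, manhattan and assign
are hypothetical. The point is only that the membership step takes the
distance as a parameter, so swapping it can change which cluster a point
lands in.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def assign(points, centroids, distance):
    """Return, for each point, the index of its nearest centroid
    under the supplied distance function."""
    return [
        min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
        for p in points
    ]

points = [[0.0, 0.0], [5.0, 1.0]]
centroids = [[3.0, 3.0], [5.0, 0.0]]

by_euclidean = assign(points, centroids, euclidean)
by_manhattan = assign(points, centroids, manhattan)
```

With these made-up points, the origin is closer to the first centroid under
Euclidean distance but closer to the second under Manhattan distance, so the
two assignments differ.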

Jeff, is it easy to plug in a new centroid function?  I think that you said
yes to this as part of the classification/clustering unification work.

> Second, do you know if there's any way at all to run Mahout clustering
> algorithms on things where the features aren't numbers? The vectors don't
> support anything except for doubles and it's hack-y and messy to map
> non-numerical feature data to arbitrary numbers and then in a custom
> distance function undo the mapping (the DistanceMeasure interface requires
> the comparison function to take in Vectors as parameters) and there's got to
> be a better solution.
>


Hmm... I am not clear on all of your requirements, but there are at least
two methods for doing this.

One commonly used method is classic vector space conversion of text-like
data.  With this, there is one dimension in the feature vector per unique
word.  There is wide support for this with clustering in Mahout.  This is
also easy to reverse engineer, but it doesn't support stupendous or
open-ended vocabularies.
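
A minimal sketch of the vector space idea, in plain Python rather than
Mahout's own vectorizers (build_vocabulary, vectorize and describe are
hypothetical helper names, and whitespace tokenization is an assumption).
Each unique word gets its own dimension, and the word-to-index dictionary is
what makes the vector easy to map back to words.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign one vector dimension to each unique word, in first-seen order."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def vectorize(doc, vocab):
    """Turn a document into a dense vector of word counts.
    Words missing from the vocabulary are silently dropped."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(doc.lower().split()).items():
        if word in vocab:
            vec[vocab[word]] = float(count)
    return vec

def describe(vec, vocab):
    """Reverse the mapping: report the non-zero dimensions as words."""
    index_to_word = {i: w for w, i in vocab.items()}
    return {index_to_word[i]: v for i, v in enumerate(vec) if v != 0.0}

docs = ["red apple", "green apple apple", "red car"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
```

The vector length equals the vocabulary size, which is why this approach
struggles once the vocabulary becomes very large or keeps growing.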

Another method is to use hash-encoding.  This allows continuous, text-like
and word-like data to be combined into a fixed-size vector that is merely
large instead of stupendous in size.  This representation is nice and
consistent, but it can be difficult to reverse engineer.
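
A minimal sketch of hash-encoding, again in plain Python and not Mahout's
encoder classes (bucket and encode are hypothetical names, and the choice of
MD5, the 64-slot default and the key formats are assumptions made for
illustration). Every feature is hashed to a slot in a fixed-size vector, so no
vocabulary has to be stored, but distinct features can collide in one slot.

```python
import hashlib

def bucket(key, size):
    """Map a string key to a slot in [0, size) using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size

def encode(record, size=64):
    """Encode a mixed record into a fixed-size vector.

    - numbers (continuous): value is added at the slot for the field name
    - lists of tokens (text-like): each token adds 1.0 at slot "name:token"
    - other values (word-like): adds 1.0 at slot "name=value"
    """
    vec = [0.0] * size
    for name, value in record.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            vec[bucket(name, size)] += float(value)
        elif isinstance(value, (list, tuple)):
            for token in value:
                vec[bucket(f"{name}:{token}", size)] += 1.0
        else:
            vec[bucket(f"{name}={value}", size)] += 1.0
    return vec

record = {"price": 2.5, "color": "red", "title": ["fast", "red", "car"]}
encoded = encode(record)
```

Because only a slot index survives, there is no direct way to tell which
original feature produced a given non-zero entry, which is the
reverse-engineering difficulty mentioned above.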
