mahout-dev mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-1117) Vectors are not hashable
Date Fri, 16 Nov 2012 18:14:12 GMT


Sean Owen commented on MAHOUT-1117:

I think the issue is that some subclasses specialize one but not the other. This could lead
to funny behavior (not just slowness) if used in a context where these methods matter. I think
the idea is that they shouldn't be, but it would be a plus to add (consistent) versions of
these methods and/or fix up the points where they are not consistent. That's the essence of
the question here I think.

The problem is that the apparent desired definition of equality involves comparing the weights
for equality up to some epsilon. This isn't transitive to start, and for similar reasons,
it's going to be hard or impossible to write a hashCode() that's consistent with it. You have
to demand strict equality really.
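
To make the transitivity point concrete, here is a minimal sketch in plain Java. It is not Mahout code: the EPSILON value and the approxEquals helper are made up for illustration, and a single double stands in for a vector's values.

```java
final class EpsilonEqualityDemo {
    // Hypothetical tolerance, chosen only for this illustration.
    static final double EPSILON = 0.1;

    // "Equal up to some epsilon", applied to a single double for simplicity.
    static boolean approxEquals(double a, double b) {
        return Math.abs(a - b) <= EPSILON;
    }

    public static void main(String[] args) {
        double a = 0.0;
        double b = 0.0625;
        double c = 0.125;

        System.out.println(approxEquals(a, b)); // true
        System.out.println(approxEquals(b, c)); // true
        System.out.println(approxEquals(a, c)); // false: not transitive

        // a and b count as "equal", yet a hash of the exact bits differs,
        // which violates the equals()/hashCode() contract.
        System.out.println(Double.hashCode(a) == Double.hashCode(b)); // false
    }
}
```

Since any two finite values can be linked by a chain of steps smaller than epsilon, a hashCode() that agreed with this relation on every pair would have to return the same value for all of them, which defeats the purpose of hashing.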

We haven't even gotten into questions of whether a WeightedVector equals() a DenseVector
with the same values. Right now it does. If you leave that behavior but override equals(),
transitivity breaks down again. So that's another thing that, technically, has to change.
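
A rough sketch of that second problem follows. PlainVec and WeightedVec are hypothetical stand-ins, not the real DenseVector and WeightedVector, and the class hierarchy is an assumption made only to keep the example short. The weighted class compares weights against its own kind but still accepts a plain vector with the same values.

```java
import java.util.Arrays;

// Hypothetical stand-in for a plain vector; not Mahout's DenseVector.
class PlainVec {
    final double[] values;

    PlainVec(double... values) {
        this.values = values.clone();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof PlainVec && Arrays.equals(values, ((PlainVec) o).values);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(values);
    }
}

// Hypothetical stand-in for a weighted vector; not Mahout's WeightedVector.
class WeightedVec extends PlainVec {
    final double weight;

    WeightedVec(double weight, double... values) {
        super(values);
        this.weight = weight;
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof WeightedVec) {
            // Weighted vs. weighted: values and weight must both match.
            return super.equals(o) && weight == ((WeightedVec) o).weight;
        }
        // Weighted vs. plain: the old behavior, values only.
        return super.equals(o);
    }
    // hashCode() is inherited and hashes the values only.
}
```

With p = new PlainVec(1.0, 2.0), w1 = new WeightedVec(1.0, 1.0, 2.0) and w2 = new WeightedVec(2.0, 1.0, 2.0), both w1.equals(p) and p.equals(w2) hold, but w1.equals(w2) does not. Restoring transitivity means a weighted vector can no longer equal a plain one.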

You begin to see it's been a small mess for a while...

> Vectors are not hashable
> ------------------------
>                 Key: MAHOUT-1117
>                 URL:
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 1.0
>            Reporter: Dan Filimon
>            Priority: Minor
> No *Vector classes (DenseVector, WeightedVector, etc.) implement hashCode().
> In working on improving clustering in Mahout, Ted Dunning wrote prototype code for Streaming
> KMeans and Ball KMeans, that I'm working with him on. These need to be used together in the
> MapReduce version.
> However, in Ball KMeans, we initialize the clusters using a probabilistic approach similar
> to k-means++. This however requires a Multinomial<WeightedVector> distribution of the
> points we want to cluster to pick the centroids.
> Internally, the Multinomial<T> uses a HashMap to keep track of the values it can
> sample from.
> Since Vectors don't override Object's hashCode(), it is possible to get the same value
> multiple times in the map (as long as the references differ).
> This is less of an issue because of how we're adding the vectors to the multinomial (we
> can guarantee that the references will be unique) and once MAHOUT-1116 is resolved the hashing
> will work okay for our needs.
> It still seems that it would be useful to have hashable vectors.
> What do you think? And what would a hash function look like?
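
For reference, the HashMap behavior described in the issue can be reproduced with a small sketch. RawPoint and HashedPoint are hypothetical classes, not Mahout vectors, and the strict, value-based equals()/hashCode() pair built on java.util.Arrays is only one possible answer to the closing question, not a proposal for the actual patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class HashableDemo {
    // Hypothetical point that inherits Object's identity-based equals()/hashCode().
    static final class RawPoint {
        final double[] values;

        RawPoint(double... values) {
            this.values = values.clone();
        }
    }

    // Hypothetical point with strict, value-based equals() and hashCode().
    static final class HashedPoint {
        final double[] values;

        HashedPoint(double... values) {
            this.values = values.clone();
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof HashedPoint && Arrays.equals(values, ((HashedPoint) o).values);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(values);
        }
    }

    // Two distinct references with the same values become two map entries.
    static int countRaw() {
        Map<RawPoint, Integer> map = new HashMap<>();
        map.put(new RawPoint(1.0, 2.0), 1);
        map.put(new RawPoint(1.0, 2.0), 1);
        return map.size();
    }

    // With value-based hashing, the second put replaces the first.
    static int countHashed() {
        Map<HashedPoint, Integer> map = new HashMap<>();
        map.put(new HashedPoint(1.0, 2.0), 1);
        map.put(new HashedPoint(1.0, 2.0), 1);
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println(countRaw());    // 2
        System.out.println(countHashed()); // 1
    }
}
```

Arrays.equals and Arrays.hashCode both work on the exact bit patterns of the doubles, so the pair stays consistent, at the cost of giving up any epsilon tolerance.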

