mahout-user mailing list archives

From Frank Scholten <fr...@frankscholten.nl>
Subject Text clustering with hashing vector encoders
Date Tue, 18 Mar 2014 21:40:31 GMT
Hi all,

Would it be possible to use hashing vector encoders for text clustering
just like when classifying?

Currently we vectorize using a dictionary that maps each token to a
fixed index in the vector. After the clustering we have to
retrieve the dictionary to determine the cluster labels.
This is quite a complex process in which multiple outputs are read and
written over the course of the clustering run.

I think it would be great if both algorithms could use the same encoding
process but I don't know if this is possible.

The problem is that we lose the mapping between token and position when
hashing. We need this mapping to determine cluster labels.
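
To make the lost mapping concrete, here is a minimal sketch in plain Python (not Mahout's encoder API). The function name hash_encode, the choice of CRC32 as the hash, and the optional trace table are all assumptions made for illustration. The trace table records which tokens landed on which index, which is the information a dictionary would otherwise provide.

```python
import zlib
from collections import defaultdict


def hash_encode(tokens, num_features, trace=None):
    """Encode tokens into a fixed-size count vector via the hashing trick.

    `trace` is a hypothetical side table (index -> set of tokens) that
    keeps the reverse mapping hashing would otherwise discard.
    """
    vector = [0.0] * num_features
    for token in tokens:
        # Deterministic hash so the same token always maps to the same index.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1.0
        if trace is not None:
            trace[index].add(token)
    return vector


trace = defaultdict(set)
doc = ["text", "clustering", "with", "hashing", "text"]
vec = hash_encode(doc, num_features=16, trace=trace)

# Without `trace`, an index in `vec` cannot be turned back into a token.
text_index = zlib.crc32(b"text") % 16
print(text_index, sorted(trace[text_index]))
```

Because different tokens can collide on the same index, the trace can list several tokens per position, so any label derived from it is only a candidate.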

However, maybe we could make it so hashed encoders can be used and that
determining top labels is left to the user. This might be a possibility
because I noticed a problem with the current cluster labeling code. This is
what happens: first the documents are vectorized with TF-IDF and clustered.
Then the labels are ranked, but again according to TF-IDF, instead of TF. So it
is possible that a token becomes the top ranked label, even though it is
rare within the cluster. The document with that token is in the cluster
because of other tokens. If the labels are determined based on a TF score
within the cluster I think you would have better labels. But this requires
a post-processing step on your original data and doing a TF count.
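
The difference between the two rankings can be sketched as follows. This is a simplified illustration in plain Python, not the Mahout labeling code: the function names, the summed TF-IDF score, and the toy document frequencies are all assumptions.

```python
import math
from collections import Counter


def top_labels_by_tf(cluster_docs, n=2):
    """Rank tokens by their raw term count within the cluster."""
    counts = Counter()
    for tokens in cluster_docs:
        counts.update(tokens)
    return [token for token, _ in counts.most_common(n)]


def top_labels_by_tfidf(cluster_docs, doc_freq, num_docs, n=2):
    """Rank tokens by summed TF-IDF, with IDF taken over the whole corpus."""
    scores = Counter()
    for tokens in cluster_docs:
        for token, tf in Counter(tokens).items():
            scores[token] += tf * math.log(num_docs / doc_freq[token])
    return [token for token, _ in scores.most_common(n)]


# Toy cluster: "zeppelin" appears once, in one document only.
cluster = [
    ["mahout", "clustering", "vector"],
    ["mahout", "clustering", "kmeans"],
    ["mahout", "clustering", "zeppelin"],
]
# Hypothetical corpus statistics (1000 documents in total).
doc_freq = {"mahout": 500, "clustering": 400, "vector": 300,
            "kmeans": 200, "zeppelin": 1}

tf_labels = top_labels_by_tf(cluster)
tfidf_labels = top_labels_by_tfidf(cluster, doc_freq, num_docs=1000)
print(tf_labels, tfidf_labels)
```

In this toy data the rare token wins the TF-IDF ranking because of its large IDF, while the TF ranking picks the tokens that are actually frequent in the cluster. The TF variant only needs the original tokens of the documents in each cluster, so it would also work when the vectors themselves were hashed.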

Thoughts?

Cheers,

Frank
