mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Text clustering with hashing vector encoders
Date Wed, 19 Mar 2014 19:24:24 GMT
With text hashing, you have an issue because of collisions.  In spite of
this, you get good results and can decrease the dimension of the data
substantially using a single hashed location.

If you use more than one probe, the probability that two words will hash to
exactly the same two locations is much lower than the probability of a
single collision.  This means that there is a much better chance of being
able to distinguish important differences.  Practically, this allows you to
use yet smaller vectors to represent your data.  The down-side of this is
that you will have more non-zero values which increase the amount of math
that you actually need to do.  The trade-off is more complex than it looks
since the models you build from these values often has dense vectors
internally.  Thus, it may be good to hash down to 100K dimensions with
several probes rather than down to 100 million dimensions with a single
probe.

In the limit, you are basically doing a random projection except that by
preserving substantial sparsity, you can still get the benefits of L1
regularization.




On Tue, Mar 18, 2014 at 9:23 PM, Andrew Musselman <
andrew.musselman@gmail.com> wrote:

> How does "with multiple probes" affect distance preservation, and how would
> idf weighting get tricky just by hashing strings?
>
> Would we be computing distance between hashed strings, or distance between
> vectors based on counts of hashed strings?
>
>
> On Tue, Mar 18, 2014 at 8:50 PM, Suneel Marthi <suneel_marthi@yahoo.com
> >wrote:
>
> > +1 to this. We could then use Hamming Distance to compute the distances
> > between Hashed Vectors.
> >
> > We have  the code for HashedVector.java based on Moses Charikar's SimHash
> > paper.
> >
> >
> >
> >
> >
> >
> >
> > On Tuesday, March 18, 2014 7:14 PM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> > Yes.  Hashing vector encoders will preserve distances when used with
> > multiple probes.
> >
> > Interpretation becomes somewhat difficult, but there is code available to
> > reverse engineer labels on hashed vectors.
> >
> > IDF weighting is slightly tricky, but quite doable if you keep a
> dictionary
> > of, say, the most common 50-200 thousand words and assume all others have
> > constant and equal frequency.
> >
> >
> >
> >
> > On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten <frank@frankscholten.nl
> > >wrote:
> >
> > > Hi all,
> > >
> > > Would it be possible to use hashing vector encoders for text clustering
> > > just like when classifying?
> > >
> > > Currently we vectorize using a dictionary where we map each token to a
> > > fixed position in the dictionary. After the clustering we use have to
> > > retrieve the dictionary to determine the cluster labels.
> > > This is quite a complex process where multiple outputs are read and
> > written
> > > in the entire clustering process.
> > >
> > > I think it would be great if both algorithms could use the same
> encoding
> > > process but I don't know if this is possible.
> > >
> > > The problem is that we lose the mapping between token and position when
> > > hashing. We need this mapping to determine cluster labels.
> > >
> > > However, maybe we could make it so hashed encoders can be used and that
> > > determining top labels is left to the user. This might be a possibility
> > > because I noticed a problem with the current cluster labeling code.
> This
> > is
> > > what happens: first vectors are vectorized with TF-IDF and clustered.
> > Then
> > > the labels are ranked, but again according to TF-IDF, instead of TF. So
> > it
> > > is possible that a token becomes the top ranked label, even though it
> is
> > > rare within the cluster. The document with that token is in the cluster
> > > because of other tokens. If the labels are determined based on a TF
> score
> > > within the cluster I think you would have better labels. But this
> > requires
> > > a post-processing step on your original data and doing a TF count.
> > >
> > > Thoughts?
> > >
> > > Cheers,
> > >
> > > Frank
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message