mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Minhash key groups
Date Wed, 09 Nov 2011 00:36:01 GMT
Could this project be done with symbol sequences instead of hash codes? The
advantage of symbol sequences is that you can unpack them.

On Tue, Nov 8, 2011 at 9:54 AM, Vishal Santoshi
<vishal.santoshi@gmail.com> wrote:

> Yep.
>
> By concatenating p hash keys (generated from p hash functions) for each user,
> the probability that any two users agree on a concatenated hash key is
> S(ui,uj)^p, which makes the clusters more refined.
> S(ui,uj) is the Jaccard coefficient (the similarity coefficient).
>
>
> On Tue, Nov 8, 2011 at 12:20 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>
> > From MAHOUT-344, quoting the patch author:
> >
> > The idea behind keyGroups is to concatenate hashes from multiple hash
> > functions to reduce the probability of collision between 2 users that
> > agreed on 1 or more individual hash values. This essentially improves the
> > average similarity of users in a cluster.
> >
> > -Grant
> >
> > On Nov 7, 2011, at 8:54 PM, Suneel Marthi wrote:
> >
> > > Do we have an answer for this?
> > >
> > > Sent from my iPhone
> > >
> > > On Nov 2, 2011, at 7:20 AM, Grant Ingersoll <gsingers@apache.org> wrote:
> > >
> > >> What's the Minhash key groups value used for in the MinhashDriver? I
> > >> mean, I see it is used for building up the key out of the hashed values,
> > >> but what's the significance of different values for it? The default is 2;
> > >> what does it mean, practically speaking, if I choose, say, 10? AFAICT, it
> > >> would mean that I would have more clusters, assuming that we still meet
> > >> the minimum cluster size imposed by the reducer?
> > >>
> > >> Thanks,
> > >> Grant
> >
> >
> >
>
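
The key-group idea in the quoted thread (concatenating p minhash values so that two users share a key with probability roughly S^p) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the linear hash family, and the choice to group consecutive, non-overlapping minhash values. It is not the MinhashDriver code, and Mahout may group values differently.

```python
import random


def make_hash_functions(count, seed=0):
    """Return `count` random linear hashes h(x) = (a*x + b) mod prime (illustrative family)."""
    rng = random.Random(seed)
    prime = 2_147_483_647  # Mersenne prime 2^31 - 1
    params = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(count)]
    return [lambda x, a=a, b=b: (a * x + b) % prime for a, b in params]


def minhash_signature(items, hash_functions):
    """One minhash value per hash function: the smallest hash over the item set."""
    return [min(h(x) for x in items) for h in hash_functions]


def cluster_keys(signature, key_groups):
    """Concatenate `key_groups` consecutive minhash values into each cluster key.

    Assumption: non-overlapping groups; leftover values at the end are dropped.
    """
    last_start = len(signature) - key_groups + 1
    return [
        "-".join(str(v) for v in signature[i:i + key_groups])
        for i in range(0, last_start, key_groups)
    ]


def jaccard(set_a, set_b):
    """Jaccard coefficient S = |A intersect B| / |A union B|."""
    return len(set_a & set_b) / len(set_a | set_b)


def agreement_rate(set_a, set_b, key_groups, num_hashes=1200, seed=42):
    """Fraction of cluster keys on which the two sets agree."""
    hashes = make_hash_functions(num_hashes, seed)
    keys_a = cluster_keys(minhash_signature(set_a, hashes), key_groups)
    keys_b = cluster_keys(minhash_signature(set_b, hashes), key_groups)
    matches = sum(ka == kb for ka, kb in zip(keys_a, keys_b))
    return matches / len(keys_a)


# Two hypothetical users whose item sets overlap on 60 of 100 distinct items.
user_a = set(range(0, 80))
user_b = set(range(20, 100))

s = jaccard(user_a, user_b)
for p in (1, 2, 3):
    print(f"p={p}: observed agreement {agreement_rate(user_a, user_b, p):.3f}, S^p = {s ** p:.3f}")
```

With a larger key-groups value, fewer pairs of users share a key, so the pairs that do land in the same cluster tend to be more similar, at the cost of smaller (and possibly below-minimum-size) clusters. The observed rates only approximate S^p because the linear hash family is not perfectly min-wise independent.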



-- 
Lance Norskog
goksron@gmail.com
