mahout-dev mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Retrieving labels for indexes?
Date Tue, 08 Dec 2009 20:50:13 GMT
For columns of a row-based matrix, I'm down with hashing or whatever.  For
the rows of such matrices, inverting this is sometimes necessary (as Sean's
case shows).  I'd hate to have an API with long row indices and int column
indices though; that would be unacceptable.

  -jake
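
A minimal sketch of the workaround Sean describes in the quoted thread
(fold long IDs into int indexes and keep the reverse mapping so the fold
can be inverted). The class `IdFolder` and its method names are
hypothetical and are not part of Mahout; the fold uses the JDK's
`Long.hashCode`, which is only one possible choice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not Mahout API: folds 64-bit IDs into int indexes
// and remembers which ID first claimed each index.
final class IdFolder {
    private final Map<Integer, Long> indexToId = new HashMap<>();

    // Fold a long ID into a non-negative int index and record the reverse mapping.
    // On a collision the first ID seen for that index is kept.
    int toIndex(long id) {
        int index = Long.hashCode(id) & Integer.MAX_VALUE;
        indexToId.putIfAbsent(index, id);
        return index;
    }

    // Return the original ID recorded for an index, or null if the index is unknown.
    Long toId(int index) {
        return indexToId.get(index);
    }
}
```

Collisions are real under this scheme: the IDs 43 and 2^32 + 42 both fold
to index 43, and only the first one seen is recoverable from `toId`.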

On Tue, Dec 8, 2009 at 11:10 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Systems like Vowpal Wabbit already support billions (and more) features, but
> they do it with the hashing trick and deal with possible collisions by
> multiple hashing.  They claim support for as many as 10^12 features.
>
> As long as it is possible to avoid the overhead, I would be +0.  If the
> overhead applies to all tasks then I would be -1.
>
> Scalability is quite possible without this.
>
> On Tue, Dec 8, 2009 at 3:08 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>
> > How hard would it be to transparently support both?  Could we have one
> > implementation for "smaller" problems and one for larger?
> >
> > At any rate, +1 to making this be available for really large scale.
> >
> > -Grant
> >
> > On Dec 8, 2009, at 3:16 AM, Sean Owen wrote:
> >
> > > I'm sure it's not hard. It makes (sparse) vectors consume that much
> > > more memory though.
> > >
> > > This change would certainly help my case, but I already have a bit of
> > > a workaround: I hash longs into ints and store the reverse mapping.
> > > > There is a possibility of collision but the consequence is small in the
> > > context of collaborative filtering.
> > >
> > > I suppose if I'm the only use case that would benefit at the moment,
> > > maybe not worth it, but if you can think of other reasons, let's
> > > change.
> > >
> > > > On Tue, Dec 8, 2009 at 5:48 AM, Jake Mannix <jake.mannix@gmail.com> wrote:
> > > >> This brings up a point about our linear primitives: are 32bit integers big
> > > >> enough for our index range for vectors and matrices?  Especially for
> > > >> matrices, having billions of rows is completely possible, even if it is
> > > >> on the large side.
> > >>
> > > >> If we want to be about "scalable" machine learning, we really don't want to
> > > >> seal ourselves in to "only" 2 billion x 2 billion matrices in the long run,
> > > >> do we?
> > >>
> > >> How hard would it be to promote our ints to longs?
> > >>
> > >>  -jake
> > >>
> > >> On Sat, Dec 5, 2009 at 4:48 AM, Sean Owen <srowen@gmail.com> wrote:
> > >>
> > >>> I'm trying to use Vectors to represent a vector of user preferences.
> > > >>> All is well since items are numeric and can be used as indexes into a
> > > >>> Vector -- almost. I have longs, and of course indexes are ints.
> > >>>
> > >>> I could fold the long IDs into ints without too much worry about the
> > >>> effects of collision. However I still need to remember the original
> > >>> item IDs for each index. I could do it with labels, but I can't
> > >>> retrieve the label for an index (and the other mapping isn't
> > >>> serialized anyway?).
> > >>>
> > >>> So I guess I must separately store this mapping? Just making sure I'm
> > >>> not missing something.
> > >>>
> > >>
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> > Solr/Lucene:
> > http://www.lucidimagination.com/search
> >
> >
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
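
A rough sketch of the hashing trick with multiple hashing that Ted
mentions in the thread above: each feature name is hashed several times
into a fixed-size weight array, so a collision on one probe does not wipe
out the feature entirely. This is an illustration of the general idea
only; it is not Vowpal Wabbit's or Mahout's implementation, and the class
`HashedFeatureVector`, its methods, and the use of CRC32 as the hash are
all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration, not library code: a fixed-size vector
// addressed by hashed feature names, with several probes per feature.
final class HashedFeatureVector {
    private final double[] weights;
    private final int numProbes;

    HashedFeatureVector(int size, int numProbes) {
        this.weights = new double[size];
        this.numProbes = numProbes;
    }

    // Index of a feature under a given probe; the probe number is mixed
    // into the hashed bytes so each probe behaves like a separate hash.
    int indexFor(String feature, int probe) {
        CRC32 crc = new CRC32();
        crc.update((probe + ":" + feature).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % weights.length);
    }

    // Spread the value evenly over all probe locations for this feature.
    void add(String feature, double value) {
        for (int p = 0; p < numProbes; p++) {
            weights[indexFor(feature, p)] += value / numProbes;
        }
    }

    double get(int index) {
        return weights[index];
    }

    int size() {
        return weights.length;
    }
}
```

No dictionary of feature names is stored, which is where the memory
saving comes from; the cost is that the mapping cannot be inverted without
keeping a separate reverse table.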
