mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jake Mannix <jake.man...@gmail.com>
Subject Re: Map<Vector, ?>
Date Fri, 24 Feb 2012 01:26:18 GMT
On Thu, Feb 23, 2012 at 4:38 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> D,
>
> I think that the current idea on the table is to do essentially nothing
> except carefully examine and possibly remove cases of tables or sets with
> Vector keys.
>

Yes, that's what I'm suggesting.  If you grep through our code for
Map<Vector
or Map<DenseVector, or Map<SparseVector, - there are a few cases, and we
might want to see if any of these are sources of inefficiency in this sneaky
way.

  -jake


>
> What specifically are you arguing against?
>
> On Fri, Feb 24, 2012 at 12:25 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >wrote:
>
> > Right. Ok.
> >
> > just in case, here's my updated list of cons ...
> >
> > 1) if people explicitly happen to write a.equals(b) where they just as
> > easily could've written a==b if they were ok with identity comparison,
> > it is pretty surely means they expect equals-by-value and not identity
> > comparison.
> > 2) same goes for HashMap/IdentityHashMap (with much less obviousness
> > though).
> >
> > 3) We probably don't want to prohibit the use of vectors as MR map
> > output keys (although i don't immediately see use cases for it), that
> > means we have to support comparable-by-value, which would be
> > contractually incoherent with equals-by-identity at the same time.
> >
> > 4) What Ted said about precaching hash codes for Strings I seem to be
> > able to easily convert into another cons: apparently best practice
> > suggests to go extra step to optimize hash coding rather than to drop
> > equals-by-value, hashCode-by-value. There are techniques to improve
> > that (e.g. same identity comparison can be had since if a==b then it
> > also true that a.equals(b) also true. Hashcoding can be indeed cached
> > (with caveat that since vector really is not immutable, then it needs
> > to maintain lazy cashing and cache cancellation simiar to
> > GregorianCalendar stuff, but that's probably a hassle).
> >
> > I guess we can look into some hash coding optimization if it seems
> > compelling that we do so. well i guess Sean is saying that too.
> >
> > On Thu, Feb 23, 2012 at 2:31 PM, Sean Owen <srowen@gmail.com> wrote:
> > > I think equals() and hashCode() ought to be retained at least for
> > > consistency. There is no sin in implementing these per se. If
> hashCode()
> > is
> > > too slow in some context then it can be cached if it is so important.
> > This
> > > is what String does.
> > >
> > > However I doubt these should be used as keys in general - that is the
> > issue
> > > to be fixed if anything if there is a performance problem.
> > >
> > > Do you want people to use equals()? Dunno it's up to the caller really.
> > >
> > > Sean
> > > On Feb 23, 2012 5:25 PM, "Jake Mannix" <jake.mannix@gmail.com> wrote:
> > >
> > >> Hey Devs.
> > >>
> > >>  Was prototyping some stuff in Mahout last night, and noticed
> something
> > >> I'm not sure if we've talked about before: because we have equals()
> for
> > >> Vector instances return true iff the numeric values of the vectors are
> > >> equal, and we also have a consistent hashCode(), anytime you have
> > >> HashMap<Vector, Anything>, all the typical things you think are O(1)
> are
> > >> really O(vector.numNonZeroes()).  I tried to look through the codebase
> > and
> > >> see where we hang onto maps with vector keys, and we do it sometimes.
> > >>  Maybe we shouldn't?  Most Vectors have identities (clusterId,
> > documentId,
> > >> topicId, etc...) which we could normalize away... or maybe we should
> be
> > >> using IdentityHashMap, to ensure you're using strict object identity
> and
> > >> avoid doing this calculation?  This could be really slow if these are
> > big
> > >> dense vectors, for instance.
> > >>
> > >>  This looks like it could be a really easy place to accidentally add
> > heavy
> > >> complexity to things.  Do we really want people do be checking
> > >> *mathematical* equals() on vectors which have floating point
> precision?
> > >>
> > >>  -jake
> > >>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message