mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jake Mannix <jake.man...@gmail.com>
Subject Re: Odd vector iteration behavior
Date Mon, 15 Apr 2013 19:08:27 GMT
On Mon, Apr 15, 2013 at 11:58 AM, Robin Anil <robin.anil@gmail.com> wrote:

> This is what I propose:
>
> 1) Allow setting value to zero while iterating (e.set(0.0)).
>

This is in addition to the fact that we already allow setting nonzero values
while iterating, right?


> 2) Do not allow callers to use vector.set(index, 0.0) during iterating).
> This can cause re-hashing. (Can set a dirty bit in the hashmap during
> rehash to throw a concurrent modified exception)
>

Agreed - this is a commonly accepted requirement: I think in fact we
should pro-actively throw ConcurrentModificationException if someone
tries to call vector.set / vector.assign while iterating.


> 3) Update the numNonDefaultElements to iterate over the array to discount
> 0.0 instead of returning the hashMap values.
> 4) IterateNonZero may iterate over a few zeros if you did set the dimension
> to 0. Most of the statistics code should handle 0 values correctly.
>

Yeah, are we really strict about getNumNonDefaultElements really always
returning exactly the number of nonzeroes?  I was under the impression that
for e.g. DenseVector, it would give the overal size, even if some were 0,
and that it was basically tracking the amount of space the vector was taking
up.  But I can see the argument that it really should return what it says it
returns, if that is relied upon.


>
>
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Mon, Apr 15, 2013 at 1:50 PM, Jake Mannix <jake.mannix@gmail.com>
> wrote:
>
> > Ah, this was the one corner case I was worried about - we do special-case
> > setting to 0,
> > as meaning remove from the hashmap, yes.
> >
> > What's the TL;DR of what you did to work around this?  Should we allow
> > this?  Even
> > if it's through the Vector.Element instance, should it be ok?  If so, how
> > to handle?
> >
> >
> > On Mon, Apr 15, 2013 at 11:04 AM, Robin Anil <robin.anil@gmail.com>
> wrote:
> >
> > > I am adding the tests and updating the patch.
> > >
> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> > >
> > >
> > > On Mon, Apr 15, 2013 at 1:03 PM, Robin Anil <robin.anil@gmail.com>
> > wrote:
> > >
> > > > You can re-iterate if the state is in iteration. But you cannot
> write.
> > > >
> > > > This is what is happening:
> > > >
> > > > One of the values are becoming 0. So Vector tries to remove it from
> the
> > > > underlying hashmap. This changes the layout, if a vector has to be
> > > mutated
> > > > while iterating, we have to set 0 value in the hashmap and not remove
> > it
> > > > like what the Vector layer is doing. This adds another complexity,
> the
> > > > vector iterator has to deal with skipping over elements with 0 value.
> > > >
> > > >
> > > > Try this
> > > >
> > > > Create a vector of length 13 and set the following values.
> > > >
> > > >
> > > >    1.     double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1,
> > 1,
> > > >    2, 1 };
> > > >    2.     for (int i = 0; i < val.length; ++i) {
> > > >    3.       vector.set(i, val[i]);
> > > >    4.     }
> > > >
> > > > Iterate again and while iterating set one of the values as zero.
> > > >
> > > > On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon <
> > > dangeorge.filimon@gmail.com
> > > > > wrote:
> > > >
> > > >> What kind of Vector is failing to set() in that code?
> > > >>
> > > >> About the state enum, what if (for whatever reason, not
> > > >> multi-threaded-ness) there are multiple iterators to that vector?
> > > >> Something like a reference count (how many iterators point to it)
> > would
> > > >> probably be needed, and keeping it sane would only be possible in
> one
> > > >> thread. Although this seems kind of brittle.
> > > >>
> > > >> +1 for numNonDefault.
> > > >>
> > > >>
> > > >> On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <robin.anil@gmail.com>
> > > wrote:
> > > >>
> > > >>> Another behavior difference.
> > > >>>
> > > >>> The numNonDefaultElement for a DenseVector returns the total
> length.
> > > >>> This causes Pearson Correlation Similarity to differ from if it
was
> > > >>> implemented using on of the SparseVector.
> > > >>> I am proposing to fix the numNonDefaultElement to correctly iterate
> > > over
> > > >>> the dense vector to figure out non zero values ? Sounds ok
> > > >>>
> > > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> > > >>>
> > > >>>
> > > >>> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <robin.anil@gmail.com
> > > >wrote:
> > > >>>
> > > >>>> Found the bug PearsonCorrelationSimilarity was trying to mutate
> the
> > > >>>> object while iterating.
> > > >>>>
> > > >>>>
> > > >>>>    1.     while (it.hasNext()) {
> > > >>>>    2.       Vector.Element e = it.next();
> > > >>>>    3.       *vector.set(e.index(),* e.get() - average);
> > > >>>>    4.     }
> > > >>>>
> > > >>>> This has a side effect of causing the underlying hash-map
or
> object
> > to
> > > >>>> change.
> > > >>>>
> > > >>>> The right behavior is to set the value of the index while
> iterating.
> > > >>>>
> > > >>>>    1.     while (it.hasNext()) {
> > > >>>>    2.       Vector.Element e = it.next();
> > > >>>>    3.       *e.set(e.get()* - average);
> > > >>>>    4.     }
> > > >>>>
> > > >>>> I am sure we are incorrectly doing the first style across
the code
> > at
> > > >>>> many places.
> > > >>>>
> > > >>>> I am proposing this
> > > >>>>
> > > >>>> When iterating, we lock the set interface on the vector using
a
> > State
> > > >>>> enum. If anyone tries to mutate, we throw an exception.
> > > >>>> We flip the state when we complete iterating (hasNext = false)
or
> > when
> > > >>>> we explicitly close the iterator (adding a close method on
the
> > > iterator).
> > > >>>>
> > > >>>> Again this is all a single thread fix. if a vector is being
> mutated
> > > and
> > > >>>> iterated across multiple threads, all hell can break loose.
> > > >>>>
> > > >>>> Robin
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <
> robin.anil@gmail.com
> > > >wrote:
> > > >>>>
> > > >>>>> Spoke too soon still failure.  I am uploading the latest
patch.
> > These
> > > >>>>> are the current failing tests.
> > > >>>>>
> > > >>>>>
> > >
> >
>  ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> > > >>>>> not expecting cluster:{0:1.0,1:1.0}
> > > >>>>>
> > > >>>>>
> > >
> >
> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> > > >>>>> not expecting cluster:{0:1.0,1:1.0}
> > > >>>>>
> > > >>>>>
> > >
> >
> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> > > >>>>> null
> > > >>>>>
> > > >>>>>
> > >
> >
> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> > > >>>>> null
> > > >>>>>
> > > >>>>>
> > >
> >
> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88
> > > >>>>> expected:<0.5303300858899108> but was:<0.38729833462074176>
> > > >>>>>
> > > >>>>>
> > > >>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google
Inc.
> > > >>>>>
> > > >>>>>
> > > >>>>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <
> robin.anil@gmail.com
> > > >wrote:
> > > >>>>>
> > > >>>>>> Found it, fixed it. I am submitting soon.
> > > >>>>>>
> > > >>>>>> Robin Anil | Software Engineer | +1 312 869 2602 |
Google Inc.
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning <
> > > ted.dunning@gmail.com>wrote:
> > > >>>>>>
> > > >>>>>>> Robin,
> > > >>>>>>>
> > > >>>>>>> Can you make sure that the patches are somewhere
that Dan can
> > pick
> > > >>>>>>> up this
> > > >>>>>>> work?  He is in GMT+2 and is probably about to
appear on the
> > scene.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <
> > robin.anil@gmail.com>
> > > >>>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>> > Strike that there are still failures. Investigating.
if I
> cant
> > > fix
> > > >>>>>>> it in
> > > >>>>>>> > the next hour, I will submit them sometime
in the evening
> > > tomorrow.
> > > >>>>>>> >
> > > >>>>>>> > Robin Anil | Software Engineer | +1 312 869
2602 | Google
> Inc.
> > > >>>>>>> >
> > > >>>>>>> >
> > > >>>>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil
<
> > > robin.anil@gmail.com>
> > > >>>>>>> wrote:
> > > >>>>>>> >
> > > >>>>>>> > > Tests pass. Submitting the patches.
> > > >>>>>>> > >
> > > >>>>>>> > > Robin Anil | Software Engineer | +1
312 869 2602 | Google
> > Inc.
> > > >>>>>>> > >
> > > >>>>>>> > >
> > > >>>>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin
Anil <
> > > >>>>>>> robin.anil@gmail.com>
> > > >>>>>>> > wrote:
> > > >>>>>>> > >
> > > >>>>>>> > >> Added a few more tests. Throw NoSuchElementException
like
> > Java
> > > >>>>>>> > >> Collections when iterating past
the end. Things look
> solid,
> > > >>>>>>> performance
> > > >>>>>>> > is
> > > >>>>>>> > >> 2x. All Math tests pass. I am now
waiting for the entire
> > test
> > > >>>>>>> suites to
> > > >>>>>>> > run
> > > >>>>>>> > >> before submitting.
> > > >>>>>>> > >>
> > > >>>>>>> > >> Robin Anil | Software Engineer |
+1 312 869 2602 | Google
> > > Inc.
> > > >>>>>>> > >>
> > > >>>>>>> > >>
> > > >>>>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM,
Robin Anil <
> > > >>>>>>> robin.anil@gmail.com>
> > > >>>>>>> > wrote:
> > > >>>>>>> > >>
> > > >>>>>>> > >>> I am not sure what I did. But
removing Guava Abstract
> > > iterator
> > > >>>>>>> actually
> > > >>>>>>> > >>> sped up the dot, cosine, euclidean
by another 60%. Things
> > are
> > > >>>>>>> now 2x
> > > >>>>>>> > faster
> > > >>>>>>> > >>> than trunk. While also correcting
the behavior (I hope)
> > > >>>>>>> > >>>
> > > >>>>>>> > >>>
> > > >>>>>>> > >>>
> > > >>>>>>> >
> > > >>>>>>>
> > >
> >
> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
> > > >>>>>>> > >>>
> > > >>>>>>> > >>> Robin Anil | Software Engineer
| +1 312 869 2602 |
> Google
> > > Inc.
> > > >>>>>>> > >>>
> > > >>>>>>> > >>>
> > > >>>>>>> > >>> On Sun, Apr 14, 2013 at 8:56
PM, Robin Anil <
> > > >>>>>>> robin.anil@gmail.com
> > > >>>>>>> > >wrote:
> > > >>>>>>> > >>>
> > > >>>>>>> > >>>> Also note that this is code
gen, I have to create
> > > >>>>>>> > Element$keyType$Value
> > > >>>>>>> > >>>> for each and every combination
not just int double. and
> > also
> > > >>>>>>> update
> > > >>>>>>> > all
> > > >>>>>>> > >>>> callers to user ElementIntDouble
instead of Element. Is
> it
> > > >>>>>>> worth it ?
> > > >>>>>>> > >>>>
> > > >>>>>>> > >>>> Robin Anil | Software Engineer
| +1 312 869 2602 |
> Google
> > > >>>>>>> Inc.
> > > >>>>>>> > >>>>
> > > >>>>>>> > >>>>
> > > >>>>>>> > >>>> On Sun, Apr 14, 2013 at
8:46 PM, Ted Dunning <
> > > >>>>>>> ted.dunning@gmail.com
> > > >>>>>>> > >wrote:
> > > >>>>>>> > >>>>
> > > >>>>>>> > >>>>> Collections (no longer
colt collections) are now part
> of
> > > >>>>>>> mahout math.
> > > >>>>>>> > >>>>>  No
> > > >>>>>>> > >>>>> need to keep them separate.
 The lower iterator can
> > > reference
> > > >>>>>>> > >>>>> Vector.Element
> > > >>>>>>> > >>>>>
> > > >>>>>>> > >>>>>
> > > >>>>>>> > >>>>> On Sun, Apr 14, 2013
at 6:24 PM, Robin Anil <
> > > >>>>>>> robin.anil@gmail.com>
> > > >>>>>>> > >>>>> wrote:
> > > >>>>>>> > >>>>>
> > > >>>>>>> > >>>>> > I would have loved
to but Element is a sub interface
> in
> > > >>>>>>> Vector. If
> > > >>>>>>> > >>>>> we want
> > > >>>>>>> > >>>>> > to keep colt collections
separate we have to keep
> this
> > > >>>>>>> separation.
> > > >>>>>>> > >>>>> >
> > > >>>>>>> > >>>>>
> > > >>>>>>> > >>>>
> > > >>>>>>> > >>>>
> > > >>>>>>> > >>>
> > > >>>>>>> > >>
> > > >>>>>>> > >
> > > >>>>>>> >
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > > >
> > >
> >
> >
> >
> > --
> >
> >   -jake
> >
>



-- 

  -jake

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message