lucenenet-dev mailing list archives

From "Van Den Berghe, Vincent" <Vincent.VanDenBer...@bvdinfo.com>
Subject RE: Bug in Lucene static initialization with multiple threads.
Date Thu, 23 Mar 2017 13:00:11 GMT
Hello Shad,

Let me make one more effort to convince you that the test is wrong: it was changed to its
"correct" version in version 4.9:

https://github.com/apache/lucene-solr/commit/c59f13f9918faeeb4e69acd41731e674ce88f912

(LUCENE-5623: fix test to be reproducible and more evil)

Ah well, I tried.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Thursday, March 23, 2017 1:28 PM
To: Van Den Berghe, Vincent <Vincent.VanDenBerghe@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Sorry for the confusion. You have made it to the right version tag, but it turns out there is more than one TestEarlyTermination class. The one I am referring to is in the Misc subproject, not in the Core project.

But I think that the answer to fixing that test lies somewhere deep in the core project (or
perhaps even in a subproject). It is quite possible that finding the answer to one of the
other tests will also fix TestEarlyTerminationDifferentSorter(). It's also possible the piece
of functionality that is broken has no tests of its own, which would explain why the issue
is so hard to find.

I tried making a non-random test for this scenario before, but I couldn't find the correct
set of input to make the error happen. The error happens rarely. But, the test does enough
random loops to make it happen frequently. I wish there were a tool in Lucene.Net that would
spit out the code for the exact same scenario that made the test fail, which would make it
far easier to debug. When I stepped through the code, sometimes it failed on the second iteration,
sometimes the third, sometimes the fourth, etc. But after it fails, it is nearly impossible
to figure out the data that was in the index and the random components that were plugged in
to make it fail.

So, we now have a big clue thanks to you - it will never fail on the first iteration. Hard
coding it to 2 iterations makes it fail roughly 50% of the time. Now I suppose it is just
a matter of going through and finding the random piece of input that makes it fail 100% of
the time.

I also just found another clue. When the "different sorter" circumstance happens, the EarlyTerminatingSortingCollector
changes its AcceptDocsOutOfOrder property from always being false to being the value of the
wrapped ICollector. This is the only difference between the test that always passes and the
one that randomly fails. So, either somebody is lying that they accept docs out of order (anybody
who returns true is suspect), or the code that accepts the docs out of order is munged.
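
Roughly, the behavior I'm describing looks like this (a sketch from memory, not the actual source; the field names are approximations):

    public override bool AcceptsDocsOutOfOrder
    {
        get
        {
            // segment sorted with the expected sort: always false (early termination path)
            // "different sorter": delegate to the wrapped collector, so anybody claiming
            // "true" suddenly matters
            return !segmentSorted && m_in.AcceptsDocsOutOfOrder;
        }
    }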

Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Thursday, March 23, 2017 1:41 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

Insanity is knowing the answer to life, the universe and everything, and suddenly realizing
that it's not 42.
Insanity is knowing that api-work is based on lucene 4.8.0 but yet it isn't. But it is. But
it isn't.

I don't understand it, but I must be living in a parallel universe. Or perhaps you are. Or
we both are. Maybe it's the matrix.

You refer to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/misc/src/test/org/apache/lucene/index/sorter/TestEarlyTermination.java#L131
as the code upon which the test is based, and I can't but acknowledge that it is so.

I do the following:
Go to https://github.com/apache/lucene-solr/blob/master/lucene/core/src/test/org/apache/lucene/search/TestEarlyTermination.java
On the page, click on the 'branch: master' button, click on 'tags', and choose 'release/lucene-solr/4.8.0'
The url switches to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/test/org/apache/lucene/search/TestEarlyTermination.java


"My" TestEarlyTermination.java is different from yours.

I give up. Stick a fork in me, I'm done.

Turn the lights off when you're leaving.

Vincent



-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Wednesday, March 22, 2017 6:58 PM
To: Van Den Berghe, Vincent <Vincent.VanDenBerghe@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Vincent,

Thanks for the explanation. However, you have linked to the wrong versions of the documents and code. If you look at the version 4.8.0 tag instead of master (which is now at version 6.4.2), you can see that the code was reproduced faithfully: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/misc/src/test/org/apache/lucene/index/sorter/TestEarlyTermination.java#L131

The documentation link you posted was from a previous version, but that comment didn't change
in 4.8.0: https://lucene.apache.org/core/4_8_0/core/org/apache/lucene/search/TopDocsCollector.html#topDocs()

I tried what you said and it made the test pass reliably, but that isn't what the test in
Java does. Also, I have been seeing random errors from other places and in all cases the code
was calling Util.PriorityQueue<T>.UpdateBottom() and then afterward gets a null result
from the heap. That said, it is still too early to tell if the entirety of that issue was
due to the IndexSearcher issue that was just fixed. If I see another occurrence of that issue,
I will send the details your way.

Anyway, it is tempting to just move the instantiation of the collectors inside the loop and be done with it, or, as I previously discovered, to just put try-catch statements around the offending lines in TopDocsCollector and swallow the null reference exception (which also makes the test pass). But that isn't what happens in Lucene, which indicates there is a bug. Sure, it is bad that in Java they didn't program defensively and check for null before accessing its members (bad Java programmer!), and maybe these tests were not written according to the documentation, but the fact of the matter is that Lucene is not supposed to throw this exception in this case.

You have given me more to consider, though. I previously theorized that it was the part of
the code that is populating the queue that was failing, but it may be that there is a failure
to reset the "full" state somewhere during the next search, which is in turn causing this
failure to happen. It is strange that it only happens in the "different sorter" case, though, because the tests are nearly identical in every other sense.


> I'm getting something different from TestSearcherManager_Mem --> cannot close: there are still open files: _1.cfs ,_0.cfs

You know what they say, the definition of insanity is doing the same thing twice and expecting
a different result. So, I guess that makes me insane because that is exactly what these randomized
tests do.

I saw this issue before, and it boils down to not disposing something appropriately, or not disposing things in the right order. It especially happens in the test environment when an assert fails, the Dispose() method doesn't get called as a result, and then the next run or two attempts to recover. Sometimes changing the test to use using blocks works, but other times it screws up the test. Of course you would always use using blocks in production code, but you have to take into consideration that these tests were written by the insane.
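
Just to be clear, the production-style pattern I mean is nothing more than this (the path, field name, and analyzer here are placeholders, not taken from any test; namespaces: Lucene.Net.Store, Lucene.Net.Index, Lucene.Net.Documents, Lucene.Net.Analysis.Standard, Lucene.Net.Util, System.IO):

    using (var dir = FSDirectory.Open(new DirectoryInfo("/path/to/index")))
    using (var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48))
    using (var writer = new IndexWriter(dir, new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)))
    {
        var doc = new Document();
        doc.Add(new TextField("body", "hello world", Field.Store.YES));
        writer.AddDocument(doc);
    } // Dispose() runs in the correct (reverse) order even if an exception or failed assert fires inside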

Even if you don't fix the test with using blocks, running the test again usually bypasses
this message and gets you to another one.


Thanks,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 9:48 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

"King of the one-liners". Hmmm... maybe I'll order a T-shirt with that description.

Regarding TestEarlyTerminationDifferentSorter, I think the error is in the test, not in the
code being tested.
Namely, the following:

                AssertTopDocsEquals(collector1.GetTopDocs().ScoreDocs, collector2.GetTopDocs().ScoreDocs);

What happens is the following: at most numHits results are collected in the collector. This is done with a priority queue of at most numHits size. That priority queue keeps track of the numHits (or fewer) "best matched" documents. If there are more than numHits total hits, and there's a better-matching document than the lowest-matching document in the priority queue, that lowest-matching document is ousted from the queue and the better-matching document is put in the queue instead (at its proper priority).
The fact that the queue is full is tracked with the TopFieldCollector.queueFull Boolean. Once it's true, the queue is full and the above process kicks in.
Now, the call to GetTopDocs() returns at most numHits top documents. This is done by getting them from the queue in TopDocsCollector (TopFieldCollector's base class):

        protected virtual void PopulateResults(ScoreDoc[] results, int howMany)
        {
            for (int i = howMany - 1; i >= 0; i--)
            {
                results[i] = m_pq.Pop();
            }
        }

If the priority queue contains howMany or fewer documents, the queue is empty after that operation.
However, all the other state variables in the TopFieldCollector (totalHits, queueFull, numHits, docBase) are left as-is. So, even though the queue is empty, queueFull is still true from the previous run. The system will try to replace a nonexistent element, causing the null reference.
The moral of the story is "after calling collector.GetTopDocs(), the collector object cannot be reused".

In fact, this is confirmed by the documentation: https://lucene.apache.org/core/4_2_0/core/org/apache/lucene/search/TopDocsCollector.html#topDocs()

"NOTE: you cannot call this method more than once for each search execution. If you need to
call it more than once, passing each time a different start, you should call topDocs() and
work with the returned TopDocs object, which will contain all the results this search execution
collected."

What happens in the test is wrong: the collector should be re-created at every iteration.
Interestingly, if you look at the java source https://github.com/apache/lucene-solr/blob/master/lucene/core/src/test/org/apache/lucene/search/TestEarlyTerminatingSortingCollector.java
you will see that the testEarlyTermination does exactly that, and that testEarlyTerminationDifferentSorter
is quite different.

So I think this should be rewritten as:

        public void TestEarlyTerminationDifferentSorter()
        {
            // test that the collector works correctly when the index was sorted by a
            // different sorter than the one specified in the ctor.
            CreateRandomIndexes(5);

            IndexSearcher searcher = NewSearcher(reader);
            int iters = AtLeast(5);
            for (int i = 0; i < iters; ++i)
            {
                int numHits = TestUtil.NextInt(Random(), 1, numDocs / 10);
                Sort sort = new Sort(new SortField("ndv2", SortFieldType.INT64, false));
                bool fillFields = Random().nextBoolean();
                bool trackDocScores = Random().nextBoolean();
                bool trackMaxScore = Random().nextBoolean();
                bool inOrder = Random().nextBoolean();
                TopFieldCollector collector1 = Search.TopFieldCollector.Create(sort, numHits, fillFields, trackDocScores, trackMaxScore, inOrder);
                TopFieldCollector collector2 = Search.TopFieldCollector.Create(sort, numHits, fillFields, trackDocScores, trackMaxScore, inOrder);
                TermQuery query = new TermQuery(new Term("s", RandomInts.RandomFrom(Random(), terms)));
                searcher.Search(query, collector1);
                Sort different = new Sort(new SortField("ndv2", SortFieldType.INT64));
                searcher.Search(query, new EarlyTerminatingSortingCollectorHelper(collector2, different, numHits));

                assertTrue(collector1.TotalHits >= collector2.TotalHits);
                AssertTopDocsEquals(collector1.GetTopDocs().ScoreDocs, collector2.GetTopDocs().ScoreDocs);
            }
        }

... and the same for the other test.

I'm getting something different from TestSearcherManager_Mem: 

Test Name:	TestSearcherManager_Mem
Test Outcome:	Failed
Result Message:	
System.Exception : MockDirectoryWrapper: cannot close: there are still open files: _1.cfs ,_0.cfs
  ----> System.Exception : unclosed IndexSlice: _5.cfs

... but perhaps Github is playing tricks on me again.

Vincent



-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Wednesday, March 22, 2017 2:00 PM
To: Van Den Berghe, Vincent <Vincent.VanDenBerghe@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Vincent,

I am now going to have to call you "King of the one-liners", for such a quick fix to so many
failing tests. I am not sure how I missed that one - I think I was staring at the screen for
too long. I moved the return true statement inside the try block, but I suppose it doesn't
really make any difference in this case.

So, here is what is left that is failing:

Lucene.Net.Tests.Util.Automaton.TestAgainstBrzozowski() (100% of the time)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen() (~100% of the time)
Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter() (~75% of the time - down from ~90% of the time after the recent fix)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem() (~66% of the time - down from ~95% of the time after the recent fix)
Lucene.Net.Tests.Search.TestSearcherManager.TestSearcherManager_Mem() (~40% of the time - down from ~85% of the time after the recent fix)
Lucene.Net.Tests.Analysis.Common.Analysis.In.TestIndicNormalizer.TestBasic() (only seems to happen during a full run)

You read that correctly - there are now only 2 tests failing all of the time. There does seem
to be light at the end of the tunnel!

I suspect whatever is causing the CRT tests to fail is also what is causing TestEarlyTerminationDifferentSorter()
and TestSearcherManager_Mem() to fail. Most of these failures seem to be centered around IndexSearcher/Util.PriorityQueue<T>
somehow.

I have already reviewed the ThreadedIndexingAndSearchingTestCase as well as the TestControlledRealTimeReopenThread
class and fixed some minor differences from Lucene, but none of them affected the result.
I also reviewed ControlledRealTimeReopenThread, ReferenceManager<T> and SearcherManager,
fixed some bugs, but still no change.

As for TestAgainstBrzozowski(), here are my debug notes as well as a non-random test I made,
and the result of the test in Java from my last attempt: https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Tests/Util/Automaton/TestMinimize.cs#L64-L274.
There are several things that depend on Automaton to work, and since none of them have failing
tests, I am not all that concerned about this "difference". But it would be nice to find an
answer once all of the real problems are resolved.

Anyway, I am going to keep poking at it to see if I can spot the cause(s), and would appreciate
any (more) help you can provide.


Thanks again,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 3:46 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello again Shad,

I found another bug in ExecutionHelper<T>.MoveNext(): the method was always returning "false", even when an element was available.
The correction is to add "return true;" where indicated:


            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        var awaitable = service.Take();
                        awaitable.Wait();
                        current = awaitable.Result;

                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException e)
                    {
                        throw new System.Threading.ThreadInterruptedException(e.ToString(), e);
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                    return true; // *********** here *******************
                }

                return false;
            }

The TestDuringAddIndexes test passes now. No files, just the one line.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 7:41 PM
To: Van Den Berghe, Vincent <Vincent.VanDenBerghe@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Thanks for this Vincent. It's a big help.

Unfortunately, these three lines didn't quite do the trick, but they did change the error
message of TestDuringAddIndexes() to a null reference exception from the Util.PriorityQueue
(which I have also been seeing randomly, most notably from Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter()).

Do note a much simpler test to work with is the Lucene.Net.Tests.Memory.Index.Memory.MemoryIndexTest.TestRandomQueries(),
which (both before and after adding these 3 lines) is returning a 0 result when it should
be 1 on the search for the term "Apache" (the first test) from IndexSearcher + TaskScheduler.

Since all of this is self-contained within the IndexSearcher, changing the design to be better
is certainly not out of the question. I mean, there would be no public API change if we had
to eliminate any or all of the private classes (including the ExecutorService<T>) in
favor of a more .NET-friendly implementation. As long as the external behavior is the same,
we don't much care how it is implemented internally. Would you be interested in providing
a working solution?

The WeakIdentityTests are now passing. So, I will try swapping for the WeakDictionary to see
if it eliminates your static initialization issue.

That said, there is one minor thing about the WeakIdentityMap that was overlooked. The enumerator
was changed in Lucene 4.x to also "reap on read": https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java#L205,
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/CHANGES.txt#L2489-L2492.
This is one thing that made it tricky to work with because the original implementation only
allows a single read() and then the object is no longer available. This would be extremely
unorthodox for a property (i.e. Current). It is also supposed to exit out of the loop if the
last entry is unavailable to the current thread in the case where 2 of them call MoveNext()
and one beats the other to read Current and reap in a race condition (it does this by throwing
an exception in Java). I will attempt to get it working this way, but at least now we can
settle for just passing instead of ideal behavior if we can't get it to work exactly this
way.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 11:59 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

Regarding TestDuringAddIndexes, you found 99.9% of the problem. Allow me to contribute to
the remaining 0.1%.
There is indeed nothing wrong with the limited concurrencylevel task scheduler. The problem
is the ExecutionHelper<T>'s method of obtaining the results.
I see this:

            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        current = service.Take().Result;
                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException)
                    {
                        throw;
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                }

                return false;
            }

The call to service.Take() returns an awaitable task, which is nothing but a promise that
the result will be delivered at some future time.
Alas, the enumerator is strictly synchronous and expects the result to be available at each
iteration. 
Accessing Result without waiting for the task to end (as is done now) will yield an incorrect result.
The shortest solution is to replace:

	current = service.Take().Result;

By:

	var awaitable = service.Take();
	awaitable.Wait();
	current = awaitable.Result;

The best solution would be to await all tasks in parallel (or at least as many as possible) and return their values as soon as they become available. This is EXTREMELY hard to do, since you would essentially be fighting a battle between your executor service and its execution helper, so I leave that as an exercise for the reader.
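
If you want an idea of the shape such a solution could take, here is a rough sketch that yields results in completion order (just an illustration, not something wired into ExecutionHelper<T>):

	// requires System.Collections.Generic, System.Linq and System.Threading.Tasks
	static IEnumerable<T> InCompletionOrder<T>(IEnumerable<Task<T>> tasks)
	{
		var pending = tasks.ToList();
		while (pending.Count > 0)
		{
			// block until whichever task finishes first, then hand its result back
			Task<T> finished = Task.WhenAny(pending).GetAwaiter().GetResult();
			pending.Remove(finished);
			yield return finished.GetAwaiter().GetResult(); // rethrows if the task faulted
		}
	}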

No files to send this time, it's just these 3 lines.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 3:42 PM
To: Van Den Berghe, Vincent <Vincent.VanDenBerghe@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Sure, you can email the files to me directly.

For a quick start on Git/GitHub, there is a fairly short book called Pragmatic Version Control
Using Git (https://pragprog.com/book/tsgit/pragmatic-version-control-using-git) that gets
you up and running quickly. I think you might be attempting to push to the main repo - and
you won't have permission unless it is explicitly granted. What you need to do is to fork
the repo to your own GitHub account, then you can read/write it as much as necessary. Once
you get it to a point where you want to submit something, you can do a pull request (either
through GitHub or just manually email a request) and someone else can then review and merge
the changes.

Update

I found the source of the Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes()
problem - it always occurs when you call an overload of the IndexSearcher constructor that
takes a TaskScheduler as a parameter and pass a non-null value. This is built into the test
framework (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.TestFramework/Util/LuceneTestCase.cs#L1778)
to happen rarely, which explains many of the random failures we are seeing. If you change
the random code to never use a TaskScheduler, the test will always pass, change it to always
use a TaskScheduler and it will always fail.

The implementation of TaskScheduler we are using for testing (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Support/LimitedConcurrencyLevelTaskScheduler.cs)
was copied directly from MSDN, so I doubt that is the issue. In fact, there is a good chance
that the issue is similar to the WeakIdentityMap issue in that there is an enumerator (or a call to an enumerator) that is not thread-safe; see (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L474-L500) and (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L569-L590).

Anyway, I know of at least 3 tests that are failing as a result of this, so fixing it would
be a big prize.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 8:51 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

I had a little time on my hands and looked into this WeakIdentityMap issue, more specifically
TestConcurrentHashMap which fails for me as well (in 100% of the cases). Maybe I have something
to contribute: I have 3 observations:

First, notice that in TestConcurrentHashMap , 8 threads are created and then all joined by
doing the following:

            finally
            {
                foreach (var w in workers)
                {
                    w.Join(1000L);
                }
            }

This gives the first thread 1 second to end, the second one at most 2 seconds (1 second +
whatever time the first thread needed to end) and so on.  Given the amount of work of each
test thread, this is far too little time even on a fast machine. It takes 13 seconds for all
threads to end here.
The corresponding java test has the following:

   while (!exec.awaitTermination(1000L, TimeUnit.MILLISECONDS));

... which in effect just loops until the execution of each thread is finished, in units of
1 second.
In TPL, the threads would be tasks and we could just call Task.WaitAll. Since we're dealing with "real" threads here, I would suggest just calling w.Join() and being done with it. This would align the test with the Java behavior.
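
That is, the finally block simply becomes:

            finally
            {
                foreach (var w in workers)
                {
                    w.Join(); // no timeout: block until each worker has actually finished
                }
            }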

Second, there are various weaknesses in the WeakIdentityMap:
1) the implementation of the Keys enumerator (IteratorAnonymousInnerClassHelper) relies on
the order of the elements in the keys collection (outerInstance.backingStore.Keys.ElementAt(position)).
This is bad for two reasons:
- it is extremely slow (there is no indexed access on outerInstance.backingStore.Keys in any
current implementation, so ElementAt needs to skip "position" elements to get at the correct
one)
- it relies on the assumption that removing (or adding) a key in a dictionary doesn't change the relative order of the remaining keys, which is incorrect in any current .NET implementation I am aware of (dictionaries are hash tables with collision resolution through chaining, and reuse of slots through a free list: it's just asking for trouble).
It turns out that you can use the backing store enumerator to implement the keys enumerator
directly. The main loop simply becomes:

			public bool MoveNext()
			{
				while (enumerator.MoveNext())
				{
					next = enumerator.Current.Key.Target;
					if (next != null)
					{
						// unfold "null" special value:
						if (next == NULL)
							next = null;
						return true;
					}
				}
				return false;
			}

This works in the non-concurrent case (because we don't touch the collection while the enumerator
is running), and in the concurrent case as well (because the ConcurrentDictionary<K,V>
enumerator works by design and handles concurrent modifications without a problem).

2) calling Reap() creates objects on the heap, even when there are no elements to be removed.
Sadly, not all of these allocations can be eliminated, but you can delay the creation of the keysToRemove list until it's really needed:

			List<IdentityWeakReference> keysToRemove = null;
			foreach (IdentityWeakReference zombie in backingStore.Keys)
			{
				if (!zombie.IsAlive)
				{
					// create the list of keys to remove only if there are keys to remove.
					// this reduces heap pressure
					if (keysToRemove == null)
						keysToRemove = new List<IdentityWeakReference>();
					keysToRemove.Add(zombie);
				}
			}

			if (keysToRemove != null)
				foreach (var key in keysToRemove)
				{
					backingStore.Remove(key);
				}

Note that I don't iterate the Keys collection, but use the dictionary enumerator. Believe it or not, this is slightly more efficient, for reasons I won't explain here since this e-mail is already long enough.
It's sad but inevitable that a heap object is created for the dictionary enumerator, because we call it through an interface (IDictionary<K,V>): if we had the actual object, no enumerator object would be created on the heap.

3) Equality of weak identity references can be done using only one cast (using "as" instead of "is"), which is more efficient.


Third, the test itself uses enumerators in a nonstandard manner. The two witness cases are:

	IEnumerator<string> iter = map.Keys.GetEnumerator();
	Assert.IsTrue(iter.MoveNext());
	Assert.IsNull(iter.Current);
	Assert.IsFalse(iter.MoveNext());
	Assert.IsFalse(iter.MoveNext());

And

            for (IEnumerator<string> iter = map.Keys.GetEnumerator(); iter.MoveNext();)
            {
                //Assert.IsTrue(iter.hasNext()); // try again, should return same result!
                string k = iter.Current;
	...
            }

All the other instances are variants of these witnesses.
The correct way of using IEnumerator<T> is by calling IEnumerator<T>.Dispose()
after you're finished with the instance. Note that in Lucene itself, foreach() is used which
does it correctly (ByteBufferIndexInput.cs):

                    foreach (ByteBufferIndexInput clone in clones.Keys)
                    {
                        clone.UnsetBuffers();
                    }

All usages of enumerators in TestWeakIdentityMap.cs must be rewritten accordingly. For example:

	using (IEnumerator<string> iter = map.Keys.GetEnumerator())
	{
		Assert.IsTrue(iter.MoveNext());
		Assert.IsNull(iter.Current);
		Assert.IsFalse(iter.MoveNext());
		Assert.IsFalse(iter.MoveNext());
	}
And

            foreach (object k in map.Keys)
            {
	...
            }

In case you are wondering why this is so important: you cannot guarantee that future implementations of an enumerator (especially one on a concurrent collection) won't have cleanup to do to get rid of various synchronization objects. Right now this isn't the case, but you never know what the future will bring. And besides, nice guys Dispose() after their enumeration <g>.

The test passes now. Every time.

I've made the changes to api-work in my local repository, but when I tried to "push" or "sync"
them, I get :

	Error encountered while pushing to the remote repository: Response status code does not indicate success: 403 (Forbidden).

I know next to nothing about GitHub. Can I e-mail the changed files to someone?

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, March 20, 2017 9:19 PM
To: Van Den Berghe, Vincent <Vincent.VanDenBerghe@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Thanks for reporting this. In fact, thank you for all of your assistance tracking down bugs.

This issue boils down to being a failed attempt to replace Lucene's WeakIdentityMap with a
new data structure called WeakDictionary. Since there are already tests to verify concurrency
on WeakIdentityMap and it is used in a couple of other places in Lucene, it would be far better
to get it working right than to try to fix this alternative version. I guess for the time
being your workaround should suffice (though, a fix rather than a hack would be preferred).

I have spent quite a bit of time on this, but the best I have been able to do is to get the
Lucene.Net.Tests.Util.TestWeakIdentityMap.TestConcurrentHashMap() test to pass about 50% of
the time (and I can't seem to even get it back into that state). 

Here are a couple of attempts I have made:

https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-1 - using a port
of the original Java backing classes
https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-2 - using the
.NET WeakIdentity class

And here is the original Java version: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java


The complicated part is getting it to "reap" the elements in a thread-safe way so the counts
are right on several concurrent enumerators. Any assistance you could provide to make WeakIdentityMap
thread-safe would be much appreciated. Do note that the lead branch is now at https://github.com/apache/lucenenet/tree/api-work,
so please do any work from that branch.

Also note there are also currently a few other concurrency tests that are failing:

Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen()

I am sure that getting to the bottom of these issues will probably fix most of the issues
you are seeing. If you have any spare time, your help would be appreciated on these as well.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, February 7, 2017 6:00 PM
To: dev@lucenenet.apache.org
Subject: Bug in Lucene static initialization with multiple threads.

Hello,

Every once in a while, I get an error when using Lucene in a multithreaded scenario (meaning:
using a single IndexWriter in multiple threads, or using a distinct IndexWriter in each thread:
it doesn't matter).
The exception chain thrown is:

Unhandled Exception: System.ArgumentException: Could not instantiate implementing class for Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute
---> System.ArgumentException: Could not find implementing class for ICharTermAttribute
---> System.InvalidOperationException: Collection was modified; enumeration operation may not execute.

I could not understand what was going on, especially because it only occurred "sometimes".
It took me a while to figure out, but I think it's a bug.

Here's the stack trace of the exception when it occurs:

                [External Code]
>               Lucene.Net.dll!Lucene.Net.Support.HashMap<Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.WeakKey<System.Type>, System.WeakReference>.GetEnumerator() Line 229    C#
                [External Code]
                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean() Line 59    C#
                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.CleanIfNeeded() Line 71    C#
                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Add(System.Type key, System.WeakReference value) Line 134    C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.GetClassForInterface<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 90    C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.CreateAttributeInstance<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 70    C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AddAttribute<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 350    C#
                Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.InitializeInstanceFields() Line 658    C#
                Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.StringTokenStream() Line 676    C#
                Lucene.Net.dll!Lucene.Net.Documents.Field.GetTokenStream(Lucene.Net.Analysis.Analyzer analyzer) Line 629    C#
                Lucene.Net.dll!Lucene.Net.Index.DocInverterPerField.ProcessFields(Lucene.Net.Index.IndexableField[] fields, int count) Line 105    C#
                Lucene.Net.dll!Lucene.Net.Index.DocFieldProcessor.ProcessDocument(Lucene.Net.Index.FieldInfos.Builder fieldInfos) Line 279    C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriterPerThread.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 287    C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriter.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 574    C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.UpdateDocument(Lucene.Net.Index.Term term, System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1830    C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1455    C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc) Line 1436    C#

... and to wit, here are the threads just rushing in to do the same:

Not Flagged      35428  17  Worker Thread  <No Name>  Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean  Normal
Not Flagged      35444  11  Worker Thread  <No Name>  Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean  Normal
Not Flagged      44124  12  Worker Thread  <No Name>  Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean  Normal
Not Flagged  >   44140  13  Worker Thread  <No Name>  Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean  Normal
Not Flagged      47700  14  Worker Thread  <No Name>  Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean  Normal
Not Flagged      28168  15  Worker Thread  <No Name>  Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean  Normal
Not Flagged      30988  16  Worker Thread  <No Name>  Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean  Normal
Not Flagged      21828   6  Worker Thread  <No Name>  Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean  Normal

The reason why it only reproduces "sometimes" is because of this little nugget of code:

        private void CleanIfNeeded()
        {
            int currentColCount = GC.CollectionCount(0);
            if (currentColCount > _gcCollections)
            {
                Clean();
                _gcCollections = currentColCount;
            }
        }

If one thread does a Clean() operation in the middle of another Clean() operation on the same
collection that replaces the object being enumerated on, you get the exception. Always.
To take the intermittence out of it, create a bunch of threads like this and remove the "if (currentColCount > _gcCollections)" test so that the Clean() code always executes. You'll get the exception every time.

I will not post the correction, but there's a simple workaround: just make sure the static
initializers are performed in a single thread.
I.e. before creating your threads, do something like this:

new global::Lucene.Net.Documents.TextField("dummy", "dummyvalue", global::Lucene.Net.Documents.Field.Store.NO).GetTokenStream(someAnalyzer);

Replace "someAnalyzer" with an instance of an Analyzer object; it doesn't matter which one. The call itself is meaningless, but it has the side effect of initializing the static fields without problems.
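
In context, the warm-up runs once on the main thread before the workers start (a sketch; GetSomeAnalyzer() and IndexDocuments() are placeholders for whatever your code does):

	Analyzer analyzer = GetSomeAnalyzer(); // any Analyzer instance will do
	new global::Lucene.Net.Documents.TextField("dummy", "dummyvalue", global::Lucene.Net.Documents.Field.Store.NO)
		.GetTokenStream(analyzer); // forces the static attribute mapping to initialize on this thread

	var workers = new List<Thread>();
	for (int i = 0; i < 8; i++)
	{
		var t = new Thread(IndexDocuments); // each worker indexes with its own IndexWriter
		t.Start();
		workers.Add(t);
	}
	workers.ForEach(w => w.Join());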


Vincent

