lucenenet-user mailing list archives

From Oliver Albrecht <o.albre...@oliver-albrecht.de>
Subject RE: [LUCENENET-594] Stackoverflow exception
Date Tue, 15 Aug 2017 13:12:11 GMT
Hello Shad,

thanks for your effort.

The content field of the index holds the text of several thousand files from the DMS part
of our ERP solution. These are mainly letters, invoices, offers and notes. Nothing special.
I think the problem here is that the content of many files is very similar. I don't know
(and as a user I shouldn't need to know) how the index is built internally, but I think it's a kind
of tree, and because of the similar content the tree is too "deep" for recursion.

So I think that DocumentDictionary.GetEntryIterator is not designed to work with the suggester
at this time. But what is the purpose of DocumentDictionary if it can't be used in this way?
If I cannot use the content of the index to feed the AnalyzingSuggester, the whole concept
of the suggester is useless to me. What I have in mind is something like Google
search, which presents suggestions for what you could search for.

kind regards

Oliver
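
To illustrate the recursion-depth concern raised in this thread, here is a minimal sketch of a
cycle check written both recursively and with an explicit stack. It is not Lucene's code: the
class FiniteCheck, the int[][] successor-table representation, and the chain() helper are all
hypothetical, and the recursive method is only loosely modeled on the shape of the IsFinite
check described in the thread. The point is that the iterative version keeps its bookkeeping
on the heap, so its depth is not limited by the thread's call stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for an automaton: next[s] lists the successor states of s.
final class FiniteCheck {

    // Recursive depth-first check: returns false if a cycle is reachable from s.
    // One stack frame per state on the current path, so a very long path can
    // exhaust the call stack.
    public static boolean isFiniteRecursive(int[][] next, int s, boolean[] onPath, boolean[] done) {
        onPath[s] = true;
        for (int t : next[s]) {
            if (onPath[t]) return false;                       // back edge: cycle found
            if (!done[t] && !isFiniteRecursive(next, t, onPath, done)) return false;
        }
        onPath[s] = false;
        done[s] = true;
        return true;
    }

    // Same check with an explicit stack; edge[s] remembers which outgoing
    // transition of s to explore next, replacing the implicit loop position
    // that the recursive version keeps in its stack frame.
    public static boolean isFiniteIterative(int[][] next, int start) {
        int n = next.length;
        boolean[] onPath = new boolean[n];
        boolean[] done = new boolean[n];
        int[] edge = new int[n];
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(start);
        onPath[start] = true;
        while (!stack.isEmpty()) {
            int s = stack.peek();
            if (edge[s] < next[s].length) {
                int t = next[s][edge[s]++];
                if (onPath[t]) return false;                   // back edge: cycle found
                if (!done[t]) {
                    onPath[t] = true;
                    stack.push(t);
                }
            } else {                                           // all successors explored
                onPath[s] = false;
                done[s] = true;
                stack.pop();
            }
        }
        return true;
    }

    // Builds a chain 0 -> 1 -> ... -> length-1, optionally looping back to state 0.
    public static int[][] chain(int length, boolean loopBack) {
        int[][] next = new int[length][];
        for (int i = 0; i < length; i++) {
            if (i + 1 < length) next[i] = new int[] { i + 1 };
            else next[i] = loopBack ? new int[] { 0 } : new int[0];
        }
        return next;
    }

    public static void main(String[] args) {
        // A chain this deep would likely overflow the recursive version
        // under a default thread stack size; the iterative one handles it.
        System.out.println(isFiniteIterative(chain(1_000_000, false), 0)); // true
        System.out.println(isFiniteIterative(chain(1_000_000, true), 0));  // false
    }
}
```

A method such as GetFiniteStrings, which accumulates results along each path, would need more
state per stack entry than this sketch carries, so the same transformation is less mechanical there.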

> Shad Storhaug <shad@shadstorhaug.com> wrote on 15 August 2017 at 14:29:
> 
> Oliver,
> 
> I have created a new issue so we can track this on JIRA: https://issues.apache.org/jira/projects/LUCENENET/issues/LUCENENET-594
> 
> I have confirmed the same behavior in Java Lucene with the following test:
> 
> [Test, LuceneNetSpecific]
> public void TestLUCENENET594()
> {
>     // Rather than relying on a file path somewhere, we store the
>     // files zipped in an embedded resource and unzip them to a
>     // known temp directory for the test.
>     DirectoryInfo indexDir = CreateTempDir("test-lucenenet-594");
>     using (var stream = GetType().getResourceAsStream("LUCENENET-594.zip"))
>     {
>         TestUtil.Unzip(stream, indexDir);
>     }
>
>     AnalyzingSuggester suggester = new AnalyzingSuggester(new GermanAnalyzer(Lucene.Net.Util.LuceneVersion.LUCENE_48));
>
>     Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.Open(indexDir);
>     IndexReader ir = DirectoryReader.Open(dir);
>     DocumentDictionary dict = new DocumentDictionary(ir, "Content", null, null);
>
>     IInputIterator iter = dict.GetEntryIterator();
>     suggester.Build(iter); // Throws stackoverflow exception
> }
> 
> Converted to Java:
> 
> public void testLUCENENET594() throws Exception
> {
>     // Rather than relying on a file path somewhere, we store the
>     // files zipped in an embedded resource and unzip them to a
>     // known temp directory for the test.
>     File indexDir = createTempDir("test-lucenenet-594");
>     File file = new File(getClass().getResource("LUCENENET-594.zip").toURI());
>     TestUtil.unzip(file, indexDir);
>
>     AnalyzingSuggester suggester = new AnalyzingSuggester(new org.apache.lucene.analysis.de.GermanAnalyzer(org.apache.lucene.util.Version.LUCENE_48));
>
>     org.apache.lucene.store.Directory dir = org.apache.lucene.store.FSDirectory.open(indexDir);
>     org.apache.lucene.index.IndexReader ir = org.apache.lucene.index.DirectoryReader.open(dir);
>     org.apache.lucene.search.suggest.DocumentDictionary dict =
>         new org.apache.lucene.search.suggest.DocumentDictionary(ir, "Content", null, null);
>
>     org.apache.lucene.search.suggest.InputIterator iter = dict.getEntryIterator();
>     suggester.build(iter); // Throws stackoverflow exception
> }
> 
> Both tests use the attached zip file, LUCENENET-594.zip as an embedded resource.
> 
> I can only conclude that the data in the index is invalid in some way or it is not valid
> to use the result of DocumentDictionary.GetEntryIterator() in conjunction with AnalyzingSuggester.Build().
> 
> Do note that the Automaton functionality was intentionally made recursive (https://issues.apache.org/jira/browse/LUCENE-6156),
> and since it is based on a regular expression, inputs that cause too many matches can overflow
> the call stack.
> 
> There is some information online about how to use the AnalyzingSuggester:
> 
> http://blog.mikemccandless.com/2012/09/lucenes-new-analyzing-suggester.html
> http://www.programcreek.com/java-api-examples/index.php?api=org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester
> http://puneetkhanal.blogspot.com/2013/04/simple-auto-suggester-using-lucene-41.html
> 
> You might also try analyzing the tests to determine the correct usage (https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Tests.Suggest/Suggest/Analyzing/AnalyzingSuggesterTest.cs).
> 
> None of these examples use the DocumentDictionary.GetEntryIterator() in conjunction with
> AnalyzingSuggester.Build(). I suspect that they either weren't designed to be used together
> or the data in your "Content" field isn't what is expected as valid input.
> 
> I suggest you add the details of how you are creating the index and what usage example(s)
> you are following to the JIRA issue (https://issues.apache.org/jira/projects/LUCENENET/issues/LUCENENET-594)
> so we can try to work out whether there is something wrong with the input data or the usage
> of AnalyzingSuggester is incorrect.
> 
> Thanks,
> Shad Storhaug (NightOwl888)
> 
> -----Original Message-----
> From: Oliver Albrecht [mailto:o.albrecht@oliver-albrecht.de]
> Sent: Tuesday, August 8, 2017 6:29 PM
> To: user@lucenenet.apache.org
> Subject: RE: Stackoverflow exception
> 
> Hello Shad,
> 
> I can confirm that the bug still exists in the CI build.
>
> The code to reproduce the issue was in my initial mail.
>
> I think that the problem is not the code to reproduce it but the index data needed.
>
> I could build a test program and send it to you along with the index data directory,
> so you could debug it yourself.
>
> I would send a link to the resulting zip on my Google Drive to your e-mail address.
> I'm not allowed to share the data with the public.
> 
> kind regards
> 
> Oliver
> 
> > Shad Storhaug <shad@shadstorhaug.com> wrote on 8 August 2017 at 11:30:
> > 
> > Oliver,
> > 
> > There have been a lot of bugs fixed since the last beta release. It would be useful
> > to know whether the bug still exists in the latest CI build so we aren't spending time working
> > on bugs that have already been fixed. It shouldn't take you long to swap out the packages
> > just to test out whether the problem is still present.
> >
> > If the problem still exists in the CI build, please provide us with the minimal
> > code to reproduce it. A console application that reproduces the issue would be fine, but it
> > would be ideal if you provide a pull request on GitHub (https://github.com/apache/lucenenet/pulls)
> > with a test in the AnalyzingSuggesterTest class (https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Tests.Suggest/Suggest/Analyzing/AnalyzingSuggesterTest.cs)
> > that fails with the issue (and mark it with the [LuceneNetSpecific] attribute) so we can ensure
> > the bug stays fixed throughout future porting efforts. Once we have a test, we can port it
> > back to Java and step through to find out where the execution paths diverge. Alternatively,
> > if you are willing to do the work of comparing with Java to find out where the problem is,
> > you could submit a PR containing both the test and the fix for it.
> >
> > Despite the fact there are nearly 8000 tests, there are some dusty corners that
> > are not covered and I think you may have stumbled upon one of them. And no, you are the first
> > to report this issue to us.
> >
> > All I can tell you is that the download counts on the new NuGet packages such as
> > Lucene.Net.Suggest, Lucene.Net.Highlighter, Lucene.Net.Facet, and Lucene.Net.Spatial are much
> > lower than I would expect them to be for a beta, and it would be extremely helpful if there
> > were some people dedicated to finding bugs in these packages and reporting them to us.
> > 
> > Thanks,
> > Shad Storhaug (NightOwl888)
> > 
> > -----Original Message-----
> > From: Oliver Albrecht [mailto:o.albrecht@oliver-albrecht.de]
> > Sent: Tuesday, August 8, 2017 3:08 PM
> > To: user@lucenenet.apache.org
> > Subject: RE: Stackoverflow exception
> > 
> > Hello Itamar, hello Shad,
> > 
> > thanks for the fast response.
> > 
> > I could provide you with the call stack, but I think it's useless because it's full of
> > calls to IsFinite(State s, OpenBitSet path, OpenBitSet visited), which is a recursive
> > function. For the same reason I think that using the CI build wouldn't change anything. The
> > IsFinite function still works recursively.
> >
> > I've tried to replace the recursion with a stack-based approach (using Stack), but
> > I'm not sure if my implementation is correct.
> >
> > However, if I use my non-recursive version of IsFinite, it crashes with a stack overflow
> > exception in GetFiniteStrings(State s, HashSet pathstates, HashSet strings, Int32sRef path,
> > int limit), which is also a recursive function. But this function is too complex for me to
> > convert into a non-recursive version without knowing exactly what the function should do.
> >
> > In my opinion, replacing the recursion with a non-recursive approach is
> > the only solution. Does no one else have this problem? I think an index with 4000
> > documents and a size of 15 MB is not so extraordinary. Or is this only a problem with how the
> > suggester works?
> >
> > I'm just trying to use Lucene to build a full-text query engine for our internal DMS
> > system. The system currently holds 450,000 documents with about 50 GB of data. I think the final
> > index will be around 2 GB in size.
> > 
> > kind regards
> > 
> > Oliver
> > 
> > > Shad Storhaug <shad@shadstorhaug.com> wrote on 7 August 2017 at 19:06:
> > > 
> > > Hi Oliver,
> > > 
> > > In addition to providing the full stack trace that Itamar mentioned,
> > > could you confirm the problem still exists if you use the latest CI
> > > build here:
> > > https://www.myget.org/gallery/lucene-net-ci
> > > 
> > > Thanks,
> > > Shad Storhaug (NightOwl888)
> > > 
> > > -----Original Message-----
> > > From: itamar.synhershko@gmail.com
> > > [mailto:itamar.synhershko@gmail.com]
> > > On Behalf Of Itamar Syn-Hershko
> > > Sent: Monday, August 7, 2017 9:06 PM
> > > To: user@lucenenet.apache.org
> > > Subject: Re: Stackoverflow exception
> > > 
> > > What is the full stacktrace please?
> > > 
> > > --
> > > 
> > > Itamar Syn-Hershko
> > > Freelance Developer & Consultant
> > > Elasticsearch Partner
> > > Microsoft MVP | Lucene.NET PMC
> > > http://code972.com | @synhershko <https://twitter.com/synhershko>
> > > http://BigDataBoutique.co.il/
> > > 
> > > On Mon, Aug 7, 2017 at 4:49 PM, Oliver Albrecht <o.albrecht@oliver-albrecht.de> wrote:
> > > 
> > > > Hello,
> > > > 
> > > > I'm using a DocumentDictionary to feed an AnalyzingSuggester using
> > > > the following code:
> > > >
> > > > AnalyzingSuggester suggester = new AnalyzingSuggester(
> > > >     new GermanAnalyzer(Lucene.Net.Util.LuceneVersion.LUCENE_48));
> > > >
> > > > Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.Open(indexDir);
> > > >
> > > > IndexReader ir = DirectoryReader.Open(dir);
> > > >
> > > > DocumentDictionary dict = new DocumentDictionary(ir, "Content", null, null);
> > > >
> > > > suggester.Build(dict.GetEntryIterator());
> > > >
> > > > I get a stack overflow exception on suggester.Build. The exception
> > > > is thrown in Lucene.Net.Util.Automaton.SpecialOperations.IsFinite.
> > > >
> > > > The index contains 10,000 documents and has no payload and no weight.
> > > > 
> > > > Kind regards
> > > > 
> > > > Oliver
