lucenenet-user mailing list archives

From Oliver Albrecht <o.albre...@oliver-albrecht.de>
Subject RE: [LUCENENET-594] Stackoverflow exception
Date Wed, 16 Aug 2017 12:15:21 GMT
Hello Shad,

Thanks for the hints. I will take a look.

Kind regards

Oliver

> Shad Storhaug <shad@shadstorhaug.com> wrote on 15 August 2017 at 20:01:
> 
> Hi Oliver,
> 
> > And yes, the content field can be empty.
> 
> This seems to conflict with the way the DocumentInputIterator data is processed. The data field is skipped if it is null, but it is processed as a "real" entry if it is an empty string (https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Suggest/Suggest/DocumentDictionary.cs#L197). I have confirmed this is the way it is supposed to work in Lucene, but even if the logic were changed to use string.IsNullOrEmpty() instead of checking for just null, there is still another problem causing the call stack to overflow.
> 
> From the DocumentDictionary docs: "if any of the term or (optionally) payload fields supplied do not have a value for a document, then the document is skipped by the dictionary". Clearly an empty string means it "has a value".
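
The difference between the two checks can be sketched in plain Java. This is only an illustration over a hypothetical list of field values, not code from Lucene or Lucene.NET: `SuggestEntryFilter` and both method names are made up for this sketch. The first method models the null-only check described in the message, the second the stricter null-or-empty variant.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: models which field values
// are skipped and which are passed on as suggestion entries.
final class SuggestEntryFilter {

    // Null-only check: an empty string still counts as "has a value".
    static List<String> skipNull(List<String> values) {
        List<String> kept = new ArrayList<>();
        for (String v : values) {
            if (v != null) {
                kept.add(v);
            }
        }
        return kept;
    }

    // Stricter variant: null and empty strings are both skipped.
    static List<String> skipNullOrEmpty(List<String> values) {
        List<String> kept = new ArrayList<>();
        for (String v : values) {
            if (v != null && !v.isEmpty()) {
                kept.add(v);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("Rechnung", "", null, "Angebot");
        System.out.println(skipNull(values));        // empty string is kept
        System.out.println(skipNullOrEmpty(values)); // empty string is dropped
    }
}
```

As the message notes, switching to the stricter check would only change which entries are fed to the suggester; it would not by itself address the stack overflow.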
> 
> This article (https://www.norconex.com/serving-autocomplete-suggestions-fast/) seems to have some more information about how the AnalyzingSuggester works, and the Lucene 4 Cookbook has some more tests demonstrating the basics (https://github.com/edng/lucene4_cookbook_examples/blob/master/src/main/java/org/edng/lucene4/example/AutoSuggestTest.java). You might also want to check out the book.
> 
> FYI - I tried replacing the DocumentDictionary with LuceneDictionary, as is done in the Lucene 4 Cookbook test, and it seems to make your example work. I can't be sure because I can't read German and I am not sure what your requirements are, but typing "cr" brings up a list of words that start with "cr".
> 
> IDictionary dictionary = new LuceneDictionary(indexReader, "Content");
> 
> Do note, if you are looking for multi-term matches that are not necessarily from the beginning of the phrase, then AnalyzingInfixSuggester sounds like it may be a better fit (https://stackoverflow.com/a/19315507), which is more along the lines of what Google does.
> 
> Also, it looks like I was completely wrong about DocumentDictionary not being compatible with AnalyzingSuggester (from http://blog.mikemccandless.com/2012/09/lucenes-new-analyzing-suggester.html):
> 
> > If you have suggestions in your index, as e.g. text and weight etc. as stored fields in your documents, you can use the DocumentDictionary class to enumerate the suggestions from your documents. You pass that to AnalyzingSuggester.build to build the suggester.
> 
> That said, it doesn't sound like you are feeding "suggestions" to the DocumentDictionary, but raw unprocessed text, which is probably why it isn't working for you. It stands to reason there is a realistic string length limit that is being exceeded by including the text of whole letters and invoices in this field, since all of the tests show this to be the "suggested text" of what is typed. Using more than 7 or 8 words for a search term suggestion seems excessive.
> 
> Thanks,
> Shad Storhaug (NightOwl888)
> 
> -----Original Message-----
> From: Oliver Albrecht [mailto:o.albrecht@oliver-albrecht.de]
> Sent: Tuesday, August 15, 2017 9:51 PM
> To: user@lucenenet.apache.org
> Subject: RE: [LUCENENET-594] Stackoverflow exception
> 
> Hello Shad,
> 
> Yes, the content field is stored in the index. I first used a version where the content was not stored, but the AnalyzingSuggester always returned no results when I built a DocumentDictionary over the content field and then performed a lookup. So I tried storing the content, and this is where the exception on building the DocumentDictionary happens.
> 
> And yes, the content field can be empty. We have several document types in our DMS where no real file exists. The data is stored in the metadata fields in the database. Because we want to completely replace the database-internal search with Lucene, we must also store the DMS entries without files in the index. We are storing the file content along with 50+ other metadata fields in the index.
> 
> Here is the function I use to write to the index:
> 
> private string EnsureNotNull(string input)
>  {
>  return input ?? string.Empty;
>  }
> 
> public override void IndexDocument(DMSDocument document)
>  {
>  if (document == null)
>  return;
>  var doc = new Document
>  {
>  new StoredField(DMSDocument.IdField,document.Id),
>  new StoredField(DMSDocument.PathField, EnsureNotNull(document.Path)),
>  new TextField(DMSDocument.ContentField,EnsureNotNull(document.Content), Field.Store.YES),
>  new TextField(DMSDocument.DATEINAMEField,EnsureNotNull(document.DATEINAME),Field.Store.NO),
>  new TextField(DMSDocument.KAT1Field,EnsureNotNull(document.KAT1),Field.Store.NO),
>  new TextField(DMSDocument.KAT2Field,EnsureNotNull(document.KAT2),Field.Store.NO),
>  new TextField(DMSDocument.KAT3Field,EnsureNotNull(document.KAT3),Field.Store.NO),
>  new TextField(DMSDocument.CREATEUSERField,EnsureNotNull(document.CREATEUSER),Field.Store.NO),
>  new TextField(DMSDocument.CREATE_DTField,EnsureNotNull(document.CREATE_DT),Field.Store.NO),
>  new TextField(DMSDocument.MODIFYUSERField,EnsureNotNull(document.MODIFYUSER),Field.Store.NO),
>  new TextField(DMSDocument.MODIFY_DTField,EnsureNotNull(document.MODIFY_DT),Field.Store.NO),
>  new TextField(DMSDocument.FILE_EXTField,EnsureNotNull(document.FILE_EXT),Field.Store.NO),
>  new TextField(DMSDocument.DOKUMENTENNRField,EnsureNotNull(document.DOKUMENTENNR),Field.Store.NO),
>  new TextField(DMSDocument.WVField,EnsureNotNull(document.WV),Field.Store.NO),
>  new TextField(DMSDocument.WVSTATUSField,EnsureNotNull(document.WVSTATUS),Field.Store.NO),
>  new TextField(DMSDocument.WVDATEField,EnsureNotNull(document.WVDATE),Field.Store.NO),
>  new TextField(DMSDocument.VERSANDDATUMField,EnsureNotNull(document.VERSANDDATUM),Field.Store.NO),
>  new TextField(DMSDocument.VERSANDARTField,EnsureNotNull(document.VERSANDART),Field.Store.NO),
>  new TextField(DMSDocument.APField,EnsureNotNull(document.AP),Field.Store.NO),
>  new TextField(DMSDocument.AP_BEZField,EnsureNotNull(document.AP_BEZ),Field.Store.NO),
>  new TextField(DMSDocument.AP_TELField,EnsureNotNull(document.AP_TEL),Field.Store.NO),
>  new TextField(DMSDocument.AP_FAXField,EnsureNotNull(document.AP_FAX),Field.Store.NO),
>  new TextField(DMSDocument.AP_MAILField,EnsureNotNull(document.AP_MAIL),Field.Store.NO),
>  new TextField(DMSDocument.AUFTRAGGEBERField,EnsureNotNull(document.AUFTRAGGEBER),Field.Store.NO),
>  new TextField(DMSDocument.PROJEKTField,EnsureNotNull(document.PROJEKT),Field.Store.NO),
>  new TextField(DMSDocument.OBJEKTField,EnsureNotNull(document.OBJEKT),Field.Store.NO),
>  new TextField(DMSDocument.LVField,EnsureNotNull(document.LV),Field.Store.NO),
>  new TextField(DMSDocument.INFOField,EnsureNotNull(document.INFO),Field.Store.NO),
>  new TextField(DMSDocument.ADRESSFELDField,EnsureNotNull(document.ADRESSFELD),Field.Store.NO),
>  new TextField(DMSDocument.BETREFFField,EnsureNotNull(document.BETREFF),Field.Store.NO),
>  new TextField(DMSDocument.BEZEICHNUNGField,EnsureNotNull(document.BEZEICHNUNG),Field.Store.NO),
>  new TextField(DMSDocument.BEDIENERField,EnsureNotNull(document.BEDIENER),Field.Store.NO),
>  new TextField(DMSDocument.SACHBEARBField,EnsureNotNull(document.SACHBEARB),Field.Store.NO),
>  new TextField(DMSDocument.SACHBEARB_TELField,EnsureNotNull(document.SACHBEARB_TEL),Field.Store.NO),
>  new TextField(DMSDocument.SL_KATALOGField,EnsureNotNull(document.SL_KATALOG),Field.Store.NO),
>  new TextField(DMSDocument.SL_LEISTUNGSNRField,EnsureNotNull(document.SL_LEISTUNGSNR),Field.Store.NO),
>  new TextField(DMSDocument.SL_KURZBEZField,EnsureNotNull(document.SL_KURZBEZ),Field.Store.NO),
>  new TextField(DMSDocument.SL_SELEKTIONField,EnsureNotNull(document.SL_SELEKTION),Field.Store.NO),
>  new TextField(DMSDocument.GE_NUMMERField,EnsureNotNull(document.GE_NUMMER),Field.Store.NO),
>  new TextField(DMSDocument.GE_KURZBEZField,EnsureNotNull(document.GE_KURZBEZ),Field.Store.NO),
>  new TextField(DMSDocument.GE_BEZField,EnsureNotNull(document.GE_BEZ),Field.Store.NO),
>  new TextField(DMSDocument.GE_FABRIKATField,EnsureNotNull(document.GE_FABRIKAT),Field.Store.NO),
>  new TextField(DMSDocument.GE_KENNZEICHENField,EnsureNotNull(document.GE_KENNZEICHEN),Field.Store.NO),
>  new TextField(DMSDocument.EMPFAENGERField,EnsureNotNull(document.EMPFAENGER),Field.Store.NO),
>  new TextField(DMSDocument.SenderNameField,EnsureNotNull(document.SenderName),Field.Store.NO),
>  new TextField(DMSDocument.SenderMailField,EnsureNotNull(document.SenderMail),Field.Store.NO),
>  };
>  this.IndexWriter.Value.UpdateDocument(new Term(DMSDocument.IdField, document.Id), doc);
>  }
> 
> Nothing special, I think. The DMSDocument.XXXXField members are constants for the field names.
> 
> kind regards
> 
> Oliver
> 
> > Shad Storhaug <shad@shadstorhaug.com> wrote on 15 August 2017 at 16:06:
> > 
> > Oliver,
> > 
> > When I was stepping through the code, I noticed that the "Content" field is returning an empty string (at least for the first few documents) when calling Document.GetField("Content").GetStringValue(). Are you using the Field.Store.YES option when writing your "Content" field so it can be read back? Even if that is not exactly the problem, I suspect this has more to do with how the data is written to the index than how it is being read.
> > 
> > Some code or pseudo-code demonstrating how the index is being created would be helpful.
> > 
> > Thanks,
> > Shad Storhaug (NightOwl888)
> > 
> > -----Original Message-----
> > From: Oliver Albrecht [mailto:o.albrecht@oliver-albrecht.de]
> > Sent: Tuesday, August 15, 2017 8:12 PM
> > To: user@lucenenet.apache.org
> > Subject: RE: [LUCENENET-594] Stackoverflow exception
> > 
> > Hello Shad,
> > 
> > Thanks for your effort.
> > 
> > The content field of the index is the content of several thousand files in the DMS part of our ERP solution. These are mainly letters, invoices, offers and notes. Nothing special. I think the problem here is that the content of many files is very similar. I don't know (and as a user I shouldn't need to know) how the index is built internally, but I think it's a kind of tree, and because of the similar content the tree is too "deep" for recursion.
> > 
> > So I think that DocumentDictionary.GetEntryIterator is not designed to work with the suggester at this time. But what is the purpose of DocumentDictionary if it can't be used in this way? If I cannot use the content of the index to feed the AnalyzingSuggester, the whole concept of the suggester is useless for me. What I have in mind is something like the Google search, which presents suggestions for what you could search for.
> > 
> > kind regards
> > 
> > Oliver
> > 
> > > Shad Storhaug <shad@shadstorhaug.com> wrote on 15 August 2017 at 14:29:
> > > 
> > > Oliver,
> > > 
> > > I have created a new issue so we can track this on JIRA:
> > > https://issues.apache.org/jira/projects/LUCENENET/issues/LUCENENET-594
> > > 
> > > I have confirmed the same behavior in Java Lucene with the following test:
> > > 
> > > [Test, LuceneNetSpecific]
> > > public void TestLUCENENET594()
> > > {
> > >     // Rather than relying on a file path somewhere, we store the
> > >     // files zipped in an embedded resource and unzip them to a
> > >     // known temp directory for the test.
> > >     DirectoryInfo indexDir = CreateTempDir("test-lucenenet-594");
> > >     using (var stream = GetType().getResourceAsStream("LUCENENET-594.zip"))
> > >     {
> > >         TestUtil.Unzip(stream, indexDir);
> > >     }
> > > 
> > >     AnalyzingSuggester suggester = new AnalyzingSuggester(new GermanAnalyzer(Lucene.Net.Util.LuceneVersion.LUCENE_48));
> > > 
> > >     Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.Open(indexDir);
> > >     IndexReader ir = DirectoryReader.Open(dir);
> > >     DocumentDictionary dict = new DocumentDictionary(ir, "Content", null, null);
> > > 
> > >     IInputIterator iter = dict.GetEntryIterator();
> > >     suggester.Build(iter); // Throws stackoverflow exception
> > > }
> > > 
> > > Converted to Java:
> > > 
> > > public void testLUCENENET594() throws Exception {
> > >     // Rather than relying on a file path somewhere, we store the
> > >     // files zipped in an embedded resource and unzip them to a
> > >     // known temp directory for the test.
> > >     File indexDir = createTempDir("test-lucenenet-594");
> > >     File file = new File(getClass().getResource("LUCENENET-594.zip").toURI());
> > >     TestUtil.unzip(file, indexDir);
> > > 
> > >     AnalyzingSuggester suggester = new AnalyzingSuggester(new org.apache.lucene.analysis.de.GermanAnalyzer(org.apache.lucene.util.Version.LUCENE_48));
> > > 
> > >     org.apache.lucene.store.Directory dir = org.apache.lucene.store.FSDirectory.open(indexDir);
> > >     org.apache.lucene.index.IndexReader ir = org.apache.lucene.index.DirectoryReader.open(dir);
> > >     org.apache.lucene.search.suggest.DocumentDictionary dict = new org.apache.lucene.search.suggest.DocumentDictionary(ir, "Content", null, null);
> > > 
> > >     org.apache.lucene.search.suggest.InputIterator iter = dict.getEntryIterator();
> > >     suggester.build(iter); // Throws stackoverflow exception
> > > }
> > > 
> > > Both tests use the attached zip file, LUCENENET-594.zip as an embedded resource.
> > > 
> > > I can only conclude that the data in the index is invalid in some way or it is not valid to use the result of DocumentDictionary.GetEntryIterator() in conjunction with AnalyzingSuggester.Build().
> > > 
> > > Do note that the Automaton functionality was intentionally made recursive (https://issues.apache.org/jira/browse/LUCENE-6156), and since it is based on a regular expression, inputs that cause too many matches can overflow the call stack.
> > > 
> > > There is some information online about how to use the AnalyzingSuggester:
> > > 
> > > http://blog.mikemccandless.com/2012/09/lucenes-new-analyzing-suggester.html
> > > http://www.programcreek.com/java-api-examples/index.php?api=org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester
> > > http://puneetkhanal.blogspot.com/2013/04/simple-auto-suggester-using-lucene-41.html
> > > 
> > > You might also try analyzing the tests to determine the correct usage (https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Tests.Suggest/Suggest/Analyzing/AnalyzingSuggesterTest.cs).
> > > 
> > > None of these examples use DocumentDictionary.GetEntryIterator() in conjunction with AnalyzingSuggester.Build(). I suspect that they either weren't designed to be used together or the data in your "Content" field isn't what is expected as valid input.
> > > 
> > > I suggest you add the details of how you are creating the index and what usage example(s) you are following to the JIRA issue (https://issues.apache.org/jira/projects/LUCENENET/issues/LUCENENET-594) so we can try to work out whether there is something wrong with the input data or the usage of AnalyzingSuggester is incorrect.
> > > 
> > > Thanks,
> > > Shad Storhaug (NightOwl888)
> > > 
> > > -----Original Message-----
> > > From: Oliver Albrecht [mailto:o.albrecht@oliver-albrecht.de]
> > > Sent: Tuesday, August 8, 2017 6:29 PM
> > > To: user@lucenenet.apache.org
> > > Subject: RE: Stackoverflow exception
> > > 
> > > Hello Shad,
> > > 
> > > I can confirm that the bug still exists in the CI build.
> > > 
> > > The code to reproduce the issue was in my initial mail.
> > > 
> > > I think the problem is not the code needed to reproduce it but the index data.
> > > 
> > > I could build a test program and send it to you along with the index data directory, so you could debug it yourself.
> > > 
> > > I would send a link to the resulting zip on my Google Drive to your e-mail address. I'm not allowed to share the data with the public.
> > > 
> > > kind regards
> > > 
> > > Oliver
> > > 
> > > > Shad Storhaug <shad@shadstorhaug.com> wrote on 8 August 2017 at 11:30:
> > > > 
> > > > Oliver,
> > > > 
> > > > There have been a lot of bugs fixed since the last beta release. It would be useful to know whether the bug still exists in the latest CI build so we aren't spending time working on bugs that have already been fixed. It shouldn't take you long to swap out the packages just to test out whether the problem is still present.
> > > > 
> > > > If the problem still exists in the CI build, please provide us with the minimal code to reproduce it. A console application that reproduces the issue would be fine, but it would be ideal if you provide a pull request on GitHub (https://github.com/apache/lucenenet/pulls) with a test in the AnalyzingSuggesterTest class (https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Tests.Suggest/Suggest/Analyzing/AnalyzingSuggesterTest.cs) that fails with the issue (and mark it with the [LuceneNetSpecific] attribute) so we can ensure the bug stays fixed throughout future porting efforts. Once we have a test, we can port it back to Java and step through to find out where the execution paths diverge. Alternatively, if you are willing to do the work of comparing with Java to find out where the problem is, you could submit a PR containing both the test and the fix for it.
> > > > 
> > > > Despite the fact there are nearly 8000 tests, there are some dusty corners that are not covered and I think you may have stumbled upon one of them. And no, you are the first to report this issue to us.
> > > > 
> > > > All I can tell you is that the download counts on the new NuGet packages such as Lucene.Net.Suggest, Lucene.Net.Highlighter, Lucene.Net.Facet, and Lucene.Net.Spatial are much lower than I would expect them to be for a beta, and it would be extremely helpful if there were some people dedicated to finding bugs in these packages and reporting them to us.
> > > > 
> > > > Thanks,
> > > > Shad Storhaug (NightOwl888)
> > > > 
> > > > -----Original Message-----
> > > > From: Oliver Albrecht [mailto:o.albrecht@oliver-albrecht.de]
> > > > Sent: Tuesday, August 8, 2017 3:08 PM
> > > > To: user@lucenenet.apache.org
> > > > Subject: RE: Stackoverflow exception
> > > > 
> > > > Hello Itamar, hello Shad,
> > > > 
> > > > Thanks for the fast response.
> > > > 
> > > > I could provide you with the call stack, but I think it's useless because it's full of calls to IsFinite(State s, OpenBitSet path, OpenBitSet visited), which is a recursive function. For the same reason I think that using the CI build wouldn't change anything. The IsFinite function is still recursive.
> > > > 
> > > > I've tried to replace the recursion with a stack-based approach (using Stack), but I'm not sure if my implementation is correct.
> > > > 
> > > > However, if I use my non-recursive version of IsFinite, it crashes with a stack overflow exception in GetFiniteStrings(State s, HashSet pathstates, HashSet strings, Int32sRef path, int limit), which is also a recursive function. But this function is too complex for me to convert into a non-recursive version without knowing exactly what the function should do.
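
For readers unfamiliar with the stack-based rewrite mentioned in this message, here is a generic Java sketch of the idea: a depth-first cycle check that keeps its own frame stack on the heap instead of recursing. It is not Lucene's IsFinite and not the implementation described in the message; `IterativeFiniteCheck` and the `int[][]` adjacency array are assumptions made for this sketch, with `path` and `visited` playing roles similar to the two bit sets in the signature quoted above.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;

// Hypothetical stand-in for an automaton: transitions[s] lists the
// target states of state s. Not Lucene's State/Transition classes.
final class IterativeFiniteCheck {

    // Returns false if a cycle is reachable from start, true otherwise.
    static boolean isFinite(int[][] transitions, int start) {
        BitSet path = new BitSet();    // states on the current DFS path
        BitSet visited = new BitSet(); // states fully explored, no cycle found
        Deque<int[]> stack = new ArrayDeque<>(); // frames: {state, next edge index}
        path.set(start);
        stack.push(new int[] {start, 0});
        while (!stack.isEmpty()) {
            int[] frame = stack.peek();
            int state = frame[0];
            if (frame[1] < transitions[state].length) {
                int to = transitions[state][frame[1]++];
                if (path.get(to)) {
                    return false; // edge back into the current path: a cycle
                }
                if (!visited.get(to)) {
                    path.set(to);
                    stack.push(new int[] {to, 0});
                }
            } else {
                // All outgoing edges handled: leave this state.
                path.clear(state);
                visited.set(state);
                stack.pop();
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A long chain that a recursive version might not survive.
        int n = 200_000;
        int[][] chain = new int[n][];
        for (int i = 0; i < n; i++) {
            chain[i] = (i + 1 < n) ? new int[] {i + 1} : new int[0];
        }
        System.out.println(isFinite(chain, 0));
        System.out.println(isFinite(new int[][] {{1}, {2}, {0}}, 0));
    }
}
```

The traversal depth is then bounded by heap memory rather than by the thread's call stack. A function such as GetFiniteStrings carries more state per call, so converting it would need a richer frame type than the two-element array used here.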
> > > > 
> > > > In my opinion, replacing the recursion with a non-recursive approach is the only solution. Does no one else have this problem? I think an index with 4000 documents and a size of 15 MB is not so extraordinary. Or is this only a problem with how the suggester works?
> > > > 
> > > > I'm just trying to use Lucene to build a full-text query engine for our internal DMS system. The system currently holds 450,000 documents with ca. 50 GB of data. I think the final index will be around 2 GB in size.
> > > > 
> > > > kind regards
> > > > 
> > > > Oliver
> > > > 
> > > > > Shad Storhaug <shad@shadstorhaug.com> wrote on 7 August 2017 at 19:06:
> > > > > 
> > > > > Hi Oliver,
> > > > > 
> > > > > In addition to providing the full stack trace that Itamar
> > > > > mentioned, could you confirm the problem still exists if you use
> > > > > the latest CI build here:
> > > > > https://www.myget.org/gallery/lucene-net-ci
> > > > > 
> > > > > Thanks,
> > > > > Shad Storhaug (NightOwl888)
> > > > > 
> > > > > -----Original Message-----
> > > > > From: itamar.synhershko@gmail.com
> > > > > [mailto:itamar.synhershko@gmail.com]
> > > > > On Behalf Of Itamar Syn-Hershko
> > > > > Sent: Monday, August 7, 2017 9:06 PM
> > > > > To: user@lucenenet.apache.org
> > > > > Subject: Re: Stackoverflow exception
> > > > > 
> > > > > What is the full stacktrace please?
> > > > > 
> > > > > --
> > > > > 
> > > > > Itamar Syn-Hershko
> > > > > Freelance Developer & Consultant Elasticsearch Partner Microsoft
> > > > > MVP | Lucene.NET PMC http://code972.com | @synhershko
> > > > > <https://twitter.com/synhershko> http://BigDataBoutique.co.il/
> > > > > 
> > > > > On Mon, Aug 7, 2017 at 4:49 PM, Oliver Albrecht <o.albrecht@oliver-albrecht.de> wrote:
> > > > > 
> > > > > > Hello,
> > > > > > 
> > > > > > I'm using a DocumentDictionary to feed an AnalyzingSuggester
> > > > > > using the following code:
> > > > > > 
> > > > > > AnalyzingSuggester suggester = new AnalyzingSuggester(new GermanAnalyzer(Lucene.Net.Util.LuceneVersion.LUCENE_48));
> > > > > > 
> > > > > > Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.Open(indexDir);
> > > > > > 
> > > > > > IndexReader ir = DirectoryReader.Open(dir);
> > > > > > 
> > > > > > DocumentDictionary dict = new DocumentDictionary(ir, "Content", null, null);
> > > > > > 
> > > > > > suggester.Build(dict.GetEntryIterator());
> > > > > > 
> > > > > > I get a stack overflow exception on suggester.Build. The
> > > > > > exception is thrown in Lucene.Net.Util.Automaton.SpecialOperations.IsFinite.
> > > > > > 
> > > > > > The index contains 10,000 documents and has no payload and no weight.
> > > > > > 
> > > > > > Kind regards
> > > > > > 
> > > > > > Oliver
