lucenenet-user mailing list archives

From <Stephan.Oehl...@microtool.de>
Subject OutOfMemoryException when indexing document
Date Mon, 27 Aug 2012 09:25:25 GMT
Hi all,

We are using Lucene.Net 2.9.4g and are getting an OutOfMemoryException
when indexing a certain document.

The document is a zip file containing a text file with a very large
number of hexadecimal byte values, written as two-character pairs
separated by spaces (e.g. lines like this one: 48690 - 47 32 68 40 1f f8 ce 32 01 00 00 03
00 00 04 02 ce c2 17 1f aa 24 da 06 90 06 00 00 f4 36 02 00 ff ff ff ff
05 00 ff ff 34 41 ff ff 06 fa 00 ff ff ee 02 05 00 00 00 00 dd)

The OutOfMemoryException happens when we call
Lucene.Net.Analysis.TeeSinkTokenFilter.ConsumeAllTokens(). We observe
that, within the instance of the nested class
Lucene.Net.Analysis.TeeSinkTokenFilter.SinkTokenStream, the field
List<AttributeSource.State> cachedStates contains millions of entries
before the exception is thrown. The exception occurs in
Lucene.Net.Analysis.Token.Clone(), at the line "t.termBuffer = new
char[termBuffer.Length];".

It seems our problem is the number of tokens created during the
indexing phase. Do you know whether this behavior is
- a bug in Lucene.Net,
- wrong usage of the API, or
- bad configuration of the indexer?
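For anyone following along: the growth pattern described above can be sketched in a small Python analogue. This is not Lucene.Net code; the names `SinkStream` and `consume_all_tokens` are stand-ins for `SinkTokenStream` and `TeeSinkTokenFilter.ConsumeAllTokens()`, and the tokenizer is a plain whitespace split. It only illustrates why a sink that clones and caches a state per token needs memory proportional to the total token count, which is huge for a hex dump where every two-character byte value is its own token.

```python
def tokenize(text):
    """Whitespace tokenizer, standing in for the analyzer chain."""
    return text.split()

class SinkStream:
    """Stand-in for SinkTokenStream: caches one cloned state per token."""
    def __init__(self):
        self.cached_states = []

    def add_state(self, token):
        # Each cached state copies the term buffer, analogous to
        # Token.Clone() allocating a new char[termBuffer.Length],
        # so memory use grows linearly with the number of tokens.
        self.cached_states.append(list(token))

def consume_all_tokens(text, sink):
    """Stand-in for TeeSinkTokenFilter.ConsumeAllTokens(): drains the
    whole stream eagerly, caching every state before any is consumed."""
    for token in tokenize(text):
        sink.add_state(token)
    return len(sink.cached_states)

# A hex-dump line yields one token per "xx " pair, so a file of a few
# megabytes of such lines produces millions of cached states.
line = "47 32 68 40 1f f8 ce 32 "  # 8 tokens per repetition
sink = SinkStream()
print(consume_all_tokens(line * 100_000, sink))  # 800000
```

The point is that the eager drain caches all states up front, so peak memory scales with the document's token count rather than with any per-token bound.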

Thank you very much
Stephan Oehlert
