jackrabbit-users mailing list archives

From Johannes Boneschanscher <jackrab...@boneschanscher.net>
Subject Re: OutOfMemoryError on reindexing jackrabbit
Date Thu, 25 Jun 2009 19:05:59 GMT
Hm,

Sweet. I'll probably backport it to 1.4. I guess that by "catching any
runtime exception" you mean catching any Throwable ;-)
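
The distinction matters here because OutOfMemoryError is an Error, not a
RuntimeException, so a catch clause for RuntimeException alone would not
stop it. A minimal sketch of the idea follows. It is not the actual
Jackrabbit 1.5 change: the Extractor interface and the SafeExtractor class
are hypothetical names made up for illustration, and returning an empty
reader on failure is an assumption about the desired behaviour.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical stand-in for a text extractor; not the Jackrabbit interface. */
interface Extractor {
    Reader extractText(InputStream stream, String type, String encoding)
            throws IOException;
}

/** Wraps another extractor and turns any failure into "no text extracted". */
final class SafeExtractor implements Extractor {
    private final Extractor delegate;

    SafeExtractor(Extractor delegate) {
        this.delegate = delegate;
    }

    @Override
    public Reader extractText(InputStream stream, String type, String encoding) {
        try {
            return delegate.extractText(stream, type, encoding);
        } catch (Throwable t) {
            // Throwable covers Errors such as OutOfMemoryError as well as
            // RuntimeExceptions and IOExceptions. The document is simply
            // indexed without full text instead of aborting the reindex.
            System.err.println("Text extraction failed for " + type + ": " + t);
            try {
                stream.close();
            } catch (IOException ignored) {
                // Nothing useful can be done if closing fails as well.
            }
            return new StringReader("");
        }
    }
}
```

Catching Throwable is a blunt instrument: after an OutOfMemoryError the JVM
may be in a poor state, so this only keeps the indexer going and does not
address whatever makes the PDF parser allocate so much in the first place.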

Thanks and keep up the great work!

Johannes

Marcel Reutegger wrote:
> Hi,
>
> this is 'fixed' in the 1.5 release by catching any runtime exception
> that might be thrown during pdf text extraction. it's not perfect, but
> it keeps the system running.
>
> regards
>  marcel
>
> 2009/6/22 Johannes Boneschanscher <jackrabbit@boneschanscher.net>
>   
>> Hi fellow jackrabbit users,
>>
>> On reindexing the entire Jackrabbit 1.4 repository I get the following problem. With Sun JRE 6 I got the following stack trace (Java 5 doesn't give any):
>>
>> java.lang.OutOfMemoryError: Java heap space
>>   at java.util.Arrays.copyOf(Arrays.java:2734)
>>   at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
>>   at java.util.ArrayList.add(ArrayList.java:351)
>>   at org.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:105)
>>   at org.pdfbox.cmapparser.CMapParser.parse(CMapParser.java:97)
>>   at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:326)
>>   at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:174)
>>   at org.pdfbox.util.PDFTextStripper.showString(PDFTextStripper.java:461)
>>   at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:690)
>>   at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:128)
>>   at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:268)
>>   at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:200)
>>   at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:172)
>>   at org.apache.jackrabbit.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:75)
>>   at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
>>   at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
>>   at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addBinaryValue(NodeIndexer.java:393)
>>   at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addValue(NodeIndexer.java:282)
>>   at org.apache.jackrabbit.core.query.lucene.NodeIndexer.createDoc(NodeIndexer.java:221)
>>   at org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:861)
>>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createDocument(MultiIndex.java:803)
>>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createDocument(MultiIndex.java:818)
>>   at org.apache.jackrabbit.core.query.lucene.MultiIndex$AddNode.execute(MultiIndex.java:1519)
>>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.executeAndLog(MultiIndex.java:936)
>>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1017)
>>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>>
>> This is very unexpected because memory usage stays between 128 and 256 MB and the maximum heap size is set to 1.3 GB. Also, system memory is readily available.
>>
>> It may be related to:
>>
>> https://issues.apache.org/jira/browse/PDFBOX-313
>>
>> Is this resolved in a newer 1.4 version of Jackrabbit? We have a text-extractor build with the following info in the META-INF pom.properties file:
>>
>> #Generated by Maven
>> #Fri Jan 11 14:40:02 EET 2008
>> version=1.4
>> groupId=org.apache.jackrabbit
>> artifactId=jackrabbit-text-extractors
>>
>> Regards,
>>
>> Johannes
>>     

