jackrabbit-users mailing list archives

From Marcel Reutegger <marcel.reuteg...@gmx.net>
Subject Re: OutOfMemoryError on reindexing jackrabbit
Date Thu, 25 Jun 2009 07:15:57 GMT
Hi,

This is 'fixed' in the 1.5 release by catching any runtime exception
that might be thrown during PDF text extraction. It's not perfect, but
it keeps the system running.
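
As a rough illustration of that approach (a sketch only, not the actual
Jackrabbit 1.5 code), the wrapper below catches runtime exceptions from a
delegate extractor and falls back to empty text so that indexing can continue.
The Extractor interface and the extractSafely method are hypothetical names
invented for this example.

```java
import java.io.IOException;

public class SafeExtraction {

    /** Hypothetical stand-in for a text extractor; not a Jackrabbit interface. */
    public interface Extractor {
        String extract(byte[] data) throws IOException;
    }

    /**
     * Runs the delegate and returns its text. If the delegate fails with an
     * IOException or any RuntimeException, returns an empty string instead,
     * so a single broken document does not abort the whole indexing run.
     */
    public static String extractSafely(Extractor delegate, byte[] data) {
        try {
            return delegate.extract(data);
        } catch (IOException | RuntimeException e) {
            // Assumption for this sketch: the document is indexed without
            // full-text content and the failure is only reported.
            System.err.println("Text extraction failed: " + e);
            return "";
        }
    }
}
```

Note that java.lang.OutOfMemoryError extends Error rather than
RuntimeException, so a catch clause like this one does not intercept a genuine
heap exhaustion.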

regards
 marcel

2009/6/22 Johannes Boneschanscher <jackrabbit@boneschanscher.net>
>
> Hi fellow jackrabbit users,
>
> When reindexing the entire Jackrabbit 1.4 repository I run into the following problem. With Sun JRE 6 I got the following stack trace (Java 5 doesn't give any):
>
> java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:2734)
>   at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
>   at java.util.ArrayList.add(ArrayList.java:351)
>   at org.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:105)
>   at org.pdfbox.cmapparser.CMapParser.parse(CMapParser.java:97)
>   at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:326)
>   at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:174)
>   at org.pdfbox.util.PDFTextStripper.showString(PDFTextStripper.java:461)
>   at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:690)
>   at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:128)
>   at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:268)
>   at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:200)
>   at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:172)
>   at org.apache.jackrabbit.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:75)
>   at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
>   at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
>   at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addBinaryValue(NodeIndexer.java:393)
>   at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addValue(NodeIndexer.java:282)
>   at org.apache.jackrabbit.core.query.lucene.NodeIndexer.createDoc(NodeIndexer.java:221)
>   at org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:861)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createDocument(MultiIndex.java:803)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createDocument(MultiIndex.java:818)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex$AddNode.execute(MultiIndex.java:1519)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.executeAndLog(MultiIndex.java:936)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1017)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>
> This is very unexpected because memory usage stays between 128 and 256 MB and the maximum heap size is set to 1.3 GB. System memory is also readily available.
>
> It may be related to:
>
> https://issues.apache.org/jira/browse/PDFBOX-313
>
> Is this resolved in a newer 1.4 version of Jackrabbit? We have a text-extractor build with the following info in the META-INF pom.properties file:
>
> #Generated by Maven
> #Fri Jan 11 14:40:02 EET 2008
> version=1.4
> groupId=org.apache.jackrabbit
> artifactId=jackrabbit-text-extractors
>
> Regards,
>
> Johannes
