manifoldcf-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Donald Van den Driessche (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
Date Tue, 19 Mar 2019 11:16:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795984#comment-16795984
] 

Donald Van den Driessche commented on CONNECTORS-1593:
------------------------------------------------------

It doesn't seem to be 1 specific document. 
The process reads an SQL table and will then retrieve the url in this table to get the document.

Each time the job processes 4100 documents. Sometimes the error occures after 200 documents,
sometimes after 1500, sometimes after 3800,... And weirdly sometimes the whole process runs
perfectly.
It's a really weird situation.

I've tried connector logging on, but there wasn't much logged.

> Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1593
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.12
>            Reporter: Donald Van den Driessche
>            Assignee: Karl Wright
>            Priority: Major
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions I get an
Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of memory - shutting
down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$MonitoredAddActivityWrapper.sendDocument(IncrementalIngester.java:3471)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at digital.formica.manifold.connector.transformation.fetchwebresource.WebresourceFetchTransformationConnector.addOrReplaceDocumentWithException(WebresourceFetchTransformationConnector.java:118)
> {code}
> I've allocated 8g of heap size, Installed the latest version of Tika (1.20) and PDFBOX
(2.0.14).
> But no solutions found.
> After a heap dump and analyzing this dump, I notice that it is the Integer class that
takes about 2.6g of memory. 
> Any suggestions?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message