pdfbox-dev mailing list archives

From "Daniel Wiell (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PDFBOX-313) OutOfMemoryError for larger PDF text extraction
Date Thu, 05 Feb 2009 16:47:59 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Wiell updated PDFBOX-313:
--------------------------------

    Attachment: Fix_for_PDFBOX-313.patch

The PDFStreamEngine.documentFontCache isn't very effective at reducing the number of PDFont
objects that get instantiated.

I created FontIdentifier to use as the key in the cache, instead of COSDictionary. FontIdentifier
uses base name, subtype and encoding to uniquely identify a PDFont. Not being that familiar
with PDFBox, I hope that is enough.
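
As an illustration only (this is not the attached patch, whose contents are not shown
here), a cache key built from those three fields could look roughly like the sketch below.
FontCacheSketch, getFont, instantiations and the use of Object as a stand-in for PDFont are
hypothetical names chosen for the example; only the idea of keying on base name, subtype
and encoding comes from the description above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Sketch of a value-based cache key: equal fields mean "same font". */
final class FontIdentifier {
    private final String baseName;
    private final String subtype;
    private final String encoding;

    FontIdentifier(String baseName, String subtype, String encoding) {
        this.baseName = baseName;
        this.subtype = subtype;
        this.encoding = encoding;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof FontIdentifier)) {
            return false;
        }
        FontIdentifier that = (FontIdentifier) other;
        // Objects.equals tolerates null fields (e.g. a font without an encoding entry).
        return Objects.equals(baseName, that.baseName)
                && Objects.equals(subtype, that.subtype)
                && Objects.equals(encoding, that.encoding);
    }

    @Override
    public int hashCode() {
        return Objects.hash(baseName, subtype, encoding);
    }
}

/** Hypothetical cache; Object stands in for PDFont to keep the sketch self-contained. */
final class FontCacheSketch {
    private final Map<FontIdentifier, Object> cache = new HashMap<>();
    private int instantiations = 0;

    Object getFont(String baseName, String subtype, String encoding) {
        FontIdentifier key = new FontIdentifier(baseName, subtype, encoding);
        Object font = cache.get(key);
        if (font == null) {
            // Only reached the first time a given (baseName, subtype, encoding) is seen.
            font = new Object();
            instantiations++;
            cache.put(key, font);
        }
        return font;
    }

    int instantiations() {
        return instantiations;
    }
}
```

With a key like this, two font dictionaries that describe the same font map to one cache
entry even when they are distinct dictionary objects, which is the effect the change is
after. Whether the three fields are always sufficient to tell fonts apart is the open
question raised above.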

A patch has been attached. After applying the patch, extracting the text from the two linked
PDFs no longer throws OutOfMemoryError, and the tests in TestTextStripper still pass.

PDFBOX-296 is probably a duplicate of this one.


> OutOfMemoryError for larger PDF text extraction
> -----------------------------------------------
>
>                 Key: PDFBOX-313
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-313
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Priority: Minor
>         Attachments: Fix_for_PDFBOX-313.patch
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1805929
> Originally submitted by tdonohue on 2007-10-01 13:51.
> Hello,
> I'm using PDFBox 0.7.3, which is distributed with DSpace (www.dspace.org) version 1.4.2.
> Currently, I'm running into OutOfMemoryError exceptions whenever I attempt text extraction
> from a few larger PDFs (>10MB).  I've also just tried replacing PDFBox 0.7.3 with your
> latest nightly-build (from Oct 1), and the error still seems to be happening.
> My JVM options are currently set to:
> -Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
> Here's a few of the problem PDFs:
> 15MB PDF:
> https://test.ideals.uiuc.edu/bitstream/2142/2050/1/tr05.pdf
> 13MB PDF:
> https://test.ideals.uiuc.edu/bitstream/2142/1936/1/RRE06.PDF
> Here's an example error stacktrace:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.HashMap.addEntry(HashMap.java:753)
>         at java.util.HashMap.put(HashMap.java:385)
>         at org.fontbox.cmap.CMap.addMapping(CMap.java:131)
>         at org.fontbox.cmap.CMapParser.parse(CMapParser.java:202)
>         at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:509)
>         at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:380)
>         at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:343)
>         at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>         at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:497)
>         at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:218)
>         at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:177)
>         at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
>         at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
>         at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
>         at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)
>         at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
>         at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
>         at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
>         at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
>         at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
>         at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCollection(MediaFilterManager.java:417)
>         at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:359)
> Finally, here's how the DSpace API is calling PDFBox:
>         PDFTextStripper pts = new PDFTextStripper();
>         PDFParser parser = null;
>         String extractedText = null;
>         try
>         {
>             parser = new PDFParser(source);
>             parser.parse();
>             extractedText = pts.getText(new PDDocument(parser.getDocument()));
>         }
>         finally
>         {
>             try
>             {
>                 parser.getDocument().close();
>             }
>             catch(Exception e)
>             {
>                 log.error("Error closing temporary PDF file: " + e.getMessage(), e);
>             }
>         }
> [comment on SourceForge]
> Originally sent by tdonohue.
> Logged In: YES 
> user_id=1320825
> Originator: YES
> I neglected to mention both of these PDFs were initially image-based and were recently
> OCRed using Adobe Acrobat 8 Pro.  I'm not sure that would matter for PDFBox to perform text
> extraction, but it's another commonality between these PDFs.
> Thanks in advance for any help you can provide!
> - Tim

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

