From: Andreas Lehmkühler (JIRA)
To: pdfbox-dev@incubator.apache.org
Date: Fri, 6 Feb 2009 11:16:59 -0800 (PST)
Subject: [jira] Resolved: (PDFBOX-313) OutOfMemoryError for larger PDF text extraction
Message-ID: <1662904282.1233947819678.JavaMail.jira@brutus>


     [ https://issues.apache.org/jira/browse/PDFBOX-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-313.
---------------------------------------

    Resolution: Fixed
    Fix Version/s: 0.8.0-incubator

As of revision 741680, a suitable key is used for caching, as Daniel suggested. Everything finally works fine.

> OutOfMemoryError for larger PDF text extraction
> -----------------------------------------------
>
>                 Key: PDFBOX-313
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-313
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Priority: Minor
>             Fix For: 0.8.0-incubator
>
>         Attachments: Fix_for_PDFBOX-313.patch
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1805929
> Originally submitted by tdonohue on 2007-10-01 13:51.
> Hello,
> I'm using PDFBox 0.7.3, which is distributed with DSpace (www.dspace.org) version 1.4.2. Currently, I'm running into OutOfMemoryError exceptions whenever I attempt text extraction from a few larger PDFs (>10MB). I've also just tried replacing PDFBox 0.7.3 with your latest nightly build (from Oct 1), and the error still seems to be happening.
> My JVM options are currently set to:
> -Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
> Here's a few of the problem PDFs:
> 15MB PDF:
> https://test.ideals.uiuc.edu/bitstream/2142/2050/1/tr05.pdf
> 13MB PDF:
> https://test.ideals.uiuc.edu/bitstream/2142/1936/1/RRE06.PDF
> Here's an example error stacktrace:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.HashMap.addEntry(HashMap.java:753)
>         at java.util.HashMap.put(HashMap.java:385)
>         at org.fontbox.cmap.CMap.addMapping(CMap.java:131)
>         at org.fontbox.cmap.CMapParser.parse(CMapParser.java:202)
>         at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:509)
>         at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:380)
>         at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:343)
>         at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>         at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:497)
>         at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:218)
>         at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:177)
>         at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
>         at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
>         at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
>         at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)
>         at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
>         at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
>         at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
>         at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
>         at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
>         at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCollection(MediaFilterManager.java:417)
>         at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:359)
> Finally, here's how the DSpace API is calling PDFBox:
>         PDFTextStripper pts = new PDFTextStripper();
>         PDFParser parser = null;
>         String extractedText = null;
>         try
>         {
>             parser = new PDFParser(source);
>             parser.parse();
>             extractedText = pts.getText(new PDDocument(parser.getDocument()));
>         }
>         finally
>         {
>             try
>             {
>                 parser.getDocument().close();
>             }
>             catch(Exception e)
>             {
>                 log.error("Error closing temporary PDF file: " + e.getMessage(), e);
>             }
>         }
> [comment on SourceForge]
> Originally sent by tdonohue.
> Logged In: YES
> user_id=1320825
> Originator: YES
> I neglected to mention both of these PDFs were initially image-based and were recently OCRed using Adobe Acrobat 8 Pro. I'm not sure that would matter for PDFBox to perform text extraction, but it's another commonality between these PDFs.
> Thanks in advance for any help you can provide!
> - Tim

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
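
The resolution note says the fix was to use a suitable key for caching, and the stack trace shows the heap filling up inside CMap parsing. A minimal Java sketch of that general idea follows, assuming the cache maps a stable string key (for example a CMap name) to the parsed result, so that repeated lookups reuse one parsed object instead of parsing again. The names ParsedCMapCache, get and parseCount are hypothetical and chosen for illustration; this is not PDFBox's actual code and not the content of Fix_for_PDFBOX-313.patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical cache of parsed CMap-like objects, keyed by a stable string.
 * Illustration only; not a PDFBox class.
 */
final class ParsedCMapCache<V> {
    // String keys have value-based equals/hashCode, so the same name
    // always finds the same entry.
    private final Map<String, V> cache = new HashMap<>();

    // Counts how often the (expensive) parser actually ran.
    private int parseCount = 0;

    /** Returns the cached value for key, parsing it only on the first request. */
    V get(String key, Function<String, V> parser) {
        return cache.computeIfAbsent(key, k -> {
            parseCount++;
            return parser.apply(k);
        });
    }

    int parseCount() {
        return parseCount;
    }
}
```

With a key like this, asking for the same name twice returns the same instance and runs the parser once. If the key were instead an object without stable equality (an assumption about what could go wrong, not a statement about the original bug), every lookup would miss and a new copy would be parsed and held each time.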