From: "Timo Boehme (JIRA)"
To: dev@pdfbox.apache.org
Date: Wed, 22 Jan 2014 09:41:20 +0000 (UTC)
Subject: [jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

    [ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878461#comment-13878461 ]

Timo Boehme commented on PDFBOX-1808:
-------------------------------------

[~jguyenot] please check what the memory statistics provided by Java actually mean. *Total memory* is (as the name says) all the memory the VM currently holds, not the memory your application is using. What you want is the used memory (by your application). This has to be calculated as totalMem - freeMem (see e.g. http://stackoverflow.com/questions/3571203/what-is-the-exact-meaning-of-runtime-getruntime-totalmemory-and-freememory)

> PDFTextStripper.getText - hight memory usage
> --------------------------------------------
>
>                 Key: PDFBOX-1808
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1808
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.2, 1.8.3
>         Environment: Windows 7
>                      Java jdk 1.7.0_45
>            Reporter: Guyenot Jeremy
>            Assignee: Andreas Lehmkühler
>            Priority: Critical
>              Labels: performance
>         Attachments: 1808-java char copyof.jpg, 1808-java char copyofrange.jpg, 1808-java usage.jpg, 1808-pdfbox usage.jpg, 1808-snapshot.nps, DOSSIER DE CANDIDATURE_001.pdf, Screenshot2014-01-21-19-51-24.png, netbeans_project.jpg, s5-1.png, s5-2.png, s50-1.png, s50-2.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Hello,
> I'm trying to extract text from PDFs, but I find that PDFTextStripper uses a lot of memory.
> With a PDF that has 2676 pages (4.6 MB in size) it uses 1.5 GB of memory.
> I also notice that the memory isn't freed after the getText method is called.
> You can see my code below:
> double virgule = Math.pow(10, 2);
> System.out.println("START - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> PDDocument cd = PDDocument.load(file);
> System.out.println("PDDocument getNumberOfPages - Nombre de pages: " + cd.getNumberOfPages());
> System.out.println("PDDocument load - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> String pdfText = "";
> try{
>     PDFTextStripper stripper = new PDFTextStripper();
>     pdfText = stripper.getText(cd);
>     System.out.println("PDFTextStripper getText - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     stripper.resetEngine();
>     stripper = null;
>     System.out.println("PDFTextStripper resetEngine - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> }
> finally{
>     if( cd!=null ){
>         cd.close();
>         cd = null;
>         System.out.println("PDDocument close - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     }
> }
> retour = new TextField(fieldName, pdfText, Field.Store.NO);
> System.out.println("TextField - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> And the result in my output window:
> START - Total memory (Mo): 95.0
> PDDocument getNumberOfPages - Nombre de pages: 2676
> PDDocument load - Total memory (Mo): 121.0
> PDFTextStripper getText - Total memory (Mo): 757.0
> PDFTextStripper resetEngine - Total memory (Mo): 757.0
> PDDocument close - Total memory (Mo): 757.0
> TextField - Total memory (Mo): 757.0
> pdfText - Total memory (Mo): 757.0
> I also tried calling System.gc(), but the memory usage stays the same.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
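
The used-memory calculation suggested in the comment (totalMem - freeMem) could be sketched roughly as follows. This is only an illustration: the class name MemoryStats and its helper methods are made up for this example and are not part of PDFBox or of the reporter's code; only the Runtime calls are standard JDK API. The sketch also divides by 1_000_000.0 rather than 1000000, since integer division would discard the decimals before the two-decimal rounding is applied.

```java
// Hypothetical helper, not part of PDFBox or of the code quoted in the issue.
final class MemoryStats {
    private MemoryStats() {}

    /** Heap the JVM currently holds (used plus free), in bytes. */
    static long totalBytes() {
        return Runtime.getRuntime().totalMemory();
    }

    /** Heap occupied by objects: total minus free, in bytes. */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Converts bytes to megabytes (10^6 bytes), rounded to two decimals. */
    static double toMegabytes(long bytes) {
        // Floating-point division keeps the fractional part for the rounding step.
        return Math.round(bytes / 1_000_000.0 * 100.0) / 100.0;
    }

    /** Prints used and total memory side by side under the given label. */
    static void report(String label) {
        System.out.println(label
                + " - used memory (MB): " + toMegabytes(usedBytes())
                + ", total memory (MB): " + toMegabytes(totalBytes()));
    }

    public static void main(String[] args) {
        report("START");
        byte[] block = new byte[20_000_000]; // stand-in for real work
        block[0] = 1;
        report("after allocation");
        block = null;
        System.gc(); // only a hint; the JVM may or may not collect here
        report("after gc hint");
    }
}
```

Calling MemoryStats.report(...) at the same points as the println calls in the quoted code would show used memory next to total memory. Total memory may well stay high after a collection, because the JVM does not necessarily hand reserved heap back; the used figure is the one that reflects what the application still holds.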