Mailing-List: contact dev-help@pdfbox.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@pdfbox.apache.org
Date: Wed, 22 Jan 2014 09:41:20 +0000 (UTC)
From: "Timo Boehme (JIRA)" <jira@apache.org>
To: dev@pdfbox.apache.org
Message-ID: <JIRA.12684034.1386769838347.45972.1390383680989@arcas>
In-Reply-To: <JIRA.12684034.1386769838347@arcas>
References: <JIRA.12684034.1386769838347@arcas>
Subject: [jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - high memory usage
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878461#comment-13878461 ]

Timo Boehme commented on PDFBOX-1808:
-------------------------------------

[~jguyenot] please read up on what the memory statistics provided by Java mean. *Total memory* is (as the name says) all the heap memory the VM has currently reserved, whether or not it is in use. What you want is the memory actually used by your application, which has to be calculated as totalMemory() - freeMemory() (see e.g. http://stackoverflow.com/questions/3571203/what-is-the-exact-meaning-of-runtime-getruntime-totalmemory-and-freememory).
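
To illustrate the calculation, here is a minimal sketch. The class and method names (MemoryStats, usedBytes, toMegabytes) are hypothetical and not part of PDFBox or the reporter's code; only the Runtime calls are standard Java. Dividing by 1000000.0 instead of the integer 1000000 keeps the decimals that the rounding is meant to preserve.

```java
public class MemoryStats {

    /** Heap bytes currently occupied: memory reserved by the VM minus the free part of it. */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Converts bytes to megabytes using floating-point division. */
    static double toMegabytes(long bytes) {
        return bytes / 1000000.0;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long before = usedBytes();

        // Stand-in for the real workload (e.g. the text extraction); assumption for illustration.
        byte[] block = new byte[20000000];

        long after = usedBytes();
        System.out.printf("total (reserved by VM): %.2f MB%n", toMegabytes(rt.totalMemory()));
        System.out.printf("used before workload:   %.2f MB%n", toMegabytes(before));
        System.out.printf("used after workload:    %.2f MB%n", toMegabytes(after));
        System.out.println("workload size in bytes: " + block.length);
    }
}
```

Note that the used value also counts objects that are already unreachable but not yet collected, so it is an upper bound on the live data at that moment.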

> PDFTextStripper.getText - high memory usage
> -------------------------------------------
>
>                 Key: PDFBOX-1808
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1808
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.2, 1.8.3
>         Environment: Windows 7
> Java jdk 1.7.0_45
>            Reporter: Guyenot Jeremy
>            Assignee: Andreas Lehmkühler
>            Priority: Critical
>              Labels: performance
>         Attachments: 1808-java char copyof.jpg, 1808-java char copyofrange.jpg, 1808-java usage.jpg, 1808-pdfbox usage.jpg, 1808-snapshot.nps, DOSSIER DE CANDIDATURE_001.pdf, Screenshot2014-01-21-19-51-24.png, netbeans_project.jpg, s5-1.png, s5-2.png, s50-1.png, s50-2.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Hello,
> I'm trying to extract text from PDFs, but I find that PDFTextStripper uses a lot of memory.
> For a PDF with 2676 pages (4.6 MB in size) it uses 1.5 GB of memory.
> I also noticed that the memory isn't freed after the getText method is called.
> You can see my code below:
> double virgule = Math.pow(10, 2);
> System.out.println("START - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> PDDocument cd = PDDocument.load(file);
> System.out.println("PDDocument getNumberOfPages - Nombre de pages: " + cd.getNumberOfPages());
> System.out.println("PDDocument load - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> String pdfText = "";
> try{
>     PDFTextStripper stripper = new PDFTextStripper();
>     pdfText = stripper.getText(cd);
>     System.out.println("PDFTextStripper getText - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     stripper.resetEngine();
>     stripper = null;
>     System.out.println("PDFTextStripper resetEngine - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> }
> finally{
>     if( cd!=null ){
>         cd.close();
>         cd = null;
>         System.out.println("PDDocument close - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     }
> }
> retour = new TextField(fieldName, pdfText, Field.Store.NO);
> System.out.println("TextField - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> And the result in my output window:
> START - Total memory (Mo): 95.0
> PDDocument getNumberOfPages - Nombre de pages: 2676
> PDDocument load - Total memory (Mo): 121.0
> PDFTextStripper getText - Total memory (Mo): 757.0
> PDFTextStripper resetEngine - Total memory (Mo): 757.0
> PDDocument close - Total memory (Mo): 757.0
> TextField - Total memory (Mo): 757.0
> pdfText - Total memory (Mo): 757.0
> I also tried calling System.gc(), but the memory usage stays the same.

