pdfbox-dev mailing list archives

From Andreas Lehmkühler (JIRA) <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
Date Sat, 04 Jan 2014 12:12:52 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862278#comment-13862278 ]

Andreas Lehmkühler commented on PDFBOX-1808:
--------------------------------------------

I use Java VisualVM, which ships with the JDK, as a profiler. It offers many monitoring features;
for example, it lists all live objects, which makes it easy to check whether they can be
garbage-collected.

In my environment all objects were released, at least after triggering the GC.
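
The same check can be approximated in code. The report quoted below prints Runtime.totalMemory(), which is the heap the JVM currently has reserved, not the space occupied by objects, so that figure can stay flat even after objects have been released. A minimal sketch, assuming a hypothetical helper class HeapProbe (not part of PDFBox) and a byte array standing in for the extracted text:

```java
import java.util.Locale;

/** Hypothetical helper (not part of PDFBox): reserved vs. used heap. */
final class HeapProbe {
    private static final double BYTES_PER_MB = 1_000_000.0;

    /** Heap currently reserved by the JVM, in MB. */
    static double reservedMb() {
        return Runtime.getRuntime().totalMemory() / BYTES_PER_MB;
    }

    /** Heap occupied by objects (live or not yet collected), in MB. */
    static double usedMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / BYTES_PER_MB;
    }

    /** One report line containing both figures. */
    static String report(String label) {
        return String.format(Locale.ROOT, "%s - reserved: %.2f MB, used: %.2f MB",
                label, reservedMb(), usedMb());
    }

    public static void main(String[] args) {
        System.out.println(report("START"));
        byte[] block = new byte[20_000_000]; // stand-in for extracted text
        block[0] = 1;
        System.out.println(report("after allocation"));
        block = null;  // drop the only reference
        System.gc();   // a request only; the JVM may ignore it
        System.out.println(report("after GC request"));
    }
}
```

If the "used" figure drops after the GC request while "reserved" stays put, the objects were released even though the total did not shrink. A profiler such as VisualVM remains the more reliable way to confirm this.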

> PDFTextStripper.getText - hight memory usage
> --------------------------------------------
>
>                 Key: PDFBOX-1808
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1808
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.2, 1.8.3
>         Environment: Windows 7
> Java jdk 1.7.0_45
>            Reporter: Guyenot Jeremy
>            Assignee: Andreas Lehmkühler
>            Priority: Critical
>              Labels: performance
>         Attachments: 1808-java char copyof.jpg, 1808-java char copyofrange.jpg,
>                      1808-java usage.jpg, 1808-pdfbox usage.jpg, 1808-snapshot.nps,
>                      DOSSIER DE CANDIDATURE_001.pdf, s5-1.png, s5-2.png, s50-1.png, s50-2.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Hello,
> I'm trying to extract text from PDFs, but I find that PDFTextStripper uses a lot of memory.
> With a PDF that has 2676 pages (4.6 MB in size), it uses 1.5 GB of memory.
> I also observe that the memory isn't freed after the getText method is called.
> You can see my code below:
> double virgule = Math.pow(10, 2);
> System.out.println("START - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> PDDocument cd = PDDocument.load(file);
> System.out.println("PDDocument getNumberOfPages - Nombre de pages: " + cd.getNumberOfPages());
> System.out.println("PDDocument load - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> String pdfText = "";
> try{
>     PDFTextStripper stripper = new PDFTextStripper();
>     pdfText = stripper.getText(cd);
>     System.out.println("PDFTextStripper getText - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     stripper.resetEngine();
>     stripper = null;
>     System.out.println("PDFTextStripper resetEngine - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> }
> finally{
>     if( cd!=null ){
>         cd.close();
>         cd = null;
>         System.out.println("PDDocument close - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     }
> }
> retour = new TextField(fieldName, pdfText, Field.Store.NO);
> System.out.println("TextField - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> And the result in my output window:
> START - Total memory (Mo): 95.0
> PDDocument getNumberOfPages - Nombre de pages: 2676
> PDDocument load - Total memory (Mo): 121.0
> PDFTextStripper getText - Total memory (Mo): 757.0
> PDFTextStripper resetEngine - Total memory (Mo): 757.0
> PDDocument close - Total memory (Mo): 757.0
> TextField - Total memory (Mo): 757.0
> pdfText - Total memory (Mo): 757.0
> I also tried calling System.gc(), but the memory usage stayed the same.
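
A note on the arithmetic in the quoted code: the expression Math.round((totalMemory()/1000000) * virgule) / virgule divides a long by an int, so the value is truncated to whole megabytes before the two-decimal scaling is applied. That is consistent with every reading in the output ending in .0. A minimal sketch of the difference, assuming a hypothetical helper class MemoryFormat and a made-up byte count:

```java
/** Hypothetical helper, not part of the report or of PDFBox. */
final class MemoryFormat {

    /** Mirrors the report's expression: long division truncates first. */
    static double truncatedMb(long bytes) {
        double virgule = Math.pow(10, 2);
        return Math.round((bytes / 1000000) * virgule) / virgule;
    }

    /** Divides in floating point so that two decimals survive. */
    static double roundedMb(long bytes) {
        return Math.round(bytes / 1_000_000.0 * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        long sample = 757_123_456L; // made-up byte count for illustration
        System.out.println(truncatedMb(sample)); // prints 757.0
        System.out.println(roundedMb(sample));   // prints 757.12
    }
}
```

This only affects the precision of the printed numbers; it does not change what totalMemory() measures.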



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
