pdfbox-dev mailing list archives

From "Guyenot Jeremy (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
Date Mon, 20 Jan 2014 11:06:27 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guyenot Jeremy updated PDFBOX-1808:
-----------------------------------

    Attachment: netbeans_project.jpg

> PDFTextStripper.getText - hight memory usage
> --------------------------------------------
>
>                 Key: PDFBOX-1808
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1808
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.2, 1.8.3
>         Environment: Windows 7, Java JDK 1.7.0_45
>            Reporter: Guyenot Jeremy
>            Assignee: Andreas Lehmkühler
>            Priority: Critical
>              Labels: performance
>         Attachments: 1808-java char copyof.jpg, 1808-java char copyofrange.jpg,
>                      1808-java usage.jpg, 1808-pdfbox usage.jpg, 1808-snapshot.nps,
>                      DOSSIER DE CANDIDATURE_001.pdf, netbeans_project.jpg,
>                      s5-1.png, s5-2.png, s50-1.png, s50-2.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Hello,
> I'm trying to extract text from PDFs, but I find that PDFTextStripper uses a lot of memory.
> With a PDF of 2676 pages (4.6 MB in size), it uses 1.5 GB of memory.
> I also observe that the memory isn't freed after the getText method returns.
> You can see my code below:
> double virgule = Math.pow(10, 2);
> System.out.println("START - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> PDDocument cd = PDDocument.load(file);
> System.out.println("PDDocument getNumberOfPages - Nombre de pages: " + cd.getNumberOfPages());
> System.out.println("PDDocument load - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> String pdfText = "";
> try {
>     PDFTextStripper stripper = new PDFTextStripper();
>     pdfText = stripper.getText(cd);
>     System.out.println("PDFTextStripper getText - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     stripper.resetEngine();
>     stripper = null;
>     System.out.println("PDFTextStripper resetEngine - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> }
> finally {
>     if (cd != null) {
>         cd.close();
>         cd = null;
>         System.out.println("PDDocument close - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     }
> }
> retour = new TextField(fieldName, pdfText, Field.Store.NO);
> System.out.println("TextField - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> And the result in my output window:
> START - Total memory (Mo): 95.0
> PDDocument getNumberOfPages - Nombre de pages: 2676
> PDDocument load - Total memory (Mo): 121.0
> PDFTextStripper getText - Total memory (Mo): 757.0
> PDFTextStripper resetEngine - Total memory (Mo): 757.0
> PDDocument close - Total memory (Mo): 757.0
> TextField - Total memory (Mo): 757.0
> pdfText - Total memory (Mo): 757.0
> I also tried calling System.gc(), but the memory usage stays the same.
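
A note on the measurement in the quoted code: Runtime.totalMemory() reports the heap the JVM currently has reserved, not the amount occupied by objects, and a JVM often keeps that reservation after a garbage collection. The figures above may therefore stay at 757 even if the extracted data has been released. The following is only a minimal sketch of logging used heap (totalMemory() minus freeMemory()) next to the reserved heap. The class name MemoryProbe and its methods are hypothetical and are not part of the original report or of PDFBox.

```java
import java.util.Locale;

public class MemoryProbe {
    // Decimal megabytes, the same divisor (1000000) as in the quoted code.
    private static final long BYTES_PER_MB = 1_000_000L;

    /** Heap currently occupied by objects, collected or not, in bytes. */
    public static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Converts bytes to megabytes, rounded to two decimal places. */
    public static double toMegabytes(long bytes) {
        return Math.round(bytes / (double) BYTES_PER_MB * 100.0) / 100.0;
    }

    /** Builds one log line showing both used and reserved heap. */
    public static String report(String label) {
        long reserved = Runtime.getRuntime().totalMemory();
        return String.format(Locale.ROOT, "%s - used: %.2f MB, reserved: %.2f MB",
                label, toMegabytes(usedBytes()), toMegabytes(reserved));
    }

    public static void main(String[] args) {
        System.out.println(report("START"));

        // Stand-in for the text extraction step: allocate a large buffer.
        byte[] block = new byte[50_000_000];
        block[0] = 1; // touch the array so it is really in use
        System.out.println(report("After allocation"));

        // Drop the reference and request (not force) a collection.
        block = null;
        System.gc();
        System.out.println(report("After gc request"));
    }
}
```

Calling report(...) at the same points as the println statements in the quoted code would show whether used heap drops after close() while the reserved heap stays flat. System.gc() is only a hint to the JVM, so the numbers remain indicative; a heap dump or profiler snapshot, such as the attached 1808-snapshot.nps, gives a more reliable picture.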



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
