pdfbox-users mailing list archives

From Søren Pedersen <sh.peder...@gmail.com>
Subject Possible memory leak when extracting text?
Date Thu, 09 May 2019 15:07:31 GMT
Hi there

We have an application that indexes the contents of PDF files so that they can be searched.
We use the Apache PDFBox library to extract the text from each PDF, like this (where
inputStream is a ByteArrayInputStream containing the contents of the PDF file):

PDFTextStripper pdfStripper = new PDFTextStripper();
PDDocument pdDoc = PDDocument.load(inputStream, MemoryUsageSetting.setupTempFileOnly());
String parsedText = pdfStripper.getText(pdDoc);

We ran into a sample PDF file that seems to cause a memory leak: extracting its text fails with
an OutOfMemoryError: Java heap space. I have attached the file to this email (not sure if that
works on a mailing list).

Could someone try to extract the text from this PDF file to confirm whether there is a memory
leak, and maybe bring it to the attention of the developers?
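
In case it helps whoever tries this: one rough way to tell a leak from a single large
allocation is to run the extraction several times and compare the used heap before and after.
The sketch below is only an illustration under my own assumptions. HeapProbe and its methods
are hypothetical names, not part of PDFBox, and System.gc() is only a request to the JVM, so
the numbers are approximate.

```java
/** Hypothetical helper (not part of PDFBox): checks whether repeating a task keeps growing the heap. */
final class HeapProbe {

    /** Returns the used heap in bytes after requesting garbage collection. */
    static long usedHeapAfterGc() {
        Runtime rt = Runtime.getRuntime();
        // System.gc() is only a hint, so request it a few times to reduce noise.
        for (int i = 0; i < 3; i++) {
            System.gc();
        }
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs the task the given number of times and returns the change in used heap, in bytes. */
    static long growthAfterRuns(Runnable task, int runs) {
        long before = usedHeapAfterGc();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return usedHeapAfterGc() - before;
    }
}
```

The task passed in would wrap the three extraction lines above plus closing the document. If
the growth stays near zero over many runs, the problem is more likely the peak memory needed
for this one file than memory that is retained between runs.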

Thanks a lot in advance!

Best regards,
