pdfbox-users mailing list archives

From Søren Pedersen <sh.peder...@gmail.com>
Subject Re: AW: Possible memory leak when extracting text?
Date Fri, 10 May 2019 05:22:51 GMT
I have uploaded the PDF here: https://we.tl/t-lQusIcUiRM

I have tested with both version 2.0.13 and 2.0.15 of PDFBox, and I have run the test on a
machine with 16 GB of RAM, where I allowed the JVM to use 14 GB via the -Xmx14g parameter.
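
For anyone reproducing this, a minimal sketch of how the heap ceiling could be confirmed from inside the JVM, using only the standard java.lang.management API. The class name HeapCheck and its helpers toMiB and describe are hypothetical names made up for this illustration; they are not part of PDFBox or of the application described here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper (not part of PDFBox or the app): reports the heap
// limits the running JVM actually sees, e.g. to confirm that -Xmx14g
// took effect inside the Docker container.
public class HeapCheck {

    // Convert a byte count to whole mebibytes (rounds down).
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Render a MemoryUsage snapshot as one line of text.
    // MemoryUsage.getMax() returns -1 when no maximum is defined.
    static String describe(MemoryUsage heap) {
        String max = heap.getMax() < 0
                ? "undefined"
                : toMiB(heap.getMax()) + " MiB";
        return "used=" + toMiB(heap.getUsed()) + " MiB"
                + ", committed=" + toMiB(heap.getCommitted()) + " MiB"
                + ", max=" + max;
    }

    public static void main(String[] args) {
        // Snapshot of the current heap usage of this JVM.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println(describe(heap));
    }
}
```

Run with the same -Xmx14g flag, the reported max should be close to 14336 MiB; depending on the garbage collector it may come out slightly lower.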

I took a heap dump using JVisualVM when it used approx. 12 GB of memory, and I can see that
98.3% of the size is taken up by int[] arrays. When I dig into those, they come from featuredIndices
in GlyphSubstitutionTable$LangSysTable -> langSysTable in GlyphSubstitutionTable$LangSysRecord
-> GlyphSubstitutionTable$LangSysRecord[].

I should also note that I run our app in a Docker container, like this:

docker run -d \
-p 8080:8080 \
-v /home/ec2-user/locate/build:/usr/build \
--name=locate \
openjdk:8 \
java -Xmx14g -Dserver.port=8080 -Dspring.profiles.active=prod -Djdk.tls.useExtendedMasterSecret=false \
-jar /usr/build/project-web-1.3.0.war

Thanks a lot in advance!

Best regards,

On 9 May 2019, 17.59 +0200, Tilman Hausherr <THausherr@t-online.de>, wrote:
> please upload to a sharehoster and also mention what version you are using,
> should be 2.0.15.
> Tilman
> --- Original message ---
> From: Søren Pedersen
> Subject: Possible memory leak when extracting text?
> Date: 09.05.2019, 17:07
> To: users@pdfbox.apache.org
> Hi there
>
> We have an application that can index the contents of PDF files, so that we
> can use that for a search algorithm. We use the Apache PDFBox library for
> extracting text from a PDF, like this (where inputStream is a
> ByteArrayInputStream containing the contents of the PDF file):
>
> PDFTextStripper pdfStripper = new PDFTextStripper();
> pdDoc = PDDocument.load(inputStream, MemoryUsageSetting.setupTempFileOnly());
> String parsedText = pdfStripper.getText(pdDoc);
>
> We ran into a sample PDF file, that seems to cause a memory leak, as we get
> an OutOfMemoryError: Java heap space. I have attached the file to this
> email (not sure if that works on a mailing list?)
>
> Can someone try to extract the text in this PDF file, to confirm if there
> is a memory leak, and maybe bring this to the attention of the developers?
>
> Thanks a lot in advance!
>
> Best regards,
> Søren
