pdfbox-dev mailing list archives

From "Nicholas DiPiazza (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PDFBOX-3856) Non-large PDF's can cause Out of Memory Exceptions
Date Wed, 05 Jul 2017 19:36:00 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas DiPiazza updated PDFBOX-3856:
--------------------------------------
    Description: 
We run an application that makes PDFs searchable using Apache Tika, which in turn uses
PDFBox to parse each PDF and extract its body text.

We accept basically any PDF from any source, as long as the file isn't too large (9 MB limit).

However, we are noticing that some PDFs, even though they are not large in file size,
behave like zip bombs: they eat up all the heap space and crash the JVM.

There is an Object[] array that holds millions of {code}org.apache.pdfbox.text.TextPosition{code} instances.

Here is a snapshot of the heapdump: !Pasted image at 2017_07_05 02_26 PM.png|thumbnail!

Is there a setting to limit the size of this particular array so that it doesn't cause a memory
bomb?


  was:
We run an application that makes PDFs searchable using Apache Tika, which in turn uses
PDFBox to parse each PDF and extract its body text.

We accept basically any PDF from any source, as long as the file isn't too large (9 MB limit).

However, we are noticing that some PDFs, even though they are not large in file size,
behave like zip bombs: they eat up all the heap space and crash the JVM.

There is an Object[] array that holds millions of {code}org.apache.pdfbox.text.TextPosition{code} instances.

Here is a snapshot of the heapdump: !attachment-name.jpg|thumbnail!

Is there a setting to limit the size of this particular array so that it doesn't cause a memory
bomb?



> Non-large PDF's can cause Out of Memory Exceptions
> --------------------------------------------------
>
>                 Key: PDFBOX-3856
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3856
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Nicholas DiPiazza
>            Priority: Blocker
>         Attachments: Pasted image at 2017_07_05 02_26 PM.png
>
>
> We run an application that makes PDFs searchable using Apache Tika, which in turn uses
> PDFBox to parse each PDF and extract its body text.
> We accept basically any PDF from any source, as long as the file isn't too large (9 MB limit).
> However, we are noticing that some PDFs, even though they are not large in file size,
> behave like zip bombs: they eat up all the heap space and crash the JVM.
> There is an Object[] array that holds millions of {code}org.apache.pdfbox.text.TextPosition{code} instances.
> Here is a snapshot of the heapdump: !Pasted image at 2017_07_05 02_26 PM.png|thumbnail!
> Is there a setting to limit the size of this particular array so that it doesn't cause
> a memory bomb?
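
The report asks for a setting and does not name one. As an illustration of the kind of limit
being asked about, here is a minimal library-free sketch in Java of a caller-side cap: count
each text position as it is produced and abort once a threshold is passed, instead of letting
the array grow until the heap is exhausted. The names GlyphBudget and BudgetExceededException
are hypothetical and are not part of PDFBox or Tika. It is an assumption, to be checked against
the PDFBox version in use, that record() could be called from a subclass override of
PDFTextStripper.processTextPosition(TextPosition) before delegating to the superclass.

```java
/** Hypothetical guard (not a PDFBox class): caps how many items one extraction may buffer. */
final class GlyphBudget {

    /** Unchecked so it can propagate out of a callback that declares no checked exceptions. */
    static final class BudgetExceededException extends RuntimeException {
        BudgetExceededException(String message) {
            super(message);
        }
    }

    private final long limit;
    private long count;

    GlyphBudget(long limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.limit = limit;
    }

    /** Call once per text position, before it is stored; fails fast once past the cap. */
    void record() {
        count++;
        if (count > limit) {
            throw new BudgetExceededException(
                "more than " + limit + " text positions; aborting extraction");
        }
    }

    /** Number of record() calls since construction or the last reset(). */
    long count() {
        return count;
    }

    /** Call between pages if the cap is meant to apply per page rather than per document. */
    void reset() {
        count = 0;
    }
}
```

The caller would catch BudgetExceededException around the extraction call and treat the file as
rejected. A suitable threshold depends on the available heap and is left to the caller.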



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

