pdfbox-dev mailing list archives

From "Tilman Hausherr (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-4389) Excessive load times for large pdfs
Date Wed, 05 Dec 2018 07:37:00 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709694#comment-16709694 ]

Tilman Hausherr commented on PDFBOX-4389:

I wasn't involved in this, but I doubt there will be any "empty space": nothing is given
back, because COSStreams are not closed until the file is closed.

> Excessive load times for large pdfs
> -----------------------------------
>                 Key: PDFBOX-4389
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4389
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.12
>         Environment: OpenJDK 10, Ubuntu
> pdfbox v2.0.12
> jbig2-imageio v3.0.2
>            Reporter: Ben Manes
>            Priority: Major
>         Attachments: PdfComponent.java
> We render preview images for pdfs being uploaded. This is usually quite fast, as often
these are short PDFs (e.g. shipments). One customer has a habit of uploading 6,000+ pages,
which I believe is their historicals. This can take a while, though I am currently seeing
over a minute per page:
> {{Processed page 940 / 1930 for pdf 1d2c0351-6c1f-4198-bd0b-6728927d7d00 within f1816bb9-3da2-4b61-a3d2-3ca9c419598e
in 1.443 min}}
> The operation is safely parallelized by reading the number of pages, enqueuing a task
per page index, opening the pdf in the task, and rendering the page index. Each task creates
a new {{MemoryUsageSetting}} with 2 MB of memory and unlimited disk. When monitoring this upload,
which will take 32 hours at this rate, the active scratch files are over 500 MB.
> {{$ du -h /tmp/cache_12639792278559363345/session_2059639776597126303/f1816bb9-3da2-4b61-a3d2-3ca9c419598e/component/pdf/pdfbox/1d2c0351-6c1f-4198-bd0b-6728927d7d00
| cut -f1 | sort -u}}
> {{2.3G}}
> {{4.0K}}
> {{524M}}
> {{531M}}
> {{552M}}
> {{653M}}
> When polling the stack traces, the threads appear to be spending most of their time on
expanding the temp file for the per-page task's loading of the pdf(s).
> Can you explain why this is so slow? My hope is that it could traverse to the page quickly,
render it, and close. In this case I might try refactoring to pool the opened documents instead
of loading anew, as previously the image rendering was a performance problem (since {{KcmsServiceProvider}} is
no longer available).
> ----
> java.lang.Thread.State: RUNNABLE
>  at java.io.RandomAccessFile.setLength(java.base@10.0.1/Native Method)
>  at org.apache.pdfbox.io.ScratchFile.enlarge(ScratchFile.java:245)
>  locked <0x00000006f6268cc0> (a java.lang.Object)
>  at org.apache.pdfbox.io.ScratchFile.getNewPage(ScratchFile.java:167)
>  locked <0x00000006f6268f10> (a java.util.BitSet)
>  at org.apache.pdfbox.io.ScratchFileBuffer.addPage(ScratchFileBuffer.java:126)
>  at org.apache.pdfbox.io.ScratchFileBuffer.ensureAvailableBytesInPage(ScratchFileBuffer.java:184)
>  at org.apache.pdfbox.io.ScratchFileBuffer.write(ScratchFileBuffer.java:236)
>  at org.apache.pdfbox.io.RandomAccessOutputStream.write(RandomAccessOutputStream.java:46)
>  at org.apache.pdfbox.cos.COSStream$2.write(COSStream.java:279)
>  at org.apache.pdfbox.pdfparser.COSParser.readValidStream(COSParser.java:1299)
>  at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1127)
>  at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:913)
>  at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>  at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>  at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>  at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>  at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>  at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>  at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:949)
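
The pooling idea from the issue description (keep opened documents around instead of loading the PDF anew for every page task) could be sketched roughly as below. This is only an illustration under stated assumptions, not PDFBox code: {{ResourcePool}}, {{use}} and {{openedCount}} are hypothetical names, and the pool is generic so that it runs with the JDK alone. Each pooled resource is lent to one thread at a time, on the assumption that a single opened document should not be accessed by several threads concurrently.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

/**
 * Hypothetical bounded pool: a resource is opened lazily, reused across
 * tasks, and lent to exactly one thread at a time.
 */
final class ResourcePool<T extends AutoCloseable> implements AutoCloseable {
    private final Queue<T> idle = new ConcurrentLinkedQueue<>();
    private final Semaphore permits;          // caps resources in use (and therefore opened)
    private final Supplier<T> opener;         // the expensive "load" step
    private final AtomicInteger opened = new AtomicInteger();

    ResourcePool(int capacity, Supplier<T> opener) {
        this.permits = new Semaphore(capacity);
        this.opener = opener;
    }

    /** Borrows a resource, runs the task with exclusive access, then returns it. */
    <R> R use(Function<T, R> task) throws InterruptedException {
        permits.acquire();
        try {
            T resource = idle.poll();
            if (resource == null) {
                // Nothing idle: pay the load cost once for this slot.
                resource = opener.get();
                opened.incrementAndGet();
            }
            try {
                return task.apply(resource);
            } finally {
                idle.add(resource);           // hand it back for the next task
            }
        } finally {
            permits.release();
        }
    }

    /** Number of times the opener actually ran. */
    int openedCount() {
        return opened.get();
    }

    /** Closes all idle resources; call once no tasks are running. */
    @Override
    public void close() throws Exception {
        T resource;
        while ((resource = idle.poll()) != null) {
            resource.close();
        }
    }
}
```

In PDFBox 2.0 terms, the opener would presumably wrap a call such as {{PDDocument.load(file, MemoryUsageSetting.setupMixed(2 * 1024 * 1024))}} (converting the {{IOException}} to an unchecked one), and the task would render one page index, e.g. via {{new PDFRenderer(doc).renderImage(pageIndex)}}. With a pool of N documents the file is parsed at most N times rather than once per page; whether that removes the time spent in {{ScratchFile.enlarge}} here would need to be measured.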

This message was sent by Atlassian JIRA
