pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: Memory usage when counting the number of pages and creating bookmarks for large PDFs
Date Sat, 09 Nov 2013 08:29:55 GMT
Hi,

there are some possible improvements:

# add the bookmarks to the source files upfront - they will be merged into the target
# use a scratch file when loading the PDFs, e.g. PDDocument.load(InputStream input, RandomAccess
scratchFile), so that temporary data is stored in a file instead of in memory, which lowers
memory consumption at runtime
# improve how the images are stored in the PDFs, e.g. by using a different compression
algorithm. This is more complicated, as you need to preprocess your PDFs, but it may be
worthwhile because it can also produce smaller result files.
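
For the scratch file suggestion, a minimal sketch against the 1.8.x API could look like the
following. The class name ScratchFilePageCount, the method countPages and the temp file prefix
are made up for illustration. It assumes org.apache.pdfbox.io.RandomAccessFile as the
RandomAccess implementation, and I have not measured how much heap it saves for image-heavy files.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdmodel.PDDocument;

public class ScratchFilePageCount {

    /** Counts the pages of a PDF while PDFBox keeps temporary data in a scratch file. */
    public static int countPages(File pdf) throws IOException {
        // Hypothetical temp file name; any writable location will do.
        File scratch = File.createTempFile("pdfbox-scratch", ".tmp");
        RandomAccessFile scratchFile = new RandomAccessFile(scratch, "rw");
        InputStream in = new FileInputStream(pdf);
        PDDocument document = null;
        try {
            // The overload mentioned above: load(InputStream, RandomAccess)
            document = PDDocument.load(in, scratchFile);
            return document.getNumberOfPages();
        } finally {
            if (document != null) {
                document.close();
            }
            in.close();
            // Closing again is harmless if PDFBox already closed the scratch file.
            scratchFile.close();
            scratch.delete();
        }
    }
}
```

The same overload can be used when opening the merged file to add the bookmarks.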

BR

Maruan Sahyoun

On 09.11.2013 at 07:11, Jesus Jr M Salvo <jesus.m.salvo@gmail.com> wrote:

> pdfbox-1.8.2
> tika-app-1.4 ( I'm including Apache Tika as I just found out that
> Apache Tika comes with pdfbox )
> 
> I have various existing PDFs that I need to merge into one PDF. The
> number of PDFs to be merged can vary .. anywhere from 2
> PDFs to 10 or 20 or more. The number of pages in each PDF to be
> merged can also vary. These PDFs are mostly documents scanned via an
> EDRMS like HP TRIM7 ... say, medical reports, etc .. that
> end up as PDFs. Thus, each page of the PDF is an image instead of
> text.
> 
> Merging them into a single PDF is no problem using the PDFMergerUtility.
> 
> After I have merged them into a single PDF, I then need to add
> bookmarks so that the person reading the PDF ( e.g. insurer, trustee )
> can quickly jump to a section of the merged PDF to see one of the
> merged PDFs.
> 
> The issue is the memory consumption .. the merged PDFs tend to be quite
> large ( anywhere from 200MB to 1GB ... again because each individual
> PDF was scanned via an EDRMS like HP TRIM7, so each page tends to be an
> image ). With multiple of these merges running in parallel, I
> can easily consume the entire heap allocated to the JVM.
> 
> To create the bookmarks, I have to open the large / merged PDF.
> 
> So the question is, is there a better way of creating bookmarks so
> that the amount of memory consumed is minimal ?
> 
> Note that I am making sure I am calling PDDocument.close() in a
> finally clause. See snippets below.
> 
> 
> 1) To create the bookmarks, I have to find out the number of pages in
> each PDF before they are merged. Something like in a loop:
> 
> PDDocument document = null;
> try {
>    document = PDDocument.load(aDownload.getLocalFile());
>    aDownload.setNumberOfPages( document.getNumberOfPages() );
> } finally {
>    if( document != null ) {
>        document.close();
>    }
> }
> 
> 2) Then I have to open the large / merged PDF file, then create the
> bookmarks using the number of pages as the guide from above ( And I
> also have to set the meta-data ... the author, date/time, subject on
> the PDF ):
> 
> private void finaliseDocument(
>         final File pdfFile,
>         final List<DocumentDownloadEntry> downloadEntries )
>         throws Exception
> {
>    logger.log(Level.INFO, String.format("Finalising PDF document %s", pdfFile.toString()));
>    PDDocument document = null;
>    try {
>        document = PDDocument.load(pdfFile);
>        document.getDocumentCatalog().setPageMode(PDDocumentCatalog.PAGE_MODE_USE_OUTLINES);
>        document.getDocumentInformation().setCreationDate(Calendar.getInstance());
>        document.getDocumentInformation().setAuthor(getUserName());
>        document.getDocumentInformation().setTitle(
>                getClaimDocuments().getEquipId() + " - " + getSubmissionType());
>        makeBookmarks( document, downloadEntries );
>        document.save(pdfFile);
>    } finally {
>        if( document != null ) {
>            document.close();
>        }
>    }
> }
> 
> private void makeBookmarks(
>         final PDDocument document,
>         final List<DocumentDownloadEntry> downloadEntries)
>         throws Exception
> {
>    PDDocumentOutline outline = new PDDocumentOutline();
>    document.getDocumentCatalog().setDocumentOutline( outline );
>    PDOutlineItem pagesOutline = new PDOutlineItem();
>    pagesOutline.setTitle( document.getDocumentInformation().getTitle() );
>    outline.appendChild( pagesOutline );
> 
>    @SuppressWarnings("rawtypes")
>    List pages = document.getDocumentCatalog().getAllPages();
>    int pageIndex = 0;
>    for( DocumentDownloadEntry aDownload : downloadEntries ) {
>        if( aDownload.isDownload() && aDownload.isDownloaded() ) {
>            PDPage page = (PDPage)pages.get( pageIndex );
>            pageIndex += aDownload.getNumberOfPages();
> 
>            PDPageFitWidthDestination dest = new PDPageFitWidthDestination();
>            dest.setPage( page );
>            PDOutlineItem bookmark = new PDOutlineItem();
>            bookmark.setDestination( dest );
> 
>            bookmark.setTitle( aDownload.getDocumentName() );
>            pagesOutline.appendChild( bookmark );
>        }
>    }
>    pagesOutline.openNode();
>    outline.openNode();
> }
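
The page arithmetic in the quoted makeBookmarks loop depends only on the page counts, so the
first page index of each source PDF in the merged file can be worked out without touching any
PDF. A small sketch in plain Java (the class BookmarkOffsets and the method firstPageIndexes are
hypothetical names, not part of PDFBox):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BookmarkOffsets {

    /**
     * Given the page count of each source PDF in merge order, returns the
     * zero-based index of the first page of each source in the merged PDF.
     */
    public static List<Integer> firstPageIndexes(List<Integer> pageCounts) {
        List<Integer> starts = new ArrayList<Integer>(pageCounts.size());
        int next = 0;
        for (int count : pageCounts) {
            starts.add(next);   // this source starts where the previous one ended
            next += count;
        }
        return starts;
    }

    public static void main(String[] args) {
        // Three sources with 3, 10 and 1 pages
        System.out.println(firstPageIndexes(Arrays.asList(3, 10, 1))); // prints [0, 3, 13]
    }
}
```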

