pdfbox-users mailing list archives

From Arthur Wang <arthurwang2...@hotmail.com>
Subject Fw: Performance issue with PDFBox 2.0.8
Date Tue, 17 Apr 2018 21:40:25 GMT
Arthur Wang has shared OneDrive files with you. To view them, click the links below.

Herman & hiss - PPHI101201 - FV 1.pdf <https://1drv.ms/b/s!AhA_REgBppCpgQ062sb5LoKlZkC4>
fileListPage.png <https://1drv.ms/u/s!AhA_REgBppCpgQ7jWRqI5BtoKiMx>
downloadpage.png <https://1drv.ms/u/s!AhA_REgBppCpgQ9zgx9cBhmI2DfH>




Tilman,


Since my email was rejected because it exceeded the 1 MB size limit of the Apache mail server,
I am sending it again here.


First, thank you very much for the extra information and the update.


My application is an internal web-based production system. Designers in our graphics department
upload print-ready files to the system every hour, and other users (prepress, press, shipping,
customers) log into the system to download them. The print-ready PDF files are sometimes
extremely large: 5 MB to 1 GB is most common; 2 GB to 5 GB is rare but does happen. Please
refer to the two attached screenshots (fileListPage, downloadPage). What I am trying to do is
show a thumbnail on the download page. We used to show a download icon there instead, but then
users had to download the file to their local computer before they could actually see it.
Sometimes the fileListPage shows a long list of files and people get confused, so it would be
more convenient for users to get a peek at a file before actually downloading it. That is why
I would like a thumbnail on the download page. As for pdf.js, I have never tried it. Do you
think it can load a 40 MB or 50 MB file in one or two seconds when served by the Apache server?


I have copied my code below for your reference (one version is for testing, the other is for
production).


Attached you will also find a PDF file named Herman..pdf. It has only two pages, but even when
converting only the first page, the best I can do is 7 seconds. That would be very slow for a
web application. If adding a GPU could improve the performance, I would certainly like to try
it; I am just not sure whether it is going to work.
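
Note: since a thumbnail only depends on the uploaded file, the rendering cost could be paid once (for example at upload time) instead of on every page view, in the same spirit as the "file exist already" check in the earlier test program quoted further down. The following is only a minimal sketch under that assumption. `ThumbnailCache` and its `Renderer` interface are hypothetical names; the renderer stands in for whatever slow step produces the image (such as the PDFRenderer code in this thread).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: renders a thumbnail once and reuses the stored file afterwards. */
class ThumbnailCache {

    /** Stand-in for the slow rendering step; not part of PDFBox. */
    interface Renderer {
        void render(Path pdf, Path target) throws IOException;
    }

    private final Renderer renderer;

    ThumbnailCache(Renderer renderer) {
        this.renderer = renderer;
    }

    /** Returns the thumbnail path, rendering only if it is missing or older than the PDF. */
    Path thumbnailFor(Path pdf, Path thumbnail) throws IOException {
        if (Files.exists(thumbnail)
                && Files.getLastModifiedTime(thumbnail)
                        .compareTo(Files.getLastModifiedTime(pdf)) >= 0) {
            return thumbnail; // cache hit: nothing is rendered
        }
        // Render into a temporary file first so that nobody is served a half-written image.
        Path tmp = Files.createTempFile(thumbnail.toAbsolutePath().getParent(), "thumb", ".tmp");
        try {
            renderer.render(pdf, tmp);
            Files.move(tmp, thumbnail, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(tmp); // no-op after a successful move
        }
        return thumbnail;
    }
}
```

With something like this, the download page would only ever read an existing JPG, and a 7 second rendering would be paid once per upload rather than once per visitor.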


****************** below is the testing code, run from Eclipse **************


package com.test;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.tools.imageio.ImageIOUtil;

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import org.apache.commons.lang3.time.StopWatch;

public class PdfToImage {

    private static final String OUTPUT_DIR = "/Users/someone/Desktop/";

    public static void main(String[] args) throws Exception{

        System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");

        StopWatch stopwatch = new StopWatch();

        stopwatch.start();

        try (final PDDocument document = PDDocument.load(
                new File("/Users/someone/Desktop/Herman & hiss - PPHI101201 - FV.pdf"))){
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            pdfRenderer.setSubsamplingAllowed(true);
            //for (int page = 0; page < document.getNumberOfPages(); ++page)
            for (int page = 0; page < 1; ++page)
            {
                // <-- the dpi number below has a performance impact
                BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 72, ImageType.RGB);
                String fileName = OUTPUT_DIR + "Herman & hiss - PPHI101201 - FV" + page + ".jpg";
                ImageIOUtil.writeImage(bim, fileName, 72); //<---this number
            }
            // no explicit document.close() needed: try-with-resources closes the document
        } catch (IOException e){
            System.err.println("Exception while trying to create pdf document - " + e);
        }

         stopwatch.stop(); // optional
        System.out.println("Time elapsed is "+ stopwatch.getTime() + " milliseconds");


    }
    //test files: Ashley NJ_HHL101125_FV.pdf, 40M, 4 pages
    //try Ashley without the property set: 4 pages@70074 milliseconds
    //try Ashley with the property set: 4 pages@32552 milliseconds
    //try with subsampling set to true: 4 pages@9481 milliseconds
    //try Herman & hiss - PPHI101201 - FV.png: two pages@14050 milliseconds
    //try Herman & hiss - PPHI101201 - FV.jpg: two pages@13612 milliseconds
    //try Herman: 1 page@7625 milliseconds
    //try Ashley: 1 page@3237 milliseconds
    //try Ashley with 72 dpi: 1 page@2807 milliseconds
    //try Herman with 72 dpi: 1 page@6788 milliseconds
    //try Herman without subsampling set to true: 1 page@7087 milliseconds

}
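
Note: to put the timings in the comments above into perspective, here is a small sketch that turns them into speed-up factors. It assumes the 9481 ms run had the sun.java2d.cmm property set as well as subsampling allowed, as in the code above; the class name is made up for illustration.

```java
/** Speed-up factors computed from the timings recorded in the test comments. */
class SpeedupFromTimings {

    /** Ratio of the slower time to the faster time; 2.0 means "twice as fast". */
    static double speedup(long beforeMillis, long afterMillis) {
        if (beforeMillis <= 0 || afterMillis <= 0) {
            throw new IllegalArgumentException("timings must be positive");
        }
        return (double) beforeMillis / afterMillis;
    }

    public static void main(String[] args) {
        long baseline = 70074;       // Ashley, 4 pages, property not set
        long withKcms = 32552;       // Ashley, 4 pages, sun.java2d.cmm property set
        long withSubsampling = 9481; // Ashley, 4 pages, subsampling allowed (assumed: property set too)

        System.out.println("KCMS vs. baseline:        " + speedup(baseline, withKcms));
        System.out.println("subsampling vs. KCMS:     " + speedup(withKcms, withSubsampling));
        System.out.println("subsampling vs. baseline: " + speedup(baseline, withSubsampling));
    }
}
```

That works out to roughly 2.2x from the property, a further 3.4x from subsampling, and about 7.4x overall for the four-page Ashley file. For the single-page Herman run, subsampling made much less difference (7087 ms vs. 6788 ms).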



*****************below is production code running as an action class of struts *********


public void processPdf(String pdfFilePath, String imageFilePath){

        System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");

        try (final PDDocument document = PDDocument.load(new File(pdfFilePath))){
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            pdfRenderer.setSubsamplingAllowed(true);
            //for (int page = 0; page < document.getNumberOfPages(); ++page)
            for (int page = 0; page < 1; ++page)
            {
                BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 72, ImageType.RGB);

                ImageIOUtil.writeImage(bim, imageFilePath, 72);
            }
            // no explicit document.close() needed: try-with-resources closes the document
        } catch (IOException e){
            log.info("Exception while trying to create pdf document - " + e);
        }


    }


*********************



________________________________
From: Tilman Hausherr <THausherr@t-online.de>
Sent: Tuesday, April 17, 2018 10:39 AM
To: users@pdfbox.apache.org
Subject: Re: Performance issue with PDFBox 2.0.8

Hi,

I ran the Ashley file through the profiler; most of the time is used for
decoding the JPEG files within and converting some of them from CMYK to
RGB. Nothing to optimize. I also found another one-time initialization
that takes 100-300ms, which I will add to the next version of PDFDebugger.

     FilterFactory.INSTANCE.getFilter(COSName.FLATE_DECODE);

I also tested the UsePureJavaCMYKConversion, it made rendering much
slower. IIRC, that option only helps with files with many tiny CMYK images.

I have committed a change that adds the subsampling option to
PDFToImage, that version will be available within a few hours at
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.10-SNAPSHOT/
Look for today's date.

Or get the source code here:
https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFToImage.java?revision=1829374&view=markup

What type of application are you creating? If you want to show a PDF in
the browser, PDF.js works nicely, is free and included in firefox. If
you want to do thumbnails, then you should use a smaller dpi value. In
that case using subsampling would help even more.

Tilman
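
Note on the "smaller dpi value" advice: PDF page sizes are given in points (1/72 inch), so the dpi value decides how many pixels come out. The sketch below only does that arithmetic. It assumes the page width in points is already known (in PDFBox it would typically come from the page's crop box), and the class and method names are made up for illustration. PDFBox's own rounding may differ by a pixel.

```java
/** Picks a rendering DPI so that a page comes out at a wanted thumbnail width. */
class ThumbnailDpi {

    /** PDF user space units per inch (1 pt = 1/72 inch). */
    static final float POINTS_PER_INCH = 72f;

    /** DPI at which a page of the given width (in points) is targetWidthPx pixels wide. */
    static float dpiForWidth(float pageWidthPt, int targetWidthPx) {
        if (pageWidthPt <= 0 || targetWidthPx <= 0) {
            throw new IllegalArgumentException("width values must be positive");
        }
        return targetWidthPx * POINTS_PER_INCH / pageWidthPt;
    }

    /** Approximate pixel width of a page of the given width rendered at the given DPI. */
    static int pixelWidth(float pageWidthPt, float dpi) {
        return Math.round(pageWidthPt * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        float letterWidthPt = 612f; // US Letter (8.5 in); example value only
        float dpi = dpiForWidth(letterWidthPt, 200);
        System.out.println("dpi for a 200 px wide thumbnail: " + dpi);
        System.out.println("pixel width at 72 dpi: " + pixelWidth(letterWidthPt, 72f));
    }
}
```

The resulting value would then be passed as the dpi argument of renderImageWithDPI. For large print-ready pages the dpi needed for a small thumbnail can be far below 72, which is where subsampling should pay off most.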



Am 17.04.2018 um 07:26 schrieb Tilman Hausherr:
> Hi,
>
> I have a Ryzen 1700 cpu and for tests I'm running it on max energy
> settings. It is unclear if a mac has a similar setting.  This url
> http://www.macos.utah.edu/documentation/administration/pmset.html
> shows there is a setting for "better performance" but I don't know if
> that does the same as on Windows where I get a performance doubling.
> Try PDFDebugger, it has a built-in benchmark feature, it shows the
> rendering speed in the status line.
>
> I'm also avoiding that one-time initializations are part of the
> benchmark results with this code that is also in PDFDebugger:
>
>         // trigger premature initializations for more accurate rendering benchmarks
>         // See discussion in PDFBOX-3988
>         if (PDType1Font.COURIER.isStandard14())
>         {
>             // Yes this is always true
>             PDDeviceCMYK.INSTANCE.toRGB(new float[] { 0, 0, 0, 0} );
>             PDDeviceRGB.INSTANCE.toRGB(new float[] { 0, 0, 0 } );
>             IIORegistry.getDefaultInstance();
>         }
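
Note: the snippet above triggers specific initializations by hand. A more generic variant of the same idea (not what PDFDebugger does) is to run the whole task once untimed and only then measure. A minimal sketch, with a made-up class name and a placeholder workload instead of real rendering:

```java
import java.util.concurrent.TimeUnit;

/** Times a task after untimed warm-up runs, so one-time initialization is not counted. */
class WarmBenchmark {

    /** Runs the task warmupRuns times untimed, then returns the best of measuredRuns in ms. */
    static long bestMillis(Runnable task, int warmupRuns, int measuredRuns) {
        if (warmupRuns < 0 || measuredRuns < 1) {
            throw new IllegalArgumentException("need warmupRuns >= 0 and measuredRuns >= 1");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // pays for class loading and other one-time setup
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return TimeUnit.NANOSECONDS.toMillis(best);
    }

    public static void main(String[] args) {
        // Placeholder workload; in this thread's setting it would be the page rendering call.
        Runnable work = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unreachable");
            }
        };
        System.out.println("best of 5 after warm-up: " + bestMillis(work, 1, 5) + " ms");
    }
}
```

For a web application the first request after a restart still pays the one-time cost, so both numbers (cold and warm) are worth knowing.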
>
> I see you're using the PDFToImage utility. That one doesn't support
> subsampling yet, it has been on my "todo" list for a few days, I'll
> try to do it tonight... But PDFToImage is really just a command line
> utility.
>
> Args 7, 8 and 11 don't work that way. Re arg 7 and 8, you need to call
> System.setProperty(). Re arg 11, you need to have a PDFRenderer object.
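
Note on args 7 and 8: a "-D..." string is a JVM option and only takes effect on the java command line (before the class or -jar), or via System.setProperty() from code. Passed inside a program's own argument array it is just an ordinary string. A small JDK-only sketch to illustrate; the class name is made up, and the property key is the one from this thread:

```java
/** Shows that "-D..." inside a program's args array does not set a system property. */
class SystemPropertyDemo {

    static final String KEY = "org.apache.pdfbox.rendering.UsePureJavaCMYKConversion";

    public static void main(String[] args) {
        // args may contain "-D" + KEY and "true", as in the test program; the JVM ignores them.
        System.out.println("program args: " + String.join(" ", args));
        System.out.println("property before setProperty: " + System.getProperty(KEY));

        // This is what actually sets the property from inside the running JVM.
        System.setProperty(KEY, "true");
        System.out.println("property after setProperty:  " + System.getProperty(KEY));
    }
}
```

Unless the property was set some other way, the first lookup returns null no matter what is in args.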
>
> Another way to convert to images is explained here:
> https://stackoverflow.com/questions/23326562/apache-pdfbox-convert-pdf-to-images
>
>
> There, call pdfRenderer.setSubsamplingAllowed(true) to activate
> subsampling. PDFDebugger also supports it in the menu.
>
> Tilman
>
> Am 17.04.2018 um 01:20 schrieb Arthur Wang:
>> Tilman,
>>
>>
>> Thanks for the quick response and for testing my case. Below is my
>> Java code and my test result after allowing subsampling. For
>> the first page of the Ashley file, it took 3362 milliseconds.
>>
>> For the Gill file, the time elapsed was 2456 milliseconds.
>>
>> My tests were conducted on my Mac with a 2.2 GHz Core i7 processor.
>> How come your PC runs so fast? 1.4 seconds is fast enough for web
>> access. Maybe there is something wrong with my code? I would
>> appreciate it if you took a look at my code.
>>
>>
>> Best,
>>
>>
>> Arthur
>>
>>
>> *******************
>>
>> import org.apache.pdfbox.tools.PDFToImage;
>> //import java.awt.image.BufferedImage;
>> import java.io.File;
>> //import java.io.IOException;
>> //import java.io.OutputStream;
>> import org.apache.commons.lang3.time.StopWatch;
>>
>>
>> public class PdfToImage2 {
>>
>>      private static final String OUTPUT_DIR = "/Users/someone/Desktop/";
>>
>>      public static void main(String[] args) throws Exception{
>>
>>          String pdfPath = "/Users/someone/Desktop/Ashley NJ_HHL101125_FV.pdf";
>>          //config option 2:convert page 1 in pdf to image
>>          String [] args_1 =  new String[13];
>>          args_1[0] = "-startPage";
>>          args_1[1] = "1";
>>          args_1[2] = "-endPage";
>>          args_1[3] = "1";
>>          args_1[4] = "-outputPrefix";
>>          args_1[5] = OUTPUT_DIR+"Ashley NJ_HHL101125_FV1";
>>          args_1[6] = pdfPath;
>>          args_1[7] = "-Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion";
>>          args_1[8] = "true";
>>          args_1[9] = "-dpi";
>>          args_1[10] = "72";//@48-->3283 milliseconds, @96-->3545 milliseconds, @72-->3362 milliseconds
>>          args_1[11] = "-PDFRenderer.setSubsamplingAllowed";
>>          args_1[12] = "true";
>>
>>          File f = new File(args_1[5]+"1.jpg");
>>          if(f.exists() && !f.isDirectory()) {
>>              System.out.println("file exist already");
>>          }
>>          else{
>>
>>              StopWatch stopwatch = new StopWatch();
>>
>>              stopwatch.start();
>>
>>              try {
>>
>>                  System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
>>                  PDFToImage.main(args_1);
>>                  System.out.println("Done!");
>>              } catch (Exception e) {
>>                  System.err.println("Exception while trying to create pdf document - " + e);
>>              }
>>
>>              stopwatch.stop(); // optional
>>              System.out.println("Time elapsed is "+ stopwatch.getTime() + " milliseconds");
>>
>>
>>          }//else
>>
>>          //first try without setting property: 3779 milliseconds
>>          //second try with the property set: 3852 milliseconds
>>          //third try with subsamplingAllowed: 3362 milliseconds
>>
>>      }
>> }
>>
>> *******************************
>>
>> ________________________________
>> From: Tilman Hausherr <THausherr@t-online.de>
>> Sent: Monday, April 16, 2018 10:55 AM
>> To: users@pdfbox.apache.org
>> Subject: Re: Performance issue with PDFBox 2.0.8
>>
>> The java code didn't get through, most attachments get deleted. Call
>> PDFRenderer.setSubsamplingAllowed(true) to activate subsampling.
>>
>> I had a look at your files... These are not extremely slow renderings. 4
>> seconds for such a page is pretty good.
>>
>> On my PC, the first page of the Ashley file is rendered in PDFDebugger
>> in 1.4 seconds at 72dpi. The Gill file is done in less than a second.
>>
>> Tilman
>>
>> Am 16.04.2018 um 19:05 schrieb Arthur Wang:
>>> Arthur Wang has shared OneDrive files with you. To view them, click
>>> the links below.
>>>
>>> <https://1drv.ms/b/s%21AhA_REgBppCpgQluAoJe28B935ru>
>>> Ashley NJ_HHL101125_FV.pdf
>>> <https://1drv.ms/b/s%21AhA_REgBppCpgQluAoJe28B935ru>
>>>
>>>
>>> <https://1drv.ms/b/s%21AhA_REgBppCpgQpdnBIl_hmK6Wt0>
>>>
>>> Gill1-1356_KM102685-INS_FV.pdf
>>> <https://1drv.ms/b/s%21AhA_REgBppCpgQpdnBIl_hmK6Wt0>
>>>
>>>
>>> <https://1drv.ms/u/s%21AhA_REgBppCpgQvygYjm2eaJQmSH>
>>>
>>> Screen Shot 2018-04-16 at 9.23.52 AM.png
>>> <https://1drv.ms/u/s%21AhA_REgBppCpgQvygYjm2eaJQmSH>
>>>         [Screen Shot 2018-04-16 at 9.23.52 AM.png]
>>>
>>> I just tried 2.0.9 and it works almost the same. Processing all 4 pages
>>> takes 32 seconds; processing only the first page takes about 4
>>> seconds.
>>>
>>>
>>> My server is an HP DL380 with dual Xeon processors and 32 GB RAM; the hard
>>> drive is an Intel Optane NVMe SSD.
>>>
>>> Once the JPG image is produced, access to the image is almost
>>> instant regardless of the size of the image file, so the time to open
>>> and close the image file is insignificant and can be ignored.
>>>
>>>
>>> By enabling subsampling, do you mean setting the dpi option? Do you
>>> have sample code for PDFRenderer? The attached file
>>> PdfToImage2.java is my testing code. Ashley...pdf is a file of
>>> about 45 MB, and Gill...pdf is a file of about 5 MB. With
>>> 1/10th of the size of the other one, the processing time is cut down to
>>> 2657 milliseconds compared to 3779 milliseconds. It seems that the size
>>> does matter.
>>>
>>>
>>> thanks,
>>>
>>>
>>> Arthur
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> *From:* Tilman Hausherr <THausherr@t-online.de>
>>> *Sent:* Monday, April 16, 2018 8:57 AM
>>> *To:* users@pdfbox.apache.org
>>> *Subject:* Re: Performance issue with PDFBox 2.0.8
>>> Please
>>> - retry with the current version 2.0.9
>>> - share your file for a profiler analysis
>>> - as said by Itai (who implemented it) try enabling subsampling in
>>> PDFRenderer (read the javadoc first). Compare the results and decide
>>> whether the quality is OK for you.
>>> - set the energy settings of your computer to maximum or at least to
>>> "balanced", not to "energy save"
>>> - don't know if adding GPU will help;
>>> - try also the
>>> -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true option
>>>
>>> The speed is not related to the size but to the complexity. 32 seconds
>>> may sound disappointing but it's not the worst I've ever seen. "Nice
>>> illustrations" with nested patterns or large shadings may be slow.
>>>
>>> Tilman
>>>
>>> Am 16.04.2018 um 09:21 schrieb Arthur Wang:
>>>> Hi, everyone,
>>>>
>>>>
>>>>
>>>> I am using PDFBox 2.0.8 and Java 8 running in Tomcat 8 in
>>>> production to convert PDFs into images for display. It works very well
>>>> for PDF files smaller than 5 MB, taking about 3800 milliseconds.
>>>> However, it slows down considerably when the file size increases to 50
>>>> MB: it takes about 70,000 milliseconds. After setting the system property
>>>> "sun.java2d.cmm" to "sun.java2d.cmm.kcms.KcmsServiceProvider", the time
>>>> drops to 32550 milliseconds, which is roughly double
>>>> the speed. But 32 seconds to load a web page is still too slow. Is
>>>> there any other way to improve the performance? Would adding a GPU
>>>> to the server help? Or could any other software or
>>>> hardware solution improve the processing speed? My current
>>>> server comes with 32 GB RAM, and it has never used more than half
>>>> of it.
>>>>
>>>> thanks,
>>>>
>>>>
>>>> Arthur
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

