pdfbox-users mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: Re[10]: PDFRenderer, PDDocument memory issue
Date Wed, 01 Jul 2015 14:31:39 GMT


> On 1 Jul 2015, at 05:15, Alex Sviridov <ooo_saturn7@mail.ru> wrote:
> 
> Ok. Thank you again. I just don't understand one thing. What is the reason to keep such
> large data if I only need to take page images and, most importantly, I DO IT BY PAGE?
> 
> Is there no way to avoid keeping data for previous pages if I only need data for page N?

Try profiling PDFBox to see what that data actually is. We don't cache page resources anymore.
It could be cached stream data, or fonts, perhaps.

-- John
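
Before reaching for a full profiler, a rough first check is to log heap usage between pages. The sketch below uses only the standard `Runtime` class; `HeapProbe` and the allocation in `main` are illustrative stand-ins, not PDFBox code. A heap histogram (for example `jmap -histo:live <pid>`) would then show which classes actually hold the memory.

```java
/** Coarse heap measurement; a stand-in for a real profiler, for illustration only. */
class HeapProbe {
    /** Returns the number of heap bytes currently in use by this JVM. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();
        byte[] standIn = new byte[16 * 1024 * 1024]; // stands in for one rendered page
        long after = usedHeapBytes();
        System.out.printf("heap changed by about %d KiB while holding %d bytes%n",
                (after - before) / 1024, standIn.length);
    }
}
```

Calling `usedHeapBytes()` before and after each rendered page gives a first hint of whether memory grows per page or only once at load time.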

> Wednesday, 1 July 2015, 14:08 +02:00, from Andreas Lehmkühler <andreas@lehmi.de>:
>> 
>> 
>>> Alex Sviridov <ooo_saturn7@mail.ru> wrote on 1 July 2015 at 13:59:
>>> 
>>> 
>>> Ok. Thank you very much for the explanation. Could you say where this scratch
>>> file is located on Linux/Windows?
>> java.io.File.createTempFile is used to create that file. It uses the default
>> temp directory. That is "/tmp" on Linux. I'm not sure about Windows, as different
>> environment variables (TMP, TEMP, USERPROFILE, ....) are used to search for such
>> a directory.
>> 
>> You may define your own temp directory using the following parameter when
>> starting your application:
>> 
>> -Djava.io.tmpdir=PATH-TO-YOUR-TEMP
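
A minimal sketch for checking where temporary files end up on a given machine. It relies only on the standard `java.io.tmpdir` system property and `File.createTempFile`; the class name `TempDirProbe` is made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Reports the directory that File.createTempFile uses by default. */
class TempDirProbe {
    /** Returns the value of the java.io.tmpdir system property. */
    static String defaultTempDir() {
        return System.getProperty("java.io.tmpdir");
    }

    /** Creates a throwaway temp file and returns the directory it landed in. */
    static File probe() {
        try {
            File f = File.createTempFile("probe", ".tmp");
            f.deleteOnExit();
            return f.getParentFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("java.io.tmpdir = " + defaultTempDir());
        System.out.println("temp files land in " + probe());
    }
}
```

Running it once without and once with the `-Djava.io.tmpdir` parameter shows whether the override takes effect.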
>> 
>> 
>>> 
>>> 
>>> Wednesday, 1 July 2015, 13:54 +02:00, from Andreas Lehmkühler <andreas@lehmi.de>:
>>>>> Alex Sviridov <ooo_saturn7@mail.ru> wrote on 1 July 2015 at 13:38:
>>>>> 
>>>>> 
>>>>> The file is here  https://yadi.sk/i/Y0fTuvHmhbZiE
>>>> Ah, that explains a lot. The PDF is a scanned document; every page holds a
>>>> color image, consuming a lot of memory when processed.
>>>> 
>>>>> I tried load(fileName, true). As a result, I no longer have memory
>>>>> problems. However, I now have 2 other problems:
>>>>> 
>>>>> 1) All the thumbnail images load, but the speed is VERY SLOW.
>>>>> One thumbnail image takes about 4 seconds to load!
>>>> When it comes to huge PDFs, you have to accept one of two evils: either you
>>>> provide enough memory to do all the work in memory (fast), or you use a
>>>> scratch file to save memory (slow).
>>>> 
>>>> And yes, there is room to improve the memory handling in PDFBox (read on
>>>> demand, remove after usage), but that is a future feature. Patches are
>>>> welcome.
>>>> 
>>>>> 2) Besides, as you can see, the thumbnail images are loaded in a separate
>>>>> thread. While this thread is running, I try to get a big image for the main
>>>>> content using
>>>>> BufferedImage bi=pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
>>>>> and I get the following exception:
>>>>> 
>>>>> java.io.IOException: java.util.zip.DataFormatException: unknown compression method
>>>>>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
>>>>>     at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
>>>>>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
>>>>>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
>>>>>     at org.apache.pdfbox.cos.COSStream.checkUnfilteredBuffer(COSStream.java:265)
>>>>>     at org.apache.pdfbox.cos.COSStream.getUnfilteredRandomAccess(COSStream.java:239)
>>>>>     at org.apache.pdfbox.pdfparser.BaseParser.<init>(BaseParser.java:146)
>>>>>     at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:78)
>>>>>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:451)
>>>>>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
>>>>>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>>>>>     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
>>>>>     at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
>>>>>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
>>>>>     at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
>>>>>   ....
>>>>>     at javafx.concurrent.Task$TaskCallable.call(Task.java:1423)
>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>> Caused by: java.util.zip.DataFormatException: unknown compression method
>>>>>     at java.util.zip.Inflater.inflateBytes(Native Method)
>>>>>     at java.util.zip.Inflater.inflate(Inflater.java:259)
>>>>>     at java.util.zip.Inflater.inflate(Inflater.java:280)
>>>>>     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:101)
>>>>>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
>>>>>     ... 20 more
>>>>> 
>>>>> How can I solve these problems?
>>>> PDFBox isn't supposed to be thread safe.
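
One way to live with that, sketched under assumptions: route every render call through a single worker thread so that thumbnail and full-size requests never touch the document at the same time. `SerializedRenderer` and its `renderPage` function are hypothetical names; in the application above, the function would wrap `pdfRenderer.renderImageWithDPI(page, dpi, ImageType.RGB)` and handle its `IOException`.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

/** Funnels every render call through one worker thread so calls never overlap. */
class SerializedRenderer<T> implements AutoCloseable {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final IntFunction<T> renderPage; // hypothetical wrapper around the real render call

    SerializedRenderer(IntFunction<T> renderPage) {
        this.renderPage = renderPage;
    }

    /** Queues a page; queued requests run one at a time, in submission order. */
    Future<T> submit(int page) {
        return worker.submit(() -> renderPage.apply(page));
    }

    /** Queues a page and blocks the caller until it has been rendered. */
    T renderAndWait(int page) {
        try {
            return submit(page).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for page " + page, e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("rendering page " + page + " failed", e.getCause());
        }
    }

    /** Stops accepting new work; already queued pages still finish. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

Both the thumbnail loop and the main-content view would share one `SerializedRenderer` instance. `renderAndWait` blocks the caller, so in a JavaFX application it should not be called from the application thread.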
>>>> 
>>>>> 
>>>>> 
>>>>> Wednesday, 1 July 2015, 13:17 +02:00, from Andreas Lehmkühler <andreas@lehmi.de>:
>>>>>> 
>>>>>> 
>>>>>>> Alex Sviridov <ooo_saturn7@mail.ru> wrote on 1 July 2015 at 13:09:
>>>>>>> 
>>>>>>> 
>>>>>>> I decided to show all the code. I am also sending the PDF file, some file
>>>>>>> from the internet that I use for testing.
>>>>>> The attachment didn't make it due to some restrictions on the mailing list.
>>>>>> Please post a link to the original source or another place where we can
>>>>>> download the PDF in question.
>>>>>> 
>>>>>>> 
>>>>>>> import javafx.application.Platform;
>>>>>>> import javafx.concurrent.Task;
>>>>>>> import javafx.geometry.Pos;
>>>>>>> import javafx.scene.control.Label;
>>>>>>> import javafx.scene.image.ImageView;
>>>>>>> import javafx.scene.image.WritableImage;
>>>>>>> import javafx.scene.layout.VBox;
>>>>>>> 
>>>>>>> // Background task: builds one thumbnail node per page, then hands it to the FX thread.
>>>>>>> Task<Void> task = new Task<Void>() {
>>>>>>>     @Override protected Void call() throws Exception {
>>>>>>>         for (int i = 0; i < model.getTotalPages(); i++) {
>>>>>>>             System.out.println("Point a:" + i);
>>>>>>>             WritableImage writableImage = model.getPageThumbImage(i);
>>>>>>>             System.out.println("Point b:" + i);
>>>>>>>             ImageView imageView = new ImageView(writableImage);
>>>>>>>             System.out.println("Point c:" + i);
>>>>>>>             Label label = new Label(Integer.toString(i + 1));
>>>>>>>             System.out.println("Point d:" + i);
>>>>>>>             final VBox vBox = new VBox(imageView, label);
>>>>>>>             System.out.println("Point e:" + i);
>>>>>>>             vBox.setAlignment(Pos.CENTER);
>>>>>>>             vBox.setStyle("-fx-padding:5px 5px 5px 5px;-fx-background-color:red");
>>>>>>>             System.out.println("Point f:" + i);
>>>>>>>             // Scene-graph changes must happen on the JavaFX application thread.
>>>>>>>             Platform.runLater(() -> thumbFlowPane.getChildren().add(vBox));
>>>>>>>         }
>>>>>>>         return null;
>>>>>>>     }
>>>>>>> };
>>>>>>> new Thread(task).start();
>>>>>>> 
>>>>>>> And here is the tail of the output
>>>>>>> ....
>>>>>>> Point a:30
>>>>>>> Point b:30
>>>>>>> Point c:30
>>>>>>> Point d:30
>>>>>>> Point e:30
>>>>>>> Point f:30
>>>>>>> Point a:31
>>>>>>> 
>>>>>>> What is a scratch file? Sorry, I don't understand.
>>>>>> 
>>>>>> PDFBox holds a lot of temporary data in memory. To reduce the memory
>>>>>> footprint, one can choose to use a scratch file instead, so that some/most
>>>>>> of that data will be held in a file.
>>>>>> 
>>>>>> To do so, simply use another load method, e.g.
>>>>>> 
>>>>>> load(File file, boolean useScratchFiles)
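
To make the idea concrete, here is a toy sketch of a scratch store. It is not PDFBox's implementation; `ScratchStore` and its methods are invented for illustration and use only `java.io`. Blocks are appended to a temp file and read back on demand, which trades heap usage for disk I/O.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

/** Illustrative only: parks byte blocks in a temp file instead of on the heap. */
class ScratchStore implements AutoCloseable {
    private final File file;
    private final RandomAccessFile raf;

    ScratchStore() {
        try {
            file = File.createTempFile("scratch", ".bin");
            file.deleteOnExit();
            raf = new RandomAccessFile(file, "rw");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends a block and returns the file offset needed to read it back. */
    long write(byte[] block) {
        try {
            long offset = raf.length();
            raf.seek(offset);
            raf.write(block);
            return offset;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads length bytes starting at offset back into memory. */
    byte[] read(long offset, int length) {
        try {
            byte[] out = new byte[length];
            raf.seek(offset);
            raf.readFully(out);
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Closes the file handle and deletes the scratch file. */
    @Override
    public void close() {
        try {
            raf.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            file.delete();
        }
    }
}
```

Only the offsets stay on the heap; every `read` goes back to disk, which is why a file-backed approach is slower than keeping everything in memory.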
>>>>>>> 
>>>>>>> Wednesday, 1 July 2015, 13:04 +02:00, from Andreas Lehmkühler <andreas@lehmi.de>:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> Alex Sviridov <ooo_saturn7@mail.ru> wrote on 1 July 2015 at 12:58:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thank you for the answer. I tried
>>>>>>>>> pdfbox-app-2.0.0-20150630.220424-1464.jar;
>>>>>>>>> the result is the same.
>>>>>>>>> 
>>>>>>>>> When I create the images, I add them to a JavaFX FlowPane. However, the
>>>>>>>>> problem is not in the images because, I repeat, I get 400 MB back when I do
>>>>>>>>> pdfDocument=null,pdfRenderer=null.
>>>>>>>>> 
>>>>>>>>> Besides, when I do pdfDocument = PDDocument.load(new File(fileName)) I
>>>>>>>>> don't have any problems with memory.
>>>>>>>>> 
>>>>>>>>> I get the memory problem when I run getPageThumbImage in a for loop.
>>>>>>>>> 
>>>>>>>>> I am sure that the problem is in PDFBox. Please help me.
>>>>>>>> Maybe, but I'm not sure at all.
>>>>>>>> 
>>>>>>>> Try to use the scratch file.
>>>>>>>> 
>>>>>>>>> Wednesday, 1 July 2015, 12:48 +02:00, from Andreas Lehmkühler <andreas@lehmi.de>:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> Alex Sviridov <ooo_saturn7@mail.ru> wrote on 1 July 2015 at 10:16:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I want to display all page thumbnails. However, I ran into a memory
>>>>>>>>>>> problem with PDFRenderer or PDDocument; I don't know which one.
>>>>>>>>>>> 
>>>>>>>>>>> I have the following code:
>>>>>>>>>>>    ....
>>>>>>>>>>>     private PDDocument pdfDocument;
>>>>>>>>>>>     
>>>>>>>>>>>     private PDFRenderer pdfRenderer;
>>>>>>>>>>> 
>>>>>>>>>>>     public WritableImage getPageThumbImage(int page){
>>>>>>>>>>>         WritableImage result=null;
>>>>>>>>>>>         try {
>>>>>>>>>>>             BufferedImage bi=pdfRenderer.renderImageWithDPI(page, 12, ImageType.RGB);
>>>>>>>>>>>             result=SwingFXUtils.toFXImage(bi, null);
>>>>>>>>>>>         } catch (IOException ex) {
>>>>>>>>>>>              ....
>>>>>>>>>>>         }
>>>>>>>>>>>         return result;
>>>>>>>>>>>     }
>>>>>>>>>>>  .....
>>>>>>>>>>> I run the method getPageThumbImage in a for loop for every page. I set the
>>>>>>>>>>> Java heap size to 500 MB.
>>>>>>>>>>> I can get about 30 images using getPageThumbImage (if I set more memory, I
>>>>>>>>>>> get more).
>>>>>>>>>>> In my application I have real-time memory graphs, and they show that memory
>>>>>>>>>>> fills up very fast.
>>>>>>>>>>> When there is no more free memory, getPageThumbImage hangs: no exception,
>>>>>>>>>>> nothing. The code just stops.
>>>>>>>>>>> When I do pdfDocument=null,pdfRenderer=null I get about 400 MB of free
>>>>>>>>>>> memory back.
>>>>>>>>>>> How can I solve this problem?
>>>>>>>>>> There are 2 possible issues and maybe both are relevant.
>>>>>>>>>> 
>>>>>>>>>> 1. PDFBox consumes more or less memory to load a PDF depending on the size
>>>>>>>>>> and the content of the PDF.
>>>>>>>>>> 
>>>>>>>>>> - Are you using the latest 2.0.0-SNAPSHOT? There were some improvements
>>>>>>>>>> concerning the memory footprint lately.
>>>>>>>>>> - Try to use a scratch file (there are load methods including a boolean
>>>>>>>>>> switch to activate that).
>>>>>>>>>> 
>>>>>>>>>> 2. Your own implementation consumes more or less memory to process those
>>>>>>>>>> thumbnails.
>>>>>>>>>> 
>>>>>>>>>> - Check whether you are releasing all resources you use during your process
>>>>>>>>>> (especially the images you're creating).
>>>>>>>>>> 
>>>>>>>>>> HTH,
>>>>>>>>>> Andreas
>>>>>>>>>> 
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail:  users-unsubscribe@pdfbox.apache.org
>>>>>>>>>> For additional commands, e-mail:  users-help@pdfbox.apache.org
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -- 
>>>>>>>>> Alex Sviridov
>>>>>>>> 
>>>>>>>> BR
>>>>>>>> Andreas
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> Alex Sviridov
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> BR
>>>>>> Andreas
>>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Alex Sviridov
>>>> 
>>> 
>>> 
>>> -- 
>>> Alex Sviridov
>> 
> 
> 
> -- 
> Alex Sviridov


