pdfbox-users mailing list archives

From Alex Sviridov <ooo_satu...@mail.ru>
Subject Re[10]: PDFRenderer, PDDocument memory issue
Date Wed, 01 Jul 2015 12:15:39 GMT
 Ok. Thank you again. There is just one thing I don't understand. Why keep so much
data in memory when I only need the page images and, most importantly, I PROCESS THEM PAGE BY PAGE?

Is there no way to avoid keeping data for previous pages when I only need the data for page N?


Wednesday, 1 July 2015, 14:08 +02:00, from Andreas Lehmkühler <andreas@lehmi.de>:
>
>
>> Alex Sviridov < ooo_saturn7@mail.ru > wrote on 1 July 2015 at 13:59:
>> 
>> 
>>  Ok. Thank you very much for the explanation. Could you tell me where this scratch
>> file is located on Linux and on Windows?
>java.io.File.createTempFile is used to create that file. It uses the default
>temp directory. That is "/tmp" on Linux. I'm not sure about Windows, as different
>environment variables (TMP, TEMP, USERPROFILE, ....) are used to search for such
>a directory.
>
>You may define your own temp directory using the following parameter when
>starting your application
>
>-Djava.io.tmpdir=PATH-TO-YOUR-TEMP
>
>
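To check where such temp files end up on a given machine, a minimal sketch using only the JDK is shown here. The class and method names (TempDirProbe, defaultTempDir, probeTempFileLocation) are hypothetical and not part of PDFBox; the sketch only reports where java.io.File.createTempFile places files by default.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of PDFBox: reports the JVM's default temp directory.
final class TempDirProbe {

    /** Returns the value of the java.io.tmpdir system property. */
    static String defaultTempDir() {
        return System.getProperty("java.io.tmpdir");
    }

    /** Creates a throwaway temp file and returns the directory it was created in. */
    static File probeTempFileLocation() {
        try {
            File probe = File.createTempFile("probe", ".tmp");
            probe.deleteOnExit(); // remove the probe file when the JVM exits
            return probe.getParentFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("java.io.tmpdir = " + defaultTempDir());
        System.out.println("temp files are created in " + probeTempFileLocation());
    }
}
```

Running it once normally and once with -Djava.io.tmpdir=PATH-TO-YOUR-TEMP should show whether the override is picked up.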
>> 
>> 
>> Wednesday, 1 July 2015, 13:54 +02:00, from Andreas Lehmkühler < andreas@lehmi.de >:
>> >> Alex Sviridov <  ooo_saturn7@mail.ru > wrote on 1 July 2015 at 13:38:
>> >> 
>> >> 
>> >>  The file is here  https://yadi.sk/i/Y0fTuvHmhbZiE
>> >Ah, that explains a lot. The PDF is a scanned document; every page holds a
>> >color image, which consumes a lot of memory when processed.
>> >
>> >> I tried load(fileName, true). As a result, I no longer have memory
>> >> problems. However, now I have 2 other problems:
>> >>
>> >> 1) All the thumbnail images are loaded, but it is VERY SLOW.
>> >> Loading one thumbnail image takes about 4 seconds!
>> >When it comes to huge PDFs, you have to pick your poison. Either you provide
>> >enough memory to do all the work in memory (fast), or you use a scratch file
>> >to save memory (slow).
>> >
>> >And yes, there is room for improvement in PDFBox's memory handling (read on
>> >demand, remove after usage), but that is a future feature. Patches
>> >are welcome.
>> >
>> >> 2) Besides, as you can see, the thumbnail images are loaded in a separate
>> >> thread. While this thread is running and I try to get the big image for
>> >> the main content using
>> >> BufferedImage bi=pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
>> >> I get the following exception:
>> >> 
>> >> java.io.IOException: java.util.zip.DataFormatException: unknown compression method
>> >>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
>> >>     at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
>> >>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
>> >>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
>> >>     at org.apache.pdfbox.cos.COSStream.checkUnfilteredBuffer(COSStream.java:265)
>> >>     at org.apache.pdfbox.cos.COSStream.getUnfilteredRandomAccess(COSStream.java:239)
>> >>     at org.apache.pdfbox.pdfparser.BaseParser.<init>(BaseParser.java:146)
>> >>     at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:78)
>> >>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:451)
>> >>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
>> >>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>> >>     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
>> >>     at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
>> >>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
>> >>     at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
>> >>   ....
>> >>     at javafx.concurrent.Task$TaskCallable.call(Task.java:1423)
>> >>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> >>     at java.lang.Thread.run(Thread.java:745)
>> >> Caused by: java.util.zip.DataFormatException: unknown compression method
>> >>     at java.util.zip.Inflater.inflateBytes(Native Method)
>> >>     at java.util.zip.Inflater.inflate(Inflater.java:259)
>> >>     at java.util.zip.Inflater.inflate(Inflater.java:280)
>> >>     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:101)
>> >>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
>> >>     ... 20 more
>> >> 
>> >> How can I solve these problems?
>> >PDFBox isn't supposed to be thread safe.
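Since PDFBox is not meant to be used from several threads at once, one possible workaround is to send every render request (thumbnails and the full-size page alike) through a single worker thread, so that two renders never overlap. The following is only a sketch under that assumption: RenderQueue is a hypothetical name, and the string-returning lambdas stand in for the real pdfRenderer.renderImageWithDPI(...) calls.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: serializes all render jobs on one background thread.
final class RenderQueue implements AutoCloseable {

    // A single worker thread runs the jobs strictly one at a time, in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "pdf-render");
        t.setDaemon(true); // do not keep the JVM alive just for rendering
        return t;
    });

    /** Queues a render job; the caller supplies the actual rendering code. */
    <T> Future<T> submit(Callable<T> renderJob) {
        return worker.submit(renderJob);
    }

    /** Stops accepting new jobs; jobs already queued still run. */
    @Override
    public void close() {
        worker.shutdown();
    }

    public static void main(String[] args) throws Exception {
        try (RenderQueue queue = new RenderQueue()) {
            // Stand-ins for pdfRenderer.renderImageWithDPI(page, dpi, ImageType.RGB).
            Future<String> thumb = queue.submit(() -> "thumbnail of page 0");
            Future<String> full = queue.submit(() -> "full-size page 0");
            System.out.println(thumb.get());
            System.out.println(full.get());
        }
    }
}
```

The trade-off is that a full-size render requested while thumbnails are queued has to wait its turn; the sketch does not prioritize jobs.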
>> >
>> >> 
>> >> 
>> >> Wednesday, 1 July 2015, 13:17 +02:00, from Andreas Lehmkühler < andreas@lehmi.de >:
>> >> >
>> >> >
>> >> >> Alex Sviridov <  ooo_saturn7@mail.ru > wrote on 1 July 2015 at 13:09:
>> >> >> 
>> >> >> 
>> >> >>  I decided to show all the code. I am also sending the PDF file, a file
>> >> >> from the internet that I use for testing.
>> >> >The attachment didn't make it through due to restrictions on the mailing
>> >> >list. Please post a link to the original source or another place where we
>> >> >can download the PDF in question.
>> >> >
>> >> >> 
>> >> >> // Background task: renders every page thumbnail off the JavaFX application thread.
>> >> >> Task<Integer> task = new Task<Integer>() {
>> >> >>     @Override protected Integer call() throws Exception {
>> >> >>         for (int i=0;i<model.getTotalPages();i++){
>> >> >>             System.out.println("Point a:"+i);
>> >> >>             WritableImage writableImage=model.getPageThumbImage(i);
>> >> >>             System.out.println("Point b:"+i);
>> >> >>             ImageView imageView=new ImageView(writableImage);
>> >> >>             System.out.println("Point c:"+i);
>> >> >>             Label label=new Label(Integer.toString(i+1));
>> >> >>             System.out.println("Point d:"+i);
>> >> >>             VBox vBox=new VBox(imageView,label);
>> >> >>             System.out.println("Point e:"+i);
>> >> >>             vBox.setAlignment(Pos.CENTER);
>> >> >>             vBox.setStyle("-fx-padding:5px 5px 5px 5px;-fx-background-color:red");
>> >> >>             System.out.println("Point f:"+i);
>> >> >>             // Scene-graph changes must happen on the JavaFX application thread.
>> >> >>             Platform.runLater(new Runnable() {
>> >> >>                 @Override
>> >> >>                 public void run() {
>> >> >>                      thumbFlowPane.getChildren().add(vBox);
>> >> >>                 }
>> >> >>             });
>> >> >>         }
>> >> >>         return null;
>> >> >>     }
>> >> >> };
>> >> >> new Thread(task).start();
>> >> >> 
>> >> >> And here is the tail of the output
>> >> >> ....
>> >> >> Point a:30
>> >> >> Point b:30
>> >> >> Point c:30
>> >> >> Point d:30
>> >> >> Point e:30
>> >> >> Point f:30
>> >> >> Point a:31
>> >> >> 
>> >> >> What is a scratch file? Sorry, I don't understand.
>> >> >
>> >> >PDFBox holds a lot of temporary data in memory. To reduce the memory
>> >> >footprint, one can choose to use a scratch file instead, so that some/most
>> >> >of that data will be held in a file.
>> >> >
>> >> >To do so, simply use another load method, e.g. 
>> >> >
>> >> >load(File file, boolean useScratchFiles)
>> >> >> 
>> >> >> 
>> >> >> Wednesday, 1 July 2015, 13:04 +02:00, from Andreas Lehmkühler < andreas@lehmi.de >:
>> >> >> >
>> >> >> >
>> >> >> >> Alex Sviridov <  ooo_saturn7@mail.ru > wrote on 1 July 2015 at 12:58:
>> >> >> >> 
>> >> >> >> 
>> >> >> >>  Thank you for the answer. I tried
>> >> >> >> pdfbox-app-2.0.0-20150630.220424-1464.jar and the result is the same.
>> >> >> >> 
>> >> >> >> When I create the images, I add them to a JavaFX FlowPane. However, the
>> >> >> >> problem is not in the images because, I repeat, I get 400 MB back when I do
>> >> >> >> pdfDocument=null,pdfRenderer=null.
>> >> >> >> 
>> >> >> >> Besides, when I do pdfDocument = PDDocument.load(new File(fileName)) I
>> >> >> >> don't have any problems with memory.
>> >> >> >> 
>> >> >> >> I get the memory problem when I run getPageThumbImage in a for loop.
>> >> >> >> 
>> >> >> >> I am sure that the problem is in PDFBox. Please help me.
>> >> >> >Maybe, but I'm not sure at all.
>> >> >> >
>> >> >> >Try to use the scratch file.
>> >> >> >
>> >> >> >> Wednesday, 1 July 2015, 12:48 +02:00, from Andreas Lehmkühler < andreas@lehmi.de >:
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >> Alex Sviridov <  ooo_saturn7@mail.ru > wrote on 1 July 2015 at 10:16:
>> >> >> >> >> 
>> >> >> >> >> 
>> >> >> >> >>  I want to display all page thumbnails. However, I ran into a memory
>> >> >> >> >> problem with PDFRenderer or PDDocument; I don't know which one.
>> >> >> >> >> 
>> >> >> >> >> I have the following code:
>> >> >> >> >>    ....
>> >> >> >> >>     private PDDocument pdfDocument;
>> >> >> >> >>     
>> >> >> >> >>     private PDFRenderer pdfRenderer;
>> >> >> >> >> 
>> >> >> >> >>     public WritableImage getPageThumbImage(int page){
>> >> >> >> >>         WritableImage result=null;
>> >> >> >> >>         try {
>> >> >> >> >>             BufferedImage bi=pdfRenderer.renderImageWithDPI(page, 12, ImageType.RGB);
>> >> >> >> >>             result=SwingFXUtils.toFXImage(bi, null);
>> >> >> >> >>         } catch (IOException ex) {
>> >> >> >> >>              ....
>> >> >> >> >>         }
>> >> >> >> >>         return result;
>> >> >> >> >>     }
>> >> >> >> >>  .....
>> >> >> >> >> I run the method getPageThumbImage in a for loop for every page. I set
>> >> >> >> >> the Java heap size to 500 MB.
>> >> >> >> >> With that I can get about 30 images using getPageThumbImage (if I allow
>> >> >> >> >> more memory, I get more).
>> >> >> >> >> My application has real-time memory graphs, and they show that memory
>> >> >> >> >> fills up very quickly.
>> >> >> >> >> When there is no free memory left, getPageThumbImage hangs: no
>> >> >> >> >> exception, nothing. The code just stops.
>> >> >> >> >> When I do pdfDocument=null,pdfRenderer=null I get about 400 MB of free
>> >> >> >> >> memory back. How can I solve this problem?
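To put numbers on the behaviour described above, heap usage can be logged around each render call. This is a minimal sketch using only java.lang.Runtime; HeapProbe and its methods are hypothetical names, and the byte array merely stands in for a rendered page.

```java
// Hypothetical helper: reports how much of the Java heap is currently in use.
final class HeapProbe {

    /** Bytes currently used on the heap (allocated heap minus its free part). */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Formats used and maximum heap size in megabytes, prefixed with a label. */
    static String report(String label) {
        long mb = 1024L * 1024L;
        return String.format("%s: used=%d MB, max=%d MB",
                label, usedBytes() / mb, Runtime.getRuntime().maxMemory() / mb);
    }

    public static void main(String[] args) {
        System.out.println(report("before"));
        byte[] block = new byte[32 * 1024 * 1024]; // stand-in for a rendered page
        System.out.println(report("after allocating " + block.length + " bytes"));
    }
}
```

Calling report(...) before and after each getPageThumbImage call would show whether usage grows per page or only when the document is loaded. The figures include garbage that has not been collected yet, so they are an upper bound on live data.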
>> >> >> >> >There are 2 possible issues, and maybe both are relevant.
>> >> >> >> >
>> >> >> >> >1. PDFBox consumes more or less memory to load a PDF depending on
>> >> >> >> >the size and the content of the PDF.
>> >> >> >> >
>> >> >> >> >- Are you using the latest 2.0.0-SNAPSHOT? There were some
>> >> >> >> >improvements concerning the memory footprint lately.
>> >> >> >> >- Try to use a scratch file (there are load methods including a
>> >> >> >> >boolean switch to activate that).
>> >> >> >> >
>> >> >> >> >2. Your own implementation consumes more or less memory to process
>> >> >> >> >those thumbnails.
>> >> >> >> >
>> >> >> >> >- Check that you are releasing all the resources (especially the images
>> >> >> >> >you're creating) that you use during your process.
>> >> >> >> >
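As an illustration of the second point, here is a sketch of one way to keep only a small thumbnail and let the larger intermediate image go. It assumes the caller starts from a java.awt.image.BufferedImage; ThumbnailScaler and toThumbnail are hypothetical names, and nothing here is PDFBox API.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper: shrinks an image and releases what the source image holds.
final class ThumbnailScaler {

    /** Returns a copy of source whose longer side is at most maxSide pixels (never upscales). */
    static BufferedImage toThumbnail(BufferedImage source, int maxSide) {
        int longest = Math.max(source.getWidth(), source.getHeight());
        double scale = Math.min(1.0, (double) maxSide / longest);
        int w = Math.max(1, (int) Math.round(source.getWidth() * scale));
        int h = Math.max(1, (int) Math.round(source.getHeight() * scale));

        BufferedImage thumb = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = thumb.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(source, 0, 0, w, h, null);
        } finally {
            g.dispose(); // always release the graphics context
        }
        source.flush(); // drop cached resources held by the source image
        return thumb;
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(800, 400, BufferedImage.TYPE_INT_RGB);
        BufferedImage thumb = toThumbnail(page, 100);
        System.out.println(thumb.getWidth() + "x" + thumb.getHeight());
    }
}
```

The source image only becomes collectable once the caller also drops its own reference to it; flush() by itself does not free the pixel data.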
>> >> >> >> >HTH,
>> >> >> >> >Andreas
>> >> >> >> >
>> >> >> >> >---------------------------------------------------------------------
>> >> >> >> >To unsubscribe, e-mail:  users-unsubscribe@pdfbox.apache.org
>> >> >> >> >For additional commands, e-mail:  users-help@pdfbox.apache.org
>> >> >> >> >
>> >> >> >> 
>> >> >> >> 
>> >> >> >> -- 
>> >> >> >> Alex Sviridov
>> >> >> >
>> >> >> >BR
>> >> >> >Andreas
>> >> >> >
>> >> >> >
>> >> >> 
>> >> >> 
>> >> >> -- 
>> >> >> Alex Sviridov
>> >> >> 
>> >> >
>> >> >
>> >> >BR
>> >> >Andreas
>> >> >
>> >> >
>> >> 
>> >> 
>> >> -- 
>> >> Alex Sviridov
>> >
>> >
>> 
>> 
>> -- 
>> Alex Sviridov
>
>


-- 
Alex Sviridov