pdfbox-dev mailing list archives

From "Stefan Mücke" <s.mue...@devsup.de>
Subject Re: Bug-Fix for Scratch File Bug
Date Mon, 29 Aug 2011 21:48:05 GMT
Here's the issue with the attached code:
https://issues.apache.org/jira/browse/PDFBOX-1109


> > License considerations severely limit what we can do with patches 
> > provided via the mailing list.
> > 
> > Would you please create an issue at
> > https://issues.apache.org/jira/browse/PDFBOX ?  When you attach your 
> > patch files, please check the box that grants us the right to use the 
> > files. 
> 
> Okay, but I have trouble finding an "Add/New/Create/Report issue" button. 
> Do I need to have an account? I don't really want to create one.
> 
> Stefan
> 
> 
> > May sound silly ... but ... it keeps everything legal!
> > 
> > Thanks!
> > 
> > Daniel
> > 
> > On Sat, Aug 27, 2011 at 5:04 PM, Stefan Mücke <s.muecke@devsup.de> 
> > wrote: 
> > > Hi PDFBox committers,
> > >
> > > I would like to contribute a bug fix for a long-standing, major
> > > problem in PDFBox.
> > >
> > > PDFBox uses a scratch file to reduce memory consumption. However,
> > > there is no mechanism that prevents two PDStreams from writing to
> > > the scratch file at the same time. When this happens, the resulting
> > > PDF contains garbage in some streams. This problem occurred to me
> > > several times (e.g. when writing to an image stream while
> > > constructing a page).
> > >
> > > Reproducing the bug
> > > *******************
> > >
> > > One can easily reproduce the bug. Open file AddImageToPDF.java and
> > > move the following line:
> > >
> > >    PDPageContentStream contentStream =
> > >        new PDPageContentStream(doc, page, true, true);
> > >
> > > immediately after the line in which the PDPage object is fetched:
> > >
> > >    PDPage page =
> > >        (PDPage)doc.getDocumentCatalog().getAllPages().get( 0 );
> > >
> > > With this modification, one will still get a PDF file, but Acrobat
> > > Reader will report that the image could not be processed. BTW, the
> > > files AddImageToPDF.java and ImageToPDF.java are almost identical.
> > > One of them should be deleted.
> > >
> > > Bug-Fix
> > > *******
> > >
> > > The problem can be solved by using a scratch file that is divided
> > > into pages (e.g. of 4 KB). Each PDStream in the scratch file is
> > > then associated with a list of pages. This list grows as more data
> > > is written to the stream.
> > >
> > > The bug fix requires minimal changes to the existing code. The very
> > > nice RandomAccess interface made this very easy.
> > >
> > > Here is what needs to be changed:
> > >
> > >    - Add the attached "PagedMultiRandomAccessFile.java" to the I/O
> > >      package
> > >    - Change COSDocument.getScratchFile() to return a
> > >      RandomAccess instance provided by PagedMultiRandomAccessFile:
> > >
> > >        private PagedMultiRandomAccessFile scratchFile = null;
> > >
> > >        [...]
> > >
> > >        public COSDocument(File scratchDir) throws IOException {
> > >                tmpFile = File.createTempFile("pdfbox", "tmp", scratchDir);
> > >                scratchFile = new PagedMultiRandomAccessFile(
> > >                        new RandomAccessFile(tmpFile, "rw"));
> > >        }
> > >
> > >        public COSDocument(RandomAccess file) {
> > >                // scratchFile = file;
> > >                throw new RuntimeException("Not yet implemented."); //$NON-NLS-1$
> > >        }
> > >
> > >        [...]
> > >
> > >        /**
> > >         * Returns a new scratch file.
> > >         *
> > >         * @return the newly created scratch file
> > >         */
> > >        public RandomAccess getScratchFile() {
> > >                return scratchFile.getNewRandomAcess();
> > >        }
> > >
> > > One of the COSDocument constructors takes a RandomAccess file. This
> > > constructor is only called in a single location, namely, in method
> > > PDFParser.parse(). I am not sure if the RandomAccess parameter
> > > provided here is really a scratch file. Someone will have to decide
> > > what to do with this one.
> > >
> > > The code has been thoroughly tested and has been used in the
> > > production of several books without any problems.
> > >
> > > In the attachment please find the code. There is also a JUnit test
> > > that was used to debug my code. I have added an Apache license
> > > header and adopted PDFBox's code style. Feel free to make any
> > > desired changes.
> > >
> > > Best regards,
> > >
> > > Stefan Mücke
> > >
> > >
> > 
> 
> 
> 
> 
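
For readers who do not have the attachment at hand, the paging idea described under "Bug-Fix" in the quoted message can be sketched roughly as follows. This is not the attached PagedMultiRandomAccessFile and not PDFBox code. The names PagedScratchSketch, PagedStream, newStream and readAll are hypothetical, and the sketch uses only java.io.RandomAccessFile rather than PDFBox's RandomAccess interface. It assumes 4 KB pages, as suggested in the message.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: one backing file, split into fixed-size pages, shared by many streams. */
class PagedScratchSketch {
    static final int PAGE_SIZE = 4096; // 4 KB pages (assumption taken from the message)

    private final RandomAccessFile file;
    private int nextFreePage = 0;

    PagedScratchSketch(RandomAccessFile file) {
        this.file = file;
    }

    /** Each caller gets its own logical stream with its own list of pages. */
    PagedStream newStream() {
        return new PagedStream();
    }

    /** Hands out the next unused page index of the backing file. */
    private synchronized int allocatePage() {
        return nextFreePage++;
    }

    class PagedStream {
        private final List<Integer> pages = new ArrayList<Integer>();
        private long length = 0;

        /** Appends data to this stream, allocating a new page whenever the last one is full. */
        void write(byte[] data) throws IOException {
            int off = 0;
            while (off < data.length) {
                int pageIndex = (int) (length / PAGE_SIZE);
                int pageOffset = (int) (length % PAGE_SIZE);
                if (pageIndex == pages.size()) {
                    pages.add(allocatePage());
                }
                int n = Math.min(PAGE_SIZE - pageOffset, data.length - off);
                long pageStart = pages.get(pageIndex).longValue() * PAGE_SIZE;
                // Seek and write as one step so streams cannot disturb each other's position.
                synchronized (PagedScratchSketch.this) {
                    file.seek(pageStart + pageOffset);
                    file.write(data, off, n);
                }
                off += n;
                length += n;
            }
        }

        /** Reads back everything written to this stream, page by page. */
        byte[] readAll() throws IOException {
            byte[] out = new byte[(int) length];
            int off = 0;
            while (off < out.length) {
                int pageIndex = off / PAGE_SIZE;
                int pageOffset = off % PAGE_SIZE;
                int n = Math.min(PAGE_SIZE - pageOffset, out.length - off);
                long pageStart = pages.get(pageIndex).longValue() * PAGE_SIZE;
                synchronized (PagedScratchSketch.this) {
                    file.seek(pageStart + pageOffset);
                    file.readFully(out, off, n);
                }
                off += n;
            }
            return out;
        }
    }
}
```

Because every stream writes only into pages it owns, two streams created from the same PagedScratchSketch can take turns writing without overwriting each other, which is the situation that produced garbage in the original shared scratch file. Pages are never freed in this sketch, so a real implementation would probably also want to recycle the pages of closed streams.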




