pdfbox-dev mailing list archives

From "Stefan Mücke" <s.mue...@devsup.de>
Subject Re: Bug-Fix for Scratch File Bug
Date Mon, 29 Aug 2011 21:24:07 GMT
> License considerations severely limit what we can do with patches 
> provided via the mailing list.
> 
> Would you please create an issue at
> https://issues.apache.org/jira/browse/PDFBOX ?  When you attach your 
> patch files, please check the box that grants us the right to use the 
> files. 

Okay, but I have trouble finding an "Add/New/Create/Report issue" button. Do I need to have
an account? I don't really want to create one.

Stefan


> May sound silly ... but ... it keeps everything legal!
> 
> Thanks!
> 
> Daniel
> 
> On Sat, Aug 27, 2011 at 5:04 PM, Stefan Mücke <s.muecke@devsup.de> wrote:
> 
> > Hi PDFBox committers,
> >
> > I would like to contribute a bug fix for a long-standing, major problem 
> > in PDFBox.
> >
> > PDFBox uses a scratch file to reduce memory consumption. However, there 
> > is no mechanism that prevents two PDStreams from writing to the 
> > scratch file at the same time. When this happens, the resulting PDF 
> > contains garbage in some streams. This problem occurred to me several 
> > times (e.g. when writing to an image stream while constructing a 
> > page).
> >
> > Reproducing the bug
> > *******************
> >
> > One can easily reproduce the bug. Open file AddImageToPDF.java and move 
> > the following line:
> >
> >    PDPageContentStream contentStream =
> >        new PDPageContentStream(doc, page, true, true);
> >
> > immediately after the line in which the PDPage object is fetched:
> >
> >    PDPage page =
> >        (PDPage)doc.getDocumentCatalog().getAllPages().get( 0 );
> >
> > With this modification, one will still get a PDF file, but Acrobat 
> > Reader will report that the image could not be processed. BTW, the 
> > files AddImageToPDF.java and ImageToPDF.java are almost identical. One 
> > of them should be deleted.
> >
> > Bug-Fix
> > *******
> >
> > The problem can be solved by using a scratch file that is divided into
> > pages (e.g. of 4 KB). Each PDStream in the scratch file is then 
> > associated with a list of pages. This list grows as more data is 
> > written to the stream.
> >
> > The bug fix requires minimal changes to the existing code. The very 
> > nice RandomAccess interface made this very easy.
> >
> > Here is what needs to be changed:
> >
> >    - Add the attached "PagedMultiRandomAccessFile.java" to the I/O 
> >      package
> >    - Change COSDocument.getScratchFile() to return a 
> >      RandomAccess instance provided by PagedMultiRandomAccessFile:
> >
> >        private PagedMultiRandomAccessFile scratchFile = null;
> >
> >        [...]
> >
> >        public COSDocument(File scratchDir) throws IOException {
> >                tmpFile = File.createTempFile("pdfbox", "tmp", scratchDir);
> >                scratchFile = new PagedMultiRandomAccessFile(
> >                        new RandomAccessFile(tmpFile, "rw"));
> >        }
> >
> >        public COSDocument(RandomAccess file) {
> >                // scratchFile = file;
> >                throw new RuntimeException("Not yet implemented."); //$NON-NLS-1$
> >        }
> >
> >        [...]
> >
> >        /**
> >         * Returns a new scratch file.
> >         *
> >         * @return the newly created scratch file
> >         */
> >        public RandomAccess getScratchFile() {
> >                return scratchFile.getNewRandomAcess();
> >        }
> >
> > One of the COSDocument constructors takes a RandomAccess file. This
> > constructor is only called in a single location, namely, in method
> > PDFParser.parse(). I am not sure if the RandomAccess parameter provided 
> > here is really a scratch file. Someone will have to decide what to do 
> > with this one.
> >
> > The code has been thoroughly tested and has been used in the production 
> > of several books without any problems.
> >
> > In the attachment please find the code. There is also a JUnit test that 
> > was used to debug my code. I have added an Apache license header and 
> > adopted PDFBox's code style. Feel free to make any desired changes.
> >
> > Best regards,
> >
> > Stefan Mücke
> >
> >
> 
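
The paged layout described under "Bug-Fix" in the quoted message can be illustrated with a small in-memory sketch. This is not the attached PagedMultiRandomAccessFile, which is not shown here. The class name PagedScratchSketch, its nested Stream class, and all method names are hypothetical, and a list of byte arrays stands in for the scratch file. It only shows why giving each stream its own list of 4 KB pages keeps interleaved writes from clobbering each other.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a paged scratch store; not the class from the patch. */
final class PagedScratchSketch {
    static final int PAGE_SIZE = 4096; // 4 KB pages, as suggested in the message

    // Shared backing store; stands in for the single scratch file.
    private final List<byte[]> pages = new ArrayList<>();

    /** Hands out a new logical stream that owns its own list of pages. */
    Stream newStream() {
        return new Stream();
    }

    /** Total number of pages allocated across all streams. */
    int pageCount() {
        return pages.size();
    }

    final class Stream {
        // Indexes into the shared store, in the order this stream claimed them.
        private final List<Integer> pageIndexes = new ArrayList<>();
        private long length = 0;

        /** Appends data, claiming a fresh page whenever the current one is full. */
        void write(byte[] data) {
            for (byte b : data) {
                int offset = (int) (length % PAGE_SIZE);
                if (offset == 0) {
                    // No page yet, or the last page is full: claim a new one.
                    pages.add(new byte[PAGE_SIZE]);
                    pageIndexes.add(pages.size() - 1);
                }
                int lastPage = pageIndexes.get(pageIndexes.size() - 1);
                pages.get(lastPage)[offset] = b;
                length++;
            }
        }

        /** Reassembles this stream's bytes by walking its page list in order. */
        byte[] readAll() {
            byte[] out = new byte[(int) length];
            for (int i = 0; i < out.length; i++) {
                int page = pageIndexes.get(i / PAGE_SIZE);
                out[i] = pages.get(page)[i % PAGE_SIZE];
            }
            return out;
        }

        long length() {
            return length;
        }
    }
}
```

Two streams obtained from newStream() can be written to alternately. Each one only ever touches pages recorded in its own list, so readAll() on one stream returns exactly the bytes written to it, whatever the other stream did in between.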




