pdfbox-dev mailing list archives

From Daniel Wilson <williamstonconsult...@gmail.com>
Subject Re: Bug-Fix for Scratch File Bug
Date Mon, 29 Aug 2011 19:08:42 GMT
Stefan,

License considerations severely limit what we can do with patches provided
via the mailing list.

Would you please create an issue at
https://issues.apache.org/jira/browse/PDFBOX ?  When you attach your patch
files, please check the box that grants us the right to use the files.

May sound silly ... but ... it keeps everything legal!

Thanks!

Daniel

On Sat, Aug 27, 2011 at 5:04 PM, Stefan Mücke <s.muecke@devsup.de> wrote:

> Hi PDFBox committers,
>
> I would like to contribute a bug fix for a long-standing, major problem in
> PDFBox.
>
> PDFBox uses a scratch file to reduce memory consumption. However, there is
> no mechanism that prevents two PDStreams from writing to the scratch file at
> the same time. When this happens, the resulting PDF contains garbage in some
> streams. I have run into this problem several times (e.g. when writing to an
> image stream while constructing a page).
>
> Reproducing the bug
> *******************
>
> One can easily reproduce the bug. Open file AddImageToPDF.java and move the
> following line:
>
>    PDPageContentStream contentStream =
>        new PDPageContentStream(doc, page, true, true);
>
> immediately after the line in which the PDPage object is fetched:
>
>    PDPage page =
>        (PDPage)doc.getDocumentCatalog().getAllPages().get( 0 );
>
> With this modification, one will still get a PDF file, but Acrobat Reader
> will report that the image could not be processed. BTW, the files
> AddImageToPDF.java and ImageToPDF.java are almost identical. One of them
> should be deleted.
>
> Bug-Fix
> *******
>
> The problem can be solved by using a scratch file that is divided into
> pages (e.g. of 4 KB). Each PDStream in the scratch file is then associated
> with a list of pages. This list grows as more data is written to the stream.
>
> The bug fix requires minimal changes to the existing code. The very nice
> RandomAccess interface made this very easy.
>
> Here is what needs to be changed:
>
>    - Add the attached "PagedMultiRandomAccessFile.java" to the I/O package
>    - Change COSDocument.getScratchFile() to return a RandomAccess
>      instance provided by PagedMultiRandomAccessFile:
>
>        private PagedMultiRandomAccessFile scratchFile = null;
>
>        [...]
>
>        public COSDocument(File scratchDir) throws IOException {
>                tmpFile = File.createTempFile("pdfbox", "tmp", scratchDir);
>                scratchFile = new PagedMultiRandomAccessFile(
>                        new RandomAccessFile(tmpFile, "rw"));
>        }
>
>        public COSDocument(RandomAccess file) {
>                // scratchFile = file;
>                throw new RuntimeException("Not yet implemented."); //$NON-NLS-1$
>        }
>
>        [...]
>
>        /**
>         * Returns a new scratch file.
>         *
>         * @return the newly created scratch file
>         */
>        public RandomAccess getScratchFile() {
>                return scratchFile.getNewRandomAcess();
>        }
>
> One of the COSDocument constructors takes a RandomAccess file. This
> constructor is only called in a single location, namely, in method
> PDFParser.parse(). I am not sure if the RandomAccess parameter provided here
> is really a scratch file. Someone will have to decide what to do with this
> one.
>
> The code has been thoroughly tested and has been used in the production of
> several books without any problems.
>
> Please find the code in the attachment. There is also a JUnit test that was
> used to debug my code. I have added an Apache license header and adopted
> PDFBox's code style. Feel free to make any desired changes.
>
> Best regards,
>
> Stefan Mücke
>
>
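
The paged scratch file idea described in the quoted message (fixed-size pages, with each stream keeping its own growing list of pages) can be sketched roughly as follows. This is not the attached PagedMultiRandomAccessFile.java, which is not reproduced in this message; the names PagedScratchSketch, PagedStream, newStream and demo are hypothetical, and the sketch is single-threaded and leaves out seeking, page reuse and cleanup.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: one backing file shared by several page-mapped streams. */
final class PagedScratchSketch {
    static final int PAGE_SIZE = 4096; // 4 KB pages, as suggested in the message

    private final RandomAccessFile backing;
    private int nextFreePage = 0;

    PagedScratchSketch(RandomAccessFile backing) { this.backing = backing; }

    /** Creates a new logical stream that starts with no pages. */
    PagedStream newStream() { return new PagedStream(); }

    /** A logical stream that owns its own list of pages in the shared file. */
    final class PagedStream {
        private final List<Integer> pages = new ArrayList<>();
        private int length = 0;

        /** Appends data, claiming a fresh page whenever the current one is full. */
        void write(byte[] data) {
            try {
                int off = 0;
                while (off < data.length) {
                    int pageIdx = length / PAGE_SIZE;
                    int inPage = length % PAGE_SIZE;
                    if (pageIdx == pages.size()) {
                        pages.add(nextFreePage++); // pages are never shared
                    }
                    int n = Math.min(PAGE_SIZE - inPage, data.length - off);
                    backing.seek((long) pages.get(pageIdx) * PAGE_SIZE + inPage);
                    backing.write(data, off, n);
                    off += n;
                    length += n;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        /** Reads the whole stream back by walking its page list in order. */
        byte[] readAll() {
            try {
                byte[] out = new byte[length];
                int off = 0;
                while (off < out.length) {
                    int n = Math.min(PAGE_SIZE, out.length - off);
                    backing.seek((long) pages.get(off / PAGE_SIZE) * PAGE_SIZE);
                    backing.readFully(out, off, n);
                    off += n;
                }
                return out;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Interleaves writes to two streams; returns true if both read back intact. */
    static boolean demo() {
        try {
            File tmp = File.createTempFile("paged", ".tmp");
            tmp.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
                PagedScratchSketch scratch = new PagedScratchSketch(raf);
                PagedStream a = scratch.newStream();
                PagedStream b = scratch.newStream();
                byte[] chunkA = new byte[1000];
                byte[] chunkB = new byte[700];
                Arrays.fill(chunkA, (byte) 'A');
                Arrays.fill(chunkB, (byte) 'b');
                for (int i = 0; i < 5; i++) { // alternate between the two streams
                    a.write(chunkA);
                    b.write(chunkB);
                }
                byte[] expectA = new byte[5000];
                byte[] expectB = new byte[3500];
                Arrays.fill(expectA, (byte) 'A');
                Arrays.fill(expectB, (byte) 'b');
                return Arrays.equals(expectA, a.readAll())
                        && Arrays.equals(expectB, b.readAll());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("streams intact: " + demo());
    }
}
```

Because every stream translates its own logical offset into a page it owns, alternating writes between two streams no longer overwrite each other's data, which is the failure mode described under "Reproducing the bug". The cost is up to one partially filled page of slack per stream.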
