pdfbox-dev mailing list archives

From "Jonathan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-4567) Contribution of PDF Linearization
Date Mon, 24 Jun 2019 12:53:00 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-4567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871183#comment-16871183 ]

Jonathan commented on PDFBOX-4567:
----------------------------------

Just wondering: are you interested in integrating this? The algorithm is working well; it's
just a matter of exposing the API in a way that makes sense for you.

Right now, we do the following:
 * We use our own parser, which we proposed in PDFBOX-4542, to avoid having to load all object
streams into memory. I still have to review whether your recent implementation of the on-demand
parser makes this superfluous, but I'm afraid it doesn't. We need to catalogue and rearrange
all objects contained in the PDF, so we are going to walk through the entire PDF object tree,
which, in my understanding, will trigger a complete parse.
 * After parsing the document, we currently have a class, PDFLinearizer, which takes a
COSDocument instance. At this point it doesn't matter which parser was used. This class then
applies a reimplementation of the QPDF linearization algorithm. First, we use a PDFOptimizer
to determine the order of the objects and to flatten out inherited attributes; then we write
the entire PDF and do the length calculations. In contrast to the QPDF implementation, we do
not write everything twice (once into an empty pipe, then a second time to a real file).
Instead, we subclassed COSWriter, mainly to give it the ability to reset itself while handing
us the written content. This way we avoid rewriting the great bulk of the PDF file: we can
store most objects in a data structure and write them later. The only parts that really need
to be written twice are the cross-reference streams and the linearization dictionary.
 * PDFLinearizer returns a WrittenObjectStore, which has the ability to serialize itself
to a PDF file. We used this mechanism because we never actually write the file; instead, we
store it in an object that emulates an InputStream, transparently resolving the references
created by our parser modification. This is a rather special use case of ours. For public
use, we could potentially add two serializers: one that writes the file to disk and another
that creates a COSDocument/PDFDocument.
 * Like QPDF, we don't support outline or thumbnail hint tables. It shouldn't be too hard to
implement them, though, as all the information is already there.
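
To illustrate the buffering idea from the list above, here is a minimal, self-contained Java
sketch. It is not the contributed code and uses no PDFBox classes: BufferedObjectStore, add,
header and asInputStream are hypothetical names, and the fixed-width offset table is only a
stand-in for the cross-reference data. It assumes each object is serialized exactly once and
kept in memory, that only the offset-dependent part is built twice (first to learn its length,
then with real offsets), and that the result is exposed as an InputStream without ever being
written to disk.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration only; these names do not exist in PDFBox or QPDF. */
final class BufferedObjectStore {
    // Each object body is serialized once and kept here, never rewritten.
    private final List<byte[]> bodies = new ArrayList<>();

    /** Stores the already-serialized form of one object. */
    void add(String serializedObject) {
        bodies.add(serializedObject.getBytes(StandardCharsets.US_ASCII));
    }

    /**
     * Builds the offset table, the only part that depends on byte positions.
     * Offsets are zero-padded to a fixed width, so the table's length does
     * not change when the real offsets are filled in.
     */
    private byte[] header(long firstBodyOffset) {
        StringBuilder sb = new StringBuilder();
        long offset = firstBodyOffset;
        for (byte[] body : bodies) {
            sb.append(String.format(Locale.ROOT, "%010d\n", offset));
            offset += body.length;
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    /** Exposes the whole output as a stream without writing a file. */
    InputStream asInputStream() {
        // First pass: placeholder offsets, only to measure the header.
        int headerLength = header(0).length;
        // Second pass: the real offsets, now that the header length is known.
        byte[] realHeader = header(headerLength);

        List<InputStream> parts = new ArrayList<>();
        parts.add(new ByteArrayInputStream(realHeader));
        for (byte[] body : bodies) {
            parts.add(new ByteArrayInputStream(body));
        }
        // Chain the buffered pieces; nothing is copied or re-serialized.
        return new SequenceInputStream(Collections.enumeration(parts));
    }
}
```

A caller would add the serialized objects in their final order and then read from
asInputStream(). A real linearized file has more offset-dependent parts than this sketch
(hint tables, for instance), but the principle of buffering the bulk and recomputing only the
small position-dependent pieces is the same.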

Regardless of whether you wish to put work into integrating this, I would appreciate
an answer.

> Contribution of PDF Linearization
> ---------------------------------
>
>                 Key: PDFBOX-4567
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4567
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Utilities, Writing
>            Reporter: Jonathan
>            Priority: Major
>
> We've finally gotten approval to publish our PDF linearization. How should we do it?
> I thought about publishing our current source as a fork of PDFBox on GitHub; then we could
> discuss the best way to integrate the module into your API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
