pdfbox-dev mailing list archives

From "Craig Ringer (Issue Comment Edited) (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (PDFBOX-1256) [PATCH] Split PDFStreamEngine, moving functionality to simpler stream processor base class
Date Wed, 14 Mar 2012 06:29:45 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229017#comment-13229017
] 

Craig Ringer edited comment on PDFBOX-1256 at 3/14/12 6:29 AM:
---------------------------------------------------------------

Patch is against https://svn.apache.org/repos/asf/pdfbox/trunk@1299358

Diffstat:

 .../org/apache/pdfbox/util/PDFStreamEngine.java    |  191 ++++++++-----------
 .../org/apache/pdfbox/util/PDFStreamProcessor.java |  202 ++++++++++++++++++++
 .../pdfbox/util/operator/OperatorProcessor.java    |   35 ++--
 .../util/operator/OperatorProcessorHandler.java    |   43 ++++
 4 files changed, 346 insertions(+), 125 deletions(-)
 create mode 100644 pdfbox/src/main/java/org/apache/pdfbox/util/PDFStreamProcessor.java
 create mode 100644 pdfbox/src/main/java/org/apache/pdfbox/util/operator/OperatorProcessorHandler.java
                
      was (Author: ringerc):
    Diffstat:

 .../org/apache/pdfbox/util/PDFStreamEngine.java    |  191 ++++++++-----------
 .../org/apache/pdfbox/util/PDFStreamProcessor.java |  202 ++++++++++++++++++++
 .../pdfbox/util/operator/OperatorProcessor.java    |   35 ++--
 .../util/operator/OperatorProcessorHandler.java    |   43 ++++
 4 files changed, 346 insertions(+), 125 deletions(-)
 create mode 100644 pdfbox/src/main/java/org/apache/pdfbox/util/PDFStreamProcessor.java
 create mode 100644 pdfbox/src/main/java/org/apache/pdfbox/util/operator/OperatorProcessorHandler.java

                  
> [PATCH] Split PDFStreamEngine, moving functionality to simpler stream processor base
class
> ------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1256
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1256
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, Text extraction
>    Affects Versions: 1.7.0, 2.0.0
>         Environment: N/A
>            Reporter: Craig Ringer
>            Priority: Minor
>              Labels: api, refactoring
>         Attachments: 0002-New-PDFStreamProcessor-base-of-PDFStreamEngine-adds-.patch
>
>
> The attached patch restructures PDFStreamEngine to move the basic functionality of invoking
callbacks for each operator in a stream into a parent class. The parent class knows nothing
about the meaning of operators; it just invokes handlers with accumulated arguments whenever
it encounters an operator. PDFStreamEngine retains all the "knowledge" of what those operators
mean, the state of the graphics state stack, etc.
> The purpose of the change is to make it simpler and easier to use PDFBox's PDF stream
processor/parser code without dealing with the full features of PDFStreamEngine with its built-in
operator handlers, awareness of the graphics stack, etc., when that functionality isn't required.
Specifically, I needed to write a tool that copies a PDF stream, renaming resource references
as it goes but otherwise leaving it unchanged. I wanted to handle all operators including
future or unknown ones, and only needed to special-case a couple of them. PDFStreamEngine
was poorly suited to that because it doesn't support a default handler fallback, tries to
"understand" the stream, etc. Rather than write a new class that duplicated much of PDFStreamEngine
I thought I'd try to factor the required functionality out, so others could use it too.
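> As a rough illustration of the use case described above (copy a stream, special-case a couple
of operators, pass everything else through a default handler), here is a minimal sketch. Every
name in it (TokenHandler, SketchStreamProcessor, RenameDemo) is hypothetical and is not taken
from the attached patch. It also assumes tokens are plain strings, whereas the real parser
works on PDFBox's own token objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical one-method callback; not the interface from the attached patch.
interface TokenHandler {
    void handle(String operator, List<String> operands, StringBuilder out);
}

class SketchStreamProcessor {
    private final Map<String, TokenHandler> handlers = new HashMap<>();
    private TokenHandler defaultHandler;

    void register(String operator, TokenHandler handler) { handlers.put(operator, handler); }
    void setDefaultHandler(TokenHandler handler) { defaultHandler = handler; }

    // Simplification: anything that looks like a name, number, string or array is an operand.
    static boolean isOperand(String token) {
        char c = token.charAt(0);
        return c == '/' || c == '(' || c == '[' || c == '-' || c == '+' || c == '.'
                || Character.isDigit(c);
    }

    // Writes "operand operand operator" on one line.
    static void emit(String operator, List<String> operands, StringBuilder out) {
        for (String operand : operands) {
            out.append(operand).append(' ');
        }
        out.append(operator).append('\n');
    }

    String process(List<String> tokens) {
        StringBuilder out = new StringBuilder();
        List<String> operands = new ArrayList<>();
        for (String token : tokens) {
            if (isOperand(token)) {
                operands.add(token);   // accumulate until an operator arrives
                continue;
            }
            // Explicit handler if one is registered, otherwise the fallback.
            TokenHandler handler = handlers.getOrDefault(token, defaultHandler);
            if (handler != null) {
                handler.handle(token, operands, out);
            }
            operands = new ArrayList<>();
        }
        return out.toString().trim();
    }
}

class RenameDemo {
    static String run() {
        Map<String, String> renames = new HashMap<>();
        renames.put("/F1", "/F7");
        renames.put("/Im0", "/Im3");

        // Rewrites the first operand if it is a renamed resource, then copies the rest.
        TokenHandler renameFirst = (op, args, out) -> {
            List<String> copy = new ArrayList<>(args);
            if (!copy.isEmpty()) {
                copy.set(0, renames.getOrDefault(copy.get(0), copy.get(0)));
            }
            SketchStreamProcessor.emit(op, copy, out);
        };

        SketchStreamProcessor p = new SketchStreamProcessor();
        p.register("Tf", renameFirst);
        p.register("Do", renameFirst);
        p.setDefaultHandler(SketchStreamProcessor::emit);  // unknown operators pass through
        return p.process(Arrays.asList("/F1", "12", "Tf", "/Im0", "Do", "q", "Q"));
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The processor itself never interprets an operator. Only the two registered handlers know that
the first operand of Tf and Do is a resource name, and anything unrecognised is copied as-is.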
> The changes should be backward compatible with existing code that uses PDFStreamEngine.
No changes in any PDFStreamEngine clients in PDFBox were required for the test suite to pass,
text extraction tool to work, etc. Nonetheless, it's possible you'll only consider these changes
for inclusion in PDFBox 2.0, in which case they can be cleaned up to remove some of the backward
compatibility crap that's currently in them. Let me know.
> In terms of open issues or TODOs, the class naming could probably use work. I can't rename
PDFStreamEngine or OperatorProcessor for backward compatibility reasons, so I've had to come
up with more contrived names than I'd like.
> The logic of the changes is:
> - Move content stream argument accumulation and operator callback functionality into
new PDFStreamProcessor class
> - Add support for a default (fallback) handler to PDFStreamProcessor so operators not
explicitly matched may be handled
> - Modify PDFStreamEngine to extend PDFStreamProcessor, retaining all its existing methods,
though some are now inherited.
> - Deprecate the properties-map based configuration of PDFStreamEngine because it'll be
fragile whenever more than one classloader is in use. Add PDFStreamProcessor.replaceOperatorProcessors(...)
for equivalent functionality using a type-safe, multi-classloader-safe HashMap of operator
names to handler instances. This isn't added as a ctor override because operator handler registration/unregistration
methods are not final (to preserve compatibility with PDFStreamEngine) and if overridden,
they might use data from a not-yet-initialized derived class. If a ctor override is required
then registerOperatorProcessor must be made final, breaking BC with PDFStreamEngine.
> - Deprecate OperatorProcessor (the PDFStreamEngine operator handler class). Instances
of this are bound to a particular PDFStreamEngine via the `context' property and they carry
state when they don't have to. They're also an abstract class, so handlers can't extend any
other class. OperatorProcessor based handlers continue to be supported just fine via a simple
wrapper that's used automatically where required.
> - Introduce new OperatorProcessorHandler interface to replace OperatorProcessor. It's
a simple one-method interface that passes the PDFStreamProcessor as an argument, so application
designers are free to choose whether to tie their OperatorProcessorHandler implementations
to PDFStreamProcessor instances or whether they want to re-use the same handler on many different
processors. This change is useful for my app and removes unnecessary stateful API, but isn't
strictly necessary and can be dropped while retaining the PDFStreamEngine / PDFStreamProcessor
split. As part of the API change, new-interface handlers are passed the original arguments
array rather than a copy; if they want a copy of the arguments array they have to take it
themselves, so that resources aren't wasted copying the array when handlers don't actually
need it copied.
> - Add compatibility code to PDFStreamEngine to ensure that OperatorProcessor implementations
are wrapped in a helper that translates OperatorProcessorHandler interface usage to the usage
required by OperatorProcessor. All the wrapper does is set the context (which PDFStreamEngine
seems to do before every handler call) then pass a copy of the arguments array.
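> The wrapper described in the last item can be pictured roughly as follows. This is only a
sketch under assumptions: Engine, LegacyProcessor, Handler and LegacyAdapter are hypothetical
stand-ins with guessed signatures, not the classes or method signatures in the patch. It shows
the two things the wrapper is said to do: set the context, then hand over a copy of the
arguments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the stream engine a legacy handler is bound to.
class Engine { }

// Hypothetical stand-in for the old abstract, stateful handler style.
abstract class LegacyProcessor {
    protected Engine context;

    void setContext(Engine engine) { context = engine; }

    abstract void process(String operator, List<Object> arguments);
}

// Hypothetical stand-in for the new stateless one-method interface.
interface Handler {
    void process(Engine engine, String operator, List<Object> arguments);
}

// Adapts an old-style handler to the new interface.
final class LegacyAdapter implements Handler {
    private final LegacyProcessor delegate;

    LegacyAdapter(LegacyProcessor delegate) { this.delegate = delegate; }

    @Override
    public void process(Engine engine, String operator, List<Object> arguments) {
        delegate.setContext(engine);                             // bind before every call
        delegate.process(operator, new ArrayList<>(arguments));  // old handlers get a copy
    }
}

class AdapterDemo {
    public static void main(String[] args) {
        LegacyProcessor legacy = new LegacyProcessor() {
            @Override
            void process(String operator, List<Object> arguments) {
                arguments.clear();  // mutating the copy must not affect the caller
            }
        };
        List<Object> operands = new ArrayList<>(Arrays.asList("/F1", 12));
        new LegacyAdapter(legacy).process(new Engine(), "Tf", operands);
        System.out.println(operands.size());
    }
}
```

Handlers written directly against the new-style interface receive the caller's list and skip
the copy, which is the saving the description mentions.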
> I'm aware that this is a non-trivial change I'm proposing, but I think it significantly
improves the API (especially once the BC stuff can be removed for PDFBox 2.0) and makes it
easier to use this functionality.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
