corinthia-dev mailing list archives

From "Dennis E. Hamilton (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (COR-31) Identification of Document Format Tool Progressions: Access, Creation, Testing, Assessment, Validation, Forensics
Date Sat, 17 Jan 2015 01:38:34 GMT

    [ https://issues.apache.org/jira/browse/COR-31?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281117#comment-14281117 ]

Dennis E. Hamilton commented on COR-31:
---------------------------------------

I wandered off into the forensic cases and did not address the validation, analysis, and assessment
bit.

There are some validators for ODF documents.  The ODF Toolkit has one.  Although it is
Java based, it can probably be used and even adapted if we choose.  I know there are libraries
and tools with respect to OPC and probably OOXML, and POI may have some of that.  There are
also OOXML toolkits that are not intended so much for validation and analysis, but they are worth
looking at just to identify available and known techniques.

Validators can also be adapted for assessment of documents.  That is, what features do the
documents use and how does that affect a given interoperability case?

There's another form of assessment.  That has to do with the level of support there is for
a format in, say, DocFormat, and what is done with the unsupported or under-supported parts.
 In processing, one might simply indicate that there are features that are not preserved or
at least not presented and get on with it.  But a tool that is more specific about that is
valuable for document-assessment purposes as well as for our own analysis and troubleshooting, and
as a support tool for super-users.  It can also matter when assessing test documents and especially
documents that are contrived to break out of the envelope of what is supported.

I would think that as we ratchet up, there are more opportunities for companion tools and
also useful libraries in all of these areas.

> Identification of Document Format Tool Progressions: Access, Creation, Testing, Assessment,
Validation, Forensics
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: COR-31
>                 URL: https://issues.apache.org/jira/browse/COR-31
>             Project: Corinthia
>          Issue Type: Task
>            Reporter: Dennis E. Hamilton
>              Labels: document-forensics, document-format, document-standards, document-validation,
documents, file-structure, test-suite
>
> There are many needs, and opportunities, for command-line and library-level tools that
support the development of processors for different document formats.  
> Many small tools can be developed as part of the application and verification of what
will be larger solutions with regard to particular formats.  
> This task is for identification of which such tools will be defined as work-product and
deliverables for Corinthia, even in an initial provisional list.  Having identified structure
points for defined deliverables should aid in having different aspects of Corinthia available
for development and testing by many hands and eyes.
> SKETCH
> There are different levels of tools, and the layers provide fixtures for exercising lower
layers of code and also composing them into layers above.
> To be concrete, here is a sketch of the levels of tooling that can be byproducts and
aids in the confirmation of correct handling of a document format.
> There are two "raw" formats that are handled in building document files of interest to
us: text files and Zip packages (or other carriers of composite structures, such as MIME multi-part,
tar files, Microsoft DocFiles, etc.).  
> There are flat file formats atop text-file formats.  Examples are Microsoft RTF, XML,
and HTML.  These are accompanied by character-set encoding variations that must be dealt with.
 There are also cases of linking that arise in these formats.
> RTF is a document format.  XML carries document formats such as the single-file ODF format,
the single-file XML formats defined for Microsoft Office, etc.  There are already HTML-format
usages that provide for fidelity preservation in round trip between HTML and Microsoft Office
formats.  There may be something similar that has lived in OpenOffice.org.  These are very
handy formats for creation of simple test documents that exercise the respective document
models.  They also provide experience with the document formats and efforts to abstract the
document that is represented in those formats.
> Zip usage as carriers raises its own needs for well-defined tools, both for use in the
inspection of document files but also the validation and forensic analysis of the Zip usage
for ODF, OOXML, and other formats, such as ePub.  Now we're dealing with composite document
files with multiple parts using flat formats, such as HTML and XML, and other formats, including
binary formats not mentioned as part of this progressive layering.  There are now more elaborate
structures to abstract from the parts of the Zip package and the cross-references among them.
> These are all tooling opportunities and they support the testing and confirmation of
the development of the document-processing functions that Corinthia makes available.
> The richness of this can be illustrated by the need for forensic and validation tools
and how they may become interdependent.
> Consider the simple verification of a Zip file.  There are two levels of verification
that matter.  
> First there is the fundamental invariant structure that a Zip archive must possess.
 In practical use, it is desirable to rapidly abstract the presence of a correct Zip and its
components.  It is desirable to be able to produce or update one efficiently.   One wants
a fail-safe and resilient response when an unacceptable Zip is encountered.
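
The first level described above (quickly confirming a usable Zip and its components, with a fail-safe response otherwise) could be sketched as follows.  This is only an illustration in Python using the standard-library zipfile module; the function name quick_zip_check is hypothetical and is not part of Corinthia.

```python
import io
import zipfile


def quick_zip_check(data: bytes):
    """Return the member names if `data` opens as a Zip archive, else None.

    This only confirms that the central directory can be located and read.
    It does not decompress any member, so it stays cheap.
    """
    buf = io.BytesIO(data)
    if not zipfile.is_zipfile(buf):
        return None  # fail-safe: no exception escapes for non-Zip input
    try:
        with zipfile.ZipFile(buf) as zf:
            return zf.namelist()
    except zipfile.BadZipFile:
        return None  # looked like a Zip, but the directory is unusable
```

Returning None instead of raising is one possible reading of "fail-safe"; a caller that needs to know why the Zip was rejected would hand the file to a separate inspection tool.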
> At the same time, one wants a way to assess and inspect a Zip that is well-formed or
is considered defective.  A separate tool would be handier for that, but it is needed to support
document processing by providing inspection and reporting of how the Zip is unacceptable.
 That's more involved and not something one wants to endure just to get going working with
a document.  At the same time, there is a good case for some reused common code as well, and
these kinds of tools aid in the confirmation of that code too.
> Suppose a Zip is concluded to be damaged.  Another level goes beyond detection of
damage to determination of how much of the Zip can be recovered and what to do with the areas
of damage.  This is about rescuing documents.  Yet another opportunity.  Yet another elaborate
use that can involve some shared underlying code.
> We're now at the second level and that intersects with the use of a Zip as a particular
kind of document container.  A Zip may be well-formed, but there are additional limitations
and functions that go into recognizing the Zip usage as a carrier of a particular document
format.  It can even be a generic carrier format, such as the Open Packaging Conventions (OPC)
used for carrying OOXML, XPS, and other artifacts, and the OpenDocument 1.2 Package used for
carrying ODF.
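
As a rough illustration of this second level, the sketch below distinguishes the two generic carriers by their well-known marker parts: "[Content_Types].xml" for OPC, and "META-INF/manifest.xml" plus an uncompressed leading "mimetype" member for an ODF package.  It is Python using the standard-library zipfile module; classify_package and its return labels are hypothetical names, and a real tool would check far more of each specification.

```python
import io
import zipfile


def classify_package(data: bytes) -> str:
    """Guess which generic carrier convention a well-formed Zip follows."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        if "[Content_Types].xml" in names:
            return "opc"
        if "META-INF/manifest.xml" in names:
            # infolist() follows central-directory order, which is assumed
            # here to match the physical order of the members.
            first = zf.infolist()[0]
            if (first.filename == "mimetype"
                    and first.compress_type == zipfile.ZIP_STORED):
                return "odf:" + zf.read("mimetype").decode("ascii")
            return "odf"
        return "zip"
```

A package that classifies as plain "zip" is not necessarily damaged; it simply carries none of the markers this sketch looks for.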
> There need to be analysis and inspection tools at this second level of generic Zip usage.
 This also has a cross-over value in the forensic problem of recovering what is recoverable
in a damaged Zip archive.  When it is known what additional structure is expected to be present,
this can inform the identification of breakage and determination of loss.
> It's not all one-sided.  What appears to be a well-formed Zip package for a given document
format can still expose damage in the recording or compression (oh yes, compression and decompression)
of any of its parts.
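
One way to surface that kind of damage is to read every part to the end, so that decompression errors and CRC-32 mismatches are reported per member instead of aborting the whole package.  The sketch below is Python with the standard-library zipfile and zlib modules; damaged_members is a hypothetical name used only for illustration.

```python
import io
import zipfile
import zlib


def damaged_members(data: bytes):
    """Return the names of members that fail decompression or CRC-32 checks."""
    bad = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            try:
                with zf.open(info) as member:
                    # Read to the end; zipfile verifies the CRC-32 once the
                    # whole member has been consumed.
                    while member.read(65536):
                        pass
            except (zipfile.BadZipFile, zlib.error):
                bad.append(info.filename)
    return bad
```

The standard library also offers ZipFile.testzip(), which stops at the first bad member; collecting all of them, as here, is closer to what an inspection or recovery tool would want.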
> This sketch is still at the plumbing level.  The abstraction of document features is
yet to happen.  That is another level up.
> This is all just to point out how many opportunities for tools and supporting libraries
there are. The tools are important for bootstrapping up the levels of Corinthia and for being
able to check our own work, to devise tests and demonstrations, and to provide forensic support
in the face of problems that may arise in the software or simply in circumstances that arise
for users.
> The idea behind this task and its subtasks is to see what could be identified as point
deliverables, even if fundamentally for our own work process, so that they become definable
and something to work on, to be available in higher levels of operation, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
