corinthia-dev mailing list archives

From Dave Fisher
Subject Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog
Date Wed, 04 Feb 2015 21:49:50 GMT
Yes, it is interesting to me. I know that PDF is a format based on a set of PostScript
functions and an object layout specification. It is not like PNG, which is a raster bitmap;
PDF is a vector drawing spec. My interest is in pulling out the content, both text and shapes,
into a useful set of objects. At this time I am not so interested in other features such as
forms, embedded files, and output.

I can read the PDF into an object structure and output HTML5. I can also output the objects
into roughly equivalent PPTX slides using Apache POI.

Corinthia is useful to me in two ways:

(1) As an HTML5 format that targets interchange with office document formats.

(2) As an intermediate format that may be exported to any format that makes sense.

So I am looking for Corinthia to allow pluggable DocFormats.


On Feb 4, 2015, at 11:13 AM, Louis S wrote:

> Louis
>> On 4 Feb 2015, at 13:55, jan i wrote:
>>> On 4 February 2015 at 19:51, Louis S wrote:
>>> I posted on this to see if pdfbox could offer insight as it is taken up.
>>> Dave pointed out that the functionality of pdfbox was interesting to his
>>> company.
>> And I think your posting was interesting information (such information is
>> needed to see what moves out there). But I do not think we currently should
>> think about putting it into Corinthia.
> No objections.
>> rgds
>> jan i.
>>> Louis
>>>> On 4 Feb 2015, at 12:03, jan i wrote:
>>>> On Wednesday, February 4, 2015, Peter Kelly wrote:
>>>>>> On 4 Feb 2015, at 5:47 pm, Edward Zimmermann wrote:
>>>>>> Does this have anything to do with Corinthia? No. Corinthia is about
>>>>>> content and especially word processing formats (OOXML, ODF etc.).
>>>>>> Corinthia is at its core about pragmatic fidelity. The point of the
>>>>>> bidirectional transformation model is to be able to reduce fidelity
>>>>>> demands. Unless the project wants to get sidetracked into HiFi rendering
>>>>>> (of DOCX or ODT) it's completely outside of the scope.
>>>>> I think of PDF in the same way as I do PNG. It’s intended as an output
>>>>> format, not an input format. I know there are tools out there which are
>>>>> effectively half of an OCR system which can reconstruct a source document
>>>>> by inferring the logical structure from the layout (e.g. where a paragraph
>>>>> begins and ends), though this is quite a difficult problem and I’m not sure
>>>>> that it’d be within the scope of Corinthia (though if someone has ideas on
>>>>> this and wants to work on it, I’m all for it - it’s just a very difficult
>>>>> and very different task to writing filters for all the other formats we’ve
>>>>> discussed).
>>>> +1 I think we currently have other more important tasks in corinthia.
>>>> rgds
>>>> jan i
>>>>> On the other side is output to PDF - that is, typesetting. This is
>>>>> something I also think would be outside the scope of the project (at least
>>>>> based on my understanding of people’s interests to date). We basically rely
>>>>> on separate programs to do the typesetting of a document produced by
>>>>> the library, e.g. LaTeX, WebKit/other browser engines.
>>>>> --
>>>>> Dr. Peter M. Kelly
>>>>> PGP key fingerprint: 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966
>>>> --
>>>> Sent from My iPad, sorry for any misspellings.
