pdfbox-dev mailing list archives

From "Tilman Hausherr (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-4337) Could extract all elements(Text, Image, Table, etc) dynamically in sequence from pdf file
Date Thu, 11 Oct 2018 06:35:00 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646033#comment-16646033 ]

Tilman Hausherr commented on PDFBOX-4337:

There is no such thing in PDFBox and we're not thinking about doing this. PDF is very complex
(it's not like HTML) and there are many ways to do the same thing. For example, what you think
is an "image" might instead be a vector graphic, or a puzzle of 1000 tiny images; a "table" is
usually a vector graphic, i.e. there is no table concept in PDF; and two texts that are 99%
identical can be made of 100% different character codes due to font subsetting. To understand
what I mean, open a PDF file in PDFDebugger and look around. Also read the [PDF specification|https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf].
There are a few "PDF to XML" projects on GitHub, but I doubt they will really help you; they
only move the problem elsewhere. The only thing that may help is the TestPDFToImage.java test
in the source code, which shows how to compare two PDF renderings.
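
To give a rough idea of what comparing two renderings can look like, here is a minimal sketch. It is not the code from TestPDFToImage.java; the class and method names (RenderingDiff, countDifferingPixels, diffImage) are made up for illustration. It assumes each page has already been rendered to a BufferedImage, which in PDFBox 2.x can be done with PDFRenderer.renderImageWithDPI(pageIndex, dpi) on a loaded PDDocument. It uses only the JDK.

```java
import java.awt.image.BufferedImage;

/**
 * Pixel-level comparison of two rendered pages.
 * Hypothetical helper for illustration; not part of PDFBox.
 */
final class RenderingDiff {

    private RenderingDiff() {
    }

    /**
     * Counts the pixels whose ARGB value differs between the two images.
     * Returns -1 if the dimensions differ, because the pages cannot be
     * compared pixel by pixel in that case.
     */
    static long countDifferingPixels(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return -1;
        }
        long differing = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    differing++;
                }
            }
        }
        return differing;
    }

    /**
     * Builds a diff image: changed pixels are red, unchanged pixels white.
     * Returns null if the dimensions differ.
     */
    static BufferedImage diffImage(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return null;
        }
        BufferedImage diff = new BufferedImage(
                a.getWidth(), a.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                boolean changed = a.getRGB(x, y) != b.getRGB(x, y);
                diff.setRGB(x, y, changed ? 0xFF0000 : 0xFFFFFF);
            }
        }
        return diff;
    }
}
```

Render the same page index of both files at the same DPI, pass the two images to countDifferingPixels, and write the result of diffImage to a PNG with ImageIO to see where the pages differ. This only tells you where pixels changed, not whether the change is an "insertion" or a "modification"; antialiasing or font differences can also produce small diffs, so a tolerance may be needed in practice.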

> Could extract all elements(Text, Image, Table, etc) dynamically in sequence from pdf
> ------------------------------------------------------------------------------------------
>                 Key: PDFBOX-4337
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4337
>             Project: PDFBox
>          Issue Type: Wish
>            Reporter: RuhongCai
>            Priority: Major
>         Attachments: sample_pdf.pdf
> We are trying to compare two pdf files in run time and detect the "insertion" , "deletion",
> "modification" between two files.
> PDFBOx works well for "extract Text for two files", but it is not enough for us,
> Does any api in pdfbox or any workaround way to "read/extract" all component(Table, image,Text,
> etc) from pdf files in sequence and return some related useful information.
> The attached is sample file which contains Text, Table, image, not-well format.  Read
> element/component in sequence
> could do further comparison work. 
> [^sample_pdf.pdf]
> Many thanks!

This message was sent by Atlassian JIRA
