pdfbox-dev mailing list archives

From Andreas Lehmkühler (JIRA) <j...@apache.org>
Subject [jira] Commented: (PDFBOX-7) extract information from tagged PDF
Date Sun, 07 Mar 2010 17:49:27 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842456#action_12842456 ]

Andreas Lehmkühler commented on PDFBOX-7:
-----------------------------------------

Johannes, could you please rebuild the patch, if possible as a single file? I tried to apply
the attached patches, and the result was duplicated code and compile errors. The patches seem
to be mixed up somehow.

> extract information from tagged PDF
> -----------------------------------
>
>                 Key: PDFBOX-7
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-7
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: PDModel
>         Attachments: PDFBOX-7_patch_00.txt, PDFBOX-7_patch_01.txt, PDFBOX-7_patch_02.txt,
> PDFBOX-7_patch_03.txt, PDFMarkedContentExtractor.properties
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=805623
> Originally submitted by benlitchfield on 2003-09-13 07:38.
> Add the ability to extract information from a tagged PDF 
> document.  See taggedPDF.pdf for an example.
> [comment on SourceForge]
> Originally sent by qumar.
> Logged In: YES 
> user_id=1468838
> Hi,
> - we have to parse the PDF object structure tree; all
> structural elements are inside the object tree (see e.g.
> PDF Reference 1.4, chapter 9.6 "Logical Structure").
> - parse the PDF page streams to extract drawing and text
> operations; these contain the actual content of the
> structural elements. This content is surrounded by BMC/EMC
> tags, which indicate the element object to which the
> contained content belongs. This is what I got from the PDF Reference.
> Regards,
> Qumar.
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> http://www.irs.gov/pub/irs-access/f1040ez_accessible.pdf
> would be a good form to start with.
> If you notice, they are putting labels on the form fields.
> These labels contain metadata critical to building tax
> software in rapid fashion.  Without this metadata, the
> name of the form field is meaningless. It would be nice to 
> extract this information so I can combine it with other 
> data about the field (name, type, location, etc).  I 
> already know PDFBox can extract the other information about 
> the fields.  I haven't done it with PDFBox, but I did it 
> with iText.
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> More comments from users
> Tagged PDF will be a big thing in government because 
> federal government procurement of Acrobat publishing 
> technology falls under Section 508.  States will likely 
> follow.
>  see:
> www.section508.gov
> http://www.irs.gov/pub/irs-access/
> or
> ftp://ftp.irs.gov/pub/irs-access/
> [comment on SourceForge]
> Originally sent by qumar.
> Logged In: YES 
> user_id=1468838
> Hi,
> I was reading the PDF specification and learned that the
> structure information of a PDF is in the PDSEdit layer.
> The PDSEdit layer gives access to the structure tree within a
> PDF, and its methods and objects are prefixed by PDS. So
> how can we get access to the PDSEdit layer of a PDF?
> [comment on SourceForge]
> Originally sent by qumar.
> Logged In: YES 
> user_id=1468838
> It would be nice if PDFBox could provide the ability to
> extract information from tagged PDF. As Adobe Acrobat Reader
> shows the tags of a PDF, PDFBox should also try to read
> tagged PDFs.
> For example, say we have a PDF file with para1 under
> header1, para2 under header2, and a table with rows and
> columns, something like
>  
> Header1 
> This is a para 1 ,it describes about a disease.  
> Header2 
> This is a para2,describes remedies of disease. 
> Table 
> A B  
> C D 
>  
>  
> Now the tagged PDF looks like this in Adobe Acrobat Reader:
>  
> <Heading 1> 
> Header1 
> <Normal>  
> This is a para 1 ,it describes about a disease. 
> <Heading 1> 
> Header2 
> <Normal>  
> This is a para2,describes remedies of disease. 
> <Heading 1> 
> Table 
> <Table> 
> <TBody> 
> <TR> 
> <TD> 
> <Normal> 
> A 
> <TD> 
> <Normal> 
> B 
> <TR> 
> <TD> 
> <Normal> 
> C 
> <TD> 
> <Normal> 
> D 
> How can we extract Heading 1, Heading 2, and the tabular data
> using PDFBox?
> This is a good feature which should be added to the PDFBox
> armory.
> Please provide this feature.
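
For illustration only, here is a minimal sketch of the second step qumar describes
in the quoted history: scanning a page content stream for marked-content sequences
and collecting the text shown inside each one. It is not the attached patch and it
does not use any PDFBox class. The class name MarkedContentSketch, the Span type and
the extract method are hypothetical, and the input is assumed to be an already
decoded content stream with simple literal strings. Besides BMC it also handles BDC,
since in tagged PDF the link to a structure element is normally an /MCID entry in
the property dictionary of a BDC operator.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: hypothetical names, no PDFBox classes involved.
final class MarkedContentSketch {

    /** Text found between one BMC/BDC operator and its matching EMC. */
    static final class Span {
        final String tag;   // e.g. "H1" or "P"
        final int mcid;     // -1 when the operands carry no /MCID entry
        final StringBuilder text = new StringBuilder();

        Span(String tag, int mcid) {
            this.tag = tag;
            this.mcid = mcid;
        }
    }

    // One token is a << >> dictionary, a ( ) string, a /Name, or a bare word.
    // Assumption: no nested dictionaries, no unescaped nested parentheses,
    // no hex strings.
    private static final Pattern TOKEN = Pattern.compile(
            "<<.*?>>|\\((?:\\\\.|[^\\\\)])*\\)|/[^\\s/<>()\\[\\]]+|[^\\s/<>()\\[\\]]+",
            Pattern.DOTALL);
    private static final Pattern MCID = Pattern.compile("/MCID\\s+(\\d+)");

    static List<Span> extract(String contentStream) {
        List<Span> spans = new ArrayList<>();      // in order of opening
        Deque<Span> open = new ArrayDeque<>();     // currently open sequences
        List<String> operands = new ArrayList<>(); // operands before the next operator
        Matcher m = TOKEN.matcher(contentStream);
        while (m.find()) {
            String tok = m.group();
            char c = tok.charAt(0);
            if (c == '/' || c == '(' || tok.startsWith("<<")
                    || tok.matches("[-+]?[0-9.]+")) {
                operands.add(tok);
                continue;
            }
            // Anything else is treated as an operator.
            if (tok.equals("BMC") || tok.equals("BDC")) {
                Span span = new Span(tagOf(operands), mcidOf(operands));
                spans.add(span);
                open.push(span);
            } else if (tok.equals("EMC")) {
                if (!open.isEmpty()) {
                    open.pop();
                }
            } else if ((tok.equals("Tj") || tok.equals("TJ")) && !open.isEmpty()) {
                for (String operand : operands) {
                    if (operand.charAt(0) == '(') {
                        // Strip the enclosing parentheses; escapes are left as-is.
                        open.peek().text.append(operand, 1, operand.length() - 1);
                    }
                }
            }
            operands.clear();
        }
        return spans;
    }

    // The tag is the first name operand, without its leading slash.
    private static String tagOf(List<String> operands) {
        for (String operand : operands) {
            if (operand.startsWith("/")) {
                return operand.substring(1);
            }
        }
        return "";
    }

    // Reads /MCID from an inline property dictionary, if there is one.
    private static int mcidOf(List<String> operands) {
        for (String operand : operands) {
            if (operand.startsWith("<<")) {
                Matcher m = MCID.matcher(operand);
                if (m.find()) {
                    return Integer.parseInt(m.group(1));
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String stream = "/H1 <</MCID 0>> BDC BT /F1 12 Tf (Header1) Tj ET EMC\n"
                + "/P <</MCID 1>> BDC BT [(This is ) -20 (para 1)] TJ ET EMC";
        for (Span span : extract(stream)) {
            System.out.println(span.tag + " mcid=" + span.mcid + ": " + span.text);
        }
    }
}
```

The sample stream in main is made up to mirror the Header1/para 1 example above. A
real extractor would still have to resolve each MCID against the structure tree
(the first step in the quoted comment) and decode the strings through the font
encoding, neither of which this sketch attempts.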

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

