pdfbox-users mailing list archives

From Olaf Drümmer <olafl...@callassoftware.com>
Subject Re: how to create structure for an existing PDF document
Date Thu, 12 Mar 2015 11:32:44 GMT
Hi Klaus,

What kind of structure do you wish to create? Structure in the sense of tagged PDF, or just
some other logical structure? And in either case, for what purpose?

Olaf


On 12 Mar 2015, at 11:54, "Henning, Klaus" <KHenning@eitco.de> wrote:

> Hi,
> 
> we want to create a structure for an existing PDF document. We have PDF documents from
> a scanner that contain images but no structure.
> We want to implement a program that creates the structure so we can add alternate descriptions
> to the images based on Tesseract OCR recognition.
> 
> Our first approach creates a structure, but the structure seems to be incomplete when
> checked with Adobe Acrobat. We can't find any hints in the PDFBox examples
> or documentation on how to do this.
> 
> Our Code snippet:
> 
>             try {
>                    PDDocument document = PDDocument.load("test.pdf");
>                    PDDocumentCatalog documentCatalog = document.getDocumentCatalog();
> 
>                    PDStructureTreeRoot treeRoot = document.getDocumentCatalog().getStructureTreeRoot();
> 
>                    if(treeRoot == null){
>                           COSDictionary cosDictionary = documentCatalog.getCOSDictionary();
>                           PDStructureTreeRoot newTreeRoot = new PDStructureTreeRoot();
> 
>                           //iterate over pages
>                           List<?> pages = documentCatalog.getAllPages();
>                           for (Object object : pages) {
>                                  PDPage page = (PDPage) object;
>                                  Map<String,PDXObject> mapObjects = page.getResources().getXObjects();
>                                  for (PDXObject pdxObject : mapObjects.values()) {
>                                        if(pdxObject instanceof PDXObjectImage){
>                                               PDXObjectImage objectImage = (PDXObjectImage)pdxObject;
>                                               // new structure element for the image
>                                               PDStructureElement structureElement = new PDStructureElement(StandardStructureTypes.Figure, newTreeRoot);
>                                               PDMarkedContent markedContent = new PDMarkedContent(COSName.IMAGE, new COSDictionary());
>                                               markedContent.addXObject(objectImage);
>                                               structureElement.appendKid(markedContent);
>                                               structureElement.setAlternateDescription("NEW ALTERNATE DESCRIPTION");
>                                               newTreeRoot.appendKid(structureElement);
>                                        }
>                                  }
>                           }
> 
>                           documentCatalog.setStructureTreeRoot(newTreeRoot);
>                           treeRoot = documentCatalog.getStructureTreeRoot();
>                    }
> 
>                    document.save("testWithTree.pdf");
>                    document.close();
>             }
>             catch (IOException e) {
>                    e.printStackTrace();
>             }
>             catch (COSVisitorException e) {
>                    e.printStackTrace();
>             }
> 
> Can someone help us here?
> 
> Best regards,
> 
> Klaus Henning
> 
> 

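For readers following the tagged-PDF reading of the question: the quoted snippet builds
structure elements but never touches the page content streams. In tagged PDF, each structure
element normally points at a marked-content sequence (BDC ... EMC with an /MCID) in the page
content, the structure tree root carries a /ParentTree that maps each page's /StructParents key
back to the owning elements, and the catalog sets /MarkInfo with /Marked true. The sketch below
models only those relationships in plain Java. It deliberately uses no PDFBox classes; every
name in it (TaggedImageSketch, StructElem, TaggedModel, tagScannedPages) and the image resource
name "Im0" are assumptions made for illustration, with one scanned image per page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the pieces tagged PDF expects; no PDFBox types are used.
final class TaggedImageSketch {

    // One structure element (/StructElem) in the logical structure tree.
    static final class StructElem {
        final String type;                 // /S, e.g. "Document" or "Figure"
        final StructElem parent;           // /P
        final List<StructElem> kids = new ArrayList<>();
        String alt;                        // /Alt
        int pageIndex = -1;                // /Pg (page the content lives on)
        int mcid = -1;                     // /K given as a marked-content ID

        StructElem(String type, StructElem parent) {
            this.type = type;
            this.parent = parent;
            if (parent != null) {
                parent.kids.add(this);
            }
        }
    }

    // Everything that has to end up in the file, gathered in one place.
    static final class TaggedModel {
        final StructElem document = new StructElem("Document", null);
        // Content stream text per page (hypothetical, one image "Im0" per page).
        final List<String> pageContent = new ArrayList<>();
        // /ParentTree: page /StructParents key -> elements indexed by MCID.
        final Map<Integer, List<StructElem>> parentTree = new TreeMap<>();
        boolean marked;                    // catalog /MarkInfo << /Marked true >>
    }

    // Creates one /Figure per page and wires it to a marked-content sequence.
    static TaggedModel tagScannedPages(List<String> altTexts) {
        TaggedModel model = new TaggedModel();
        for (int page = 0; page < altTexts.size(); page++) {
            int mcid = 0;                  // first (and only) marked sequence on the page
            StructElem figure = new StructElem("Figure", model.document);
            figure.alt = altTexts.get(page);
            figure.pageIndex = page;
            figure.mcid = mcid;
            // The image drawing operator is wrapped in BDC ... EMC carrying the MCID.
            model.pageContent.add("/Figure <</MCID " + mcid + ">> BDC /Im0 Do EMC");
            // The page's /StructParents key is assumed to equal the page index here.
            List<StructElem> owners = new ArrayList<>();
            owners.add(figure);
            model.parentTree.put(page, owners);
        }
        model.marked = true;
        return model;
    }

    public static void main(String[] args) {
        TaggedModel model = tagScannedPages(
                Arrays.asList("Scanned invoice, page 1", "Scanned invoice, page 2"));
        System.out.println(model.pageContent.get(0));
        System.out.println(model.parentTree.get(1).get(0).alt);
    }
}
```

Translating this into PDFBox calls means rewriting each page's content stream and filling the
parent tree by hand; which PDFBox methods to use for that depends on the version and is not
shown here.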
