pdfbox-commits mailing list archives

From msahy...@apache.org
Subject git commit: PDFBOX-2340 initial commit to pdfbox-docs
Date Fri, 17 Oct 2014 20:38:37 GMT
Repository: pdfbox-docs
Updated Branches:
  refs/heads/master [created] f93b45c38


PDFBOX-2340 initial commit to pdfbox-docs


Project: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/commit/f93b45c3
Tree: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/tree/f93b45c3
Diff: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/diff/f93b45c3

Branch: refs/heads/master
Commit: f93b45c38f8485f9ccbde7c6089e955f1776c246
Parents: 
Author: Maruan Sahyoun <sahyoun@fileaffairs.de>
Authored: Fri Oct 17 22:37:39 2014 +0200
Committer: Maruan Sahyoun <sahyoun@fileaffairs.de>
Committed: Fri Oct 17 22:37:39 2014 +0200

----------------------------------------------------------------------
 LICENSE                                         | 202 +++++++++++++++++++
 README.md                                       |  51 +++++
 .../attachments/workingwithattachments.md       |  69 +++++++
 .../documentcreation/documentcreation.md        |  52 +++++
 docs/1.8/cookbook/fonts/workingwithfonts.md     | 123 +++++++++++
 .../cookbook/metadata/workingwithmetadata.md    |  62 ++++++
 docs/1.8/cookbook/pdfa/pdfacreation.md          |  71 +++++++
 docs/1.8/cookbook/pdfa/pdfavalidation.md        |  81 ++++++++
 .../cookbook/textextraction/textextraction.md   |  96 +++++++++
 9 files changed, 807 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/pdfbox-docs/blob/f93b45c3/LICENSE
----------------------------------------------------------------------
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..5af8995
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,202 @@
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright [yyyy] [name of copyright owner]
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.

http://git-wip-us.apache.org/repos/asf/pdfbox-docs/blob/f93b45c3/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..e324651
--- /dev/null
+++ b/README.md
@@ -0,0 +1,51 @@
+Apache PDFBox Documentation
+===========================
+
+***We are moving our documentation sources over to this site. At this point in time the content is limited.***
+
+The documentation project for [Apache PDFBox](http://pdfbox.apache.org/).
+
+The project provides the content for the documentation, which will be pulled from a build process based on the [ASF Content Management System](http://www.apache.org/dev/cms.html).
+
+Documentation Format
+--------------------
+
+All of the [Apache PDFBox](http://pdfbox.apache.org/) documentation is written with [markdown](http://daringfireball.net/projects/markdown/syntax).
+
+Documentation Structure
+-----------------------
+
+    docs/
+    docs/VERSION
+    docs/VERSION/CHAPTER
+
+Contributing
+------------
+
+### Contribution Guidelines
+
+As a minimum requirement all contributions shall have the [Apache License](http://www.apache.org/licenses/LICENSE-2.0.html#apply) header attached.
+
+For larger contributions, or if you are looking to contribute regularly, we ask you to sign an [ICLA](http://www.apache.org/licenses/#clas).
+
+### Report or Fix an Issue
+
+We use [Apache JIRA](https://issues.apache.org/jira/browse/PDFBOX) as our tracking tool.
+
+### Using Git Pull Requests
+
+Pull requests are welcome!
+
+We appreciate the use of topic branches.
+
+    git checkout -b <JIRA issue number>
+
+    # update or enhance the documentation
+
+    git commit -m "<JIRA issue number> This is my commit message."
+
+    git push origin <JIRA issue number>
+
+    # issue a pull request from your branch <JIRA issue number>
+    
+    # wait for a committer to commit your patch

http://git-wip-us.apache.org/repos/asf/pdfbox-docs/blob/f93b45c3/docs/1.8/cookbook/attachments/workingwithattachments.md
----------------------------------------------------------------------
diff --git a/docs/1.8/cookbook/attachments/workingwithattachments.md b/docs/1.8/cookbook/attachments/workingwithattachments.md
new file mode 100644
index 0000000..5c1315e
--- /dev/null
+++ b/docs/1.8/cookbook/attachments/workingwithattachments.md
@@ -0,0 +1,69 @@
+---
+license: Licensed to the Apache Software Foundation (ASF) under one
+         or more contributor license agreements.  See the NOTICE file
+         distributed with this work for additional information
+         regarding copyright ownership.  The ASF licenses this file
+         to you under the Apache License, Version 2.0 (the
+         "License"); you may not use this file except in compliance
+         with the License.  You may obtain a copy of the License at
+
+           http://www.apache.org/licenses/LICENSE-2.0
+
+         Unless required by applicable law or agreed to in writing,
+         software distributed under the License is distributed on an
+         "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+         KIND, either express or implied.  See the License for the
+         specific language governing permissions and limitations
+         under the License.
+         
+Title: Cookbook - Working with Attachments
+---
+
+# Working with Attachments
+
+## The PDF File Specification
+
+See package:org.apache.pdfbox.pdmodel.common.filespecification  
+See example:EmbeddedFiles  
+
+A PDF can contain references to external files via the file system or a URL to a remote 
+location. It is also possible to embed a binary file into a PDF document.
+
+There are two classes that can be used when referencing a file. PDSimpleFileSpecification
+is a simple string reference to a file (e.g. "./movies/BigMovie.avi"). The simple file
+specification does not allow for any parameters to be set.
+
+The PDComplexFileSpecification is more feature rich and allows for advanced settings on 
+the file reference.
+
+It is also possible to embed a file directly into a PDF. Rather than setting the file
+attribute of the PDComplexFileSpecification, set the EmbeddedFile attribute.
+
+## Adding a File Attachment
+
+PDF documents can contain file attachments that are accessed from the Document->File Attachments
+menu. PDFBox allows attachments to be added to and extracted from PDF documents.
+Attachments are part of the named tree that is attached to the document catalog.
+
+	:::java
+	PDEmbeddedFilesNameTreeNode efTree = new PDEmbeddedFilesNameTreeNode();
+
+	//first create the file specification, which holds the embedded file
+	PDComplexFileSpecification fs = new PDComplexFileSpecification();
+	fs.setFile( "Test.txt" );
+	InputStream is = ...;
+	PDEmbeddedFile ef = new PDEmbeddedFile(doc, is );
+	//set some of the attributes of the embedded file
+	ef.setSubtype( "text/plain" );
+	// data is assumed to be the byte array that backs the input stream above
+	ef.setSize( data.length );
+	ef.setCreationDate( new GregorianCalendar() );
+	fs.setEmbeddedFile( ef );
+
+	//now add the entry to the embedded file tree and set in the document.
+	Map efMap = new HashMap();
+	efMap.put( "My first attachment", fs );
+	efTree.setNames( efMap );
+	//attachments are stored as part of the "names" dictionary in the document catalog
+	PDDocumentNameDictionary names = new PDDocumentNameDictionary( doc.getDocumentCatalog() );
+	names.setEmbeddedFiles( efTree );
+	doc.getDocumentCatalog().setNames( names );
\ No newline at end of file
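
The attachment snippet above leaves `is` and `data` unspecified. The following is a minimal sketch, assuming the attachment content is available in memory, of one way to produce both values with JDK classes only. The class `AttachmentSource` and its methods are hypothetical names used for illustration and are not part of PDFBox.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a PDFBox class): keeps the attachment bytes so that
// both the stream and the size used in the snippet above come from one source.
final class AttachmentSource {
    private final byte[] data;

    AttachmentSource(String text) {
        // Encode explicitly so the size does not depend on the platform charset
        this.data = text.getBytes(StandardCharsets.UTF_8);
    }

    // Value that could be passed to ef.setSize(...)
    int size() {
        return data.length;
    }

    // Fresh stream over the same bytes, e.g. for new PDEmbeddedFile(doc, is)
    InputStream openStream() {
        return new ByteArrayInputStream(data);
    }
}
```

Keeping the byte array and the stream together avoids reporting a size that differs from what is actually embedded.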

http://git-wip-us.apache.org/repos/asf/pdfbox-docs/blob/f93b45c3/docs/1.8/cookbook/documentcreation/documentcreation.md
----------------------------------------------------------------------
diff --git a/docs/1.8/cookbook/documentcreation/documentcreation.md b/docs/1.8/cookbook/documentcreation/documentcreation.md
new file mode 100644
index 0000000..173d61c
--- /dev/null
+++ b/docs/1.8/cookbook/documentcreation/documentcreation.md
@@ -0,0 +1,52 @@
+Title: Cookbook - Document Creation
+
+## Document Creation
+
+### Create a blank PDF
+
+This small sample shows how to create a new PDF document using PDFBox.
+
+	:::java
+	// Create a new empty document
+	PDDocument document = new PDDocument();
+		
+	// Create a new blank page and add it to the document
+	PDPage blankPage = new PDPage();
+	document.addPage( blankPage );
+		
+	// Save the newly created document
+	document.save("BlankPage.pdf");
+        
+	// finally make sure that the document is properly
+	// closed.
+	document.close();
+
+### Hello World using a PDF base font
+
+This small sample shows how to create a new document and print the text "Hello World" using one of the PDF base fonts.
+
+	:::java
+	// Create a document and add a page to it
+	PDDocument document = new PDDocument();
+	PDPage page = new PDPage();
+	document.addPage( page );
+	
+	// Create a new font object selecting one of the PDF base fonts
+	PDFont font = PDType1Font.HELVETICA_BOLD;
+	
+	// Start a new content stream which will "hold" the to be created content
+	PDPageContentStream contentStream = new PDPageContentStream(document, page);
+	
+	// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
+	contentStream.beginText();
+	contentStream.setFont( font, 12 );
+	contentStream.moveTextPositionByAmount( 100, 700 );
+	contentStream.drawString( "Hello World" );
+	contentStream.endText();
+	
+	// Make sure that the content stream is closed:
+	contentStream.close();
+	
+	// Save the results and ensure that the document is properly closed:
+	document.save( "Hello World.pdf");
+	document.close();
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/pdfbox-docs/blob/f93b45c3/docs/1.8/cookbook/fonts/workingwithfonts.md
----------------------------------------------------------------------
diff --git a/docs/1.8/cookbook/fonts/workingwithfonts.md b/docs/1.8/cookbook/fonts/workingwithfonts.md
new file mode 100644
index 0000000..1a717c1
--- /dev/null
+++ b/docs/1.8/cookbook/fonts/workingwithfonts.md
@@ -0,0 +1,123 @@
+Title: Cookbook - Working with Fonts
+
+## Working with Fonts
+
+### Standard 14 Fonts
+
+The PDF specification states that a standard set of 14 fonts will always be available when consuming PDF documents. In PDFBox these are defined as constants in the PDType1Font class.
+
+| Standard Font | Description |
+| ------------- | ----------- |
+| PDType1Font.TIMES_ROMAN | Times regular |
+| PDType1Font.TIMES_BOLD | Times bold |
+| PDType1Font.TIMES_ITALIC | Times italic |
+| PDType1Font.TIMES_BOLD_ITALIC | Times bold italic |
+| PDType1Font.HELVETICA | Helvetica regular |
+| PDType1Font.HELVETICA_BOLD | Helvetica bold |
+| PDType1Font.HELVETICA_OBLIQUE | Helvetica italic |
+| PDType1Font.HELVETICA_BOLD_OBLIQUE | Helvetica bold italic | 
+| PDType1Font.COURIER | Courier |
+| PDType1Font.COURIER_BOLD | Courier bold |
+| PDType1Font.COURIER_OBLIQUE | Courier italic |
+| PDType1Font.COURIER_BOLD_OBLIQUE | Courier bold italic |
+| PDType1Font.SYMBOL | Symbol Set |
+| PDType1Font.ZAPF_DINGBATS | Dingbat Typeface |
+
+### Hello World using a PDF base font
+
+This small sample shows how to create a new document and print the text "Hello World" using one of the PDF base fonts.
+
+	:::java
+	// Create a document and add a page to it
+	PDDocument document = new PDDocument();
+	PDPage page = new PDPage();
+	document.addPage( page );
+	
+	// Create a new font object selecting one of the PDF base fonts
+	PDFont font = PDType1Font.HELVETICA_BOLD;
+	
+	// Start a new content stream which will "hold" the to be created content
+	PDPageContentStream contentStream = new PDPageContentStream(document, page);
+	
+	// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
+	contentStream.beginText();
+	contentStream.setFont( font, 12 );
+	contentStream.moveTextPositionByAmount( 100, 700 );
+	contentStream.drawString( "Hello World" );
+	contentStream.endText();
+	
+	// Make sure that the content stream is closed:
+	contentStream.close();
+	
+	// Save the results and ensure that the document is properly closed:
+	document.save( "Hello World.pdf");
+	document.close();
+
+### Hello World using a TrueType font
+
+This small sample shows how to create a new document and print the text "Hello World" using a TrueType font.
+
+	:::java
+	// Create a document and add a page to it
+	PDDocument document = new PDDocument();
+	PDPage page = new PDPage();
+	document.addPage( page );
+	
+	// Create a new font object by loading a TrueType font into the document
+	PDFont font = PDTrueTypeFont.loadTTF(document, "Arial.ttf");
+	
+	// Start a new content stream which will "hold" the to be created content
+	PDPageContentStream contentStream = new PDPageContentStream(document, page);
+	
+	// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
+	contentStream.beginText();
+	contentStream.setFont( font, 12 );
+	contentStream.moveTextPositionByAmount( 100, 700 );
+	contentStream.drawString( "Hello World" );
+	contentStream.endText();
+	
+	// Make sure that the content stream is closed:
+	contentStream.close();
+	
+	// Save the results and ensure that the document is properly closed:
+	document.save( "Hello World.pdf");
+	document.close();
+
+While it is recommended to embed all fonts for greatest portability, not all PDF producer
+applications will do this. When displaying such a PDF it is necessary to find an external font to use.
+PDFBox will look for a mapping file to use when substituting fonts.
+
+PDFBox will load Resources/PDFBox_External_Fonts.properties off of the classpath to map font
+names to TTF font files. The UNKNOWN_FONT property in that file will tell PDFBox which font to
+use when no mapping exists.
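
The lookup rule described above (use the mapped TTF file, otherwise fall back to the `UNKNOWN_FONT` entry) can be sketched with `java.util.Properties`. This is an illustration of the rule only, not PDFBox's internal implementation. The class name `FontMappingSketch` and the sample font names and paths in the usage note are assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical illustration of a name-to-TTF mapping with an UNKNOWN_FONT fallback.
final class FontMappingSketch {
    private final Properties mapping = new Properties();

    // propertiesText holds "fontName=ttfFile" lines, as in a .properties file
    FontMappingSketch(String propertiesText) {
        try {
            mapping.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Reading from an in-memory string is not expected to fail
            throw new IllegalStateException(e);
        }
    }

    // Returns the TTF file mapped to fontName, or the UNKNOWN_FONT entry when
    // no mapping exists (null if that entry is missing as well).
    String resolve(String fontName) {
        return mapping.getProperty(fontName, mapping.getProperty("UNKNOWN_FONT"));
    }
}
```

With the made-up content `Arial=fonts/ArialMT.ttf` and `UNKNOWN_FONT=fonts/Fallback.ttf`, `resolve("Arial")` yields the first path and any unmapped name yields the second.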
+
+
+### Hello World using a PostScript Type1 font
+
+This small sample shows how to create a new document and print the text "Hello World" using a PostScript Type1 font.
+
+	:::java
+	// Create a document and add a page to it
+	PDDocument document = new PDDocument();
+	PDPage page = new PDPage();
+	document.addPage( page );
+	
+	// Create a new font object by loading a PostScript Type 1 font into the document
+	PDFont font = new PDType1AfmPfbFont(document,"cfm.afm");
+	
+	// Start a new content stream which will "hold" the to be created content
+	PDPageContentStream contentStream = new PDPageContentStream(document, page);
+	
+	// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
+	contentStream.beginText();
+	contentStream.setFont( font, 12 );
+	contentStream.moveTextPositionByAmount( 100, 700 );
+	contentStream.drawString( "Hello World" );
+	contentStream.endText();
+	
+	// Make sure that the content stream is closed:
+	contentStream.close();
+	
+	// Save the results and ensure that the document is properly closed:
+	document.save( "Hello World.pdf");
+	document.close();
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/pdfbox-docs/blob/f93b45c3/docs/1.8/cookbook/metadata/workingwithmetadata.md
----------------------------------------------------------------------
diff --git a/docs/1.8/cookbook/metadata/workingwithmetadata.md b/docs/1.8/cookbook/metadata/workingwithmetadata.md
new file mode 100644
index 0000000..150600b
--- /dev/null
+++ b/docs/1.8/cookbook/metadata/workingwithmetadata.md
@@ -0,0 +1,62 @@
+Title: Cookbook - Working with Metadata
+
+## Working with Metadata
+
+### Introduction
+
+PDF documents can contain information describing the document itself or certain objects
+within the document, such as the author of the document or its creation date.
+Basic information can be set and retrieved using the PDDocumentInformation object.
+
+In addition, more metadata can be retrieved using the XML metadata as described below.
+
+### Getting basic Metadata
+
+To set or retrieve basic information about the document, the PDDocumentInformation object
+provides a high level API to that information:
+
+	:::java
+    PDDocumentInformation info = document.getDocumentInformation();
+    System.out.println( "Page Count=" + document.getNumberOfPages() );
+    System.out.println( "Title=" + info.getTitle() );
+    System.out.println( "Author=" + info.getAuthor() );
+    System.out.println( "Subject=" + info.getSubject() );
+    System.out.println( "Keywords=" + info.getKeywords() );
+    System.out.println( "Creator=" + info.getCreator() );
+    System.out.println( "Producer=" + info.getProducer() );
+    System.out.println( "Creation Date=" + info.getCreationDate() );
+    System.out.println( "Modification Date=" + info.getModificationDate());
+    System.out.println( "Trapped=" + info.getTrapped() );      
+      
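
Assuming, as in the 1.8 API, that `getCreationDate()` and `getModificationDate()` return a `java.util.Calendar` (or `null` when the entry is absent), concatenating the value into a string prints the Calendar's verbose `toString()`. The following sketch shows one way to format such a value. `MetadataDates` is a hypothetical helper name, and the ISO 8601 style pattern is an arbitrary choice.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;

// Hypothetical helper (not part of PDFBox) for printing document dates.
final class MetadataDates {
    // Formats a Calendar in its own time zone, or reports that no date is set
    static String format(Calendar date) {
        if (date == null) {
            return "(not set)";
        }
        SimpleDateFormat fmt =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssXXX", Locale.ROOT);
        // Keep the offset stored with the date instead of the JVM default zone
        fmt.setTimeZone(date.getTimeZone());
        return fmt.format(date.getTime());
    }
}
```

It could then be used as `System.out.println( "Creation Date=" + MetadataDates.format( info.getCreationDate() ) );`.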
+
+### Accessing PDF Metadata
+
+See class:org.apache.pdfbox.pdmodel.common.PDMetadata  
+See example:AddMetadataFromDocInfo  
+See Adobe Documentation:XMP Specification  
+
+PDF documents can have XML metadata associated with certain objects within a PDF document.
+For example, the following PD Model objects have the ability to contain metadata:
+
+    PDDocumentCatalog
+    PDPage
+    PDXObject
+    PDICCBased
+    PDStream
+
+The metadata that is stored in PDF objects conforms to the XMP specification; it is
+recommended that you review that specification. Currently there is no high level API for
+managing the XML metadata. PDFBox uses standard Java InputStream/OutputStream to retrieve
+or set the XML metadata.
+
+	:::java
+	PDDocument doc = PDDocument.load( ... );
+    PDDocumentCatalog catalog = doc.getDocumentCatalog();
+    PDMetadata metadata = catalog.getMetadata();
+
+    //to read the XML metadata
+    InputStream xmlInputStream = metadata.createInputStream();
+
+    //or to write new XML metadata
+    InputStream newXMPData = ...;
+    PDMetadata newMetadata = new PDMetadata(doc, newXMPData, false );
+    catalog.setMetadata( newMetadata );
\ No newline at end of file
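
Because the XML metadata is exchanged as plain streams, the JDK's own XML parser can be used on the stream returned by `metadata.createInputStream()`. The following is a minimal sketch under that assumption: it parses a stream and returns the name of the root element, and it wraps a string as the kind of `InputStream` used when writing new metadata. `XmpPacketSketch` and its method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper (not part of PDFBox) for working with XML metadata streams.
final class XmpPacketSketch {
    // Parses the XML read from the stream and returns the qualified name of its root element
    static String rootElementName(InputStream xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document dom = factory.newDocumentBuilder().parse(xml);
            return dom.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML metadata", e);
        }
    }

    // Wraps XML text as an InputStream, e.g. as the source for new metadata
    static InputStream asStream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }
}
```

Metadata may be absent from a document, so a real caller would check the `PDMetadata` reference for `null` before opening its stream.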

http://git-wip-us.apache.org/repos/asf/pdfbox-docs/blob/f93b45c3/docs/1.8/cookbook/pdfa/pdfacreation.md
----------------------------------------------------------------------
diff --git a/docs/1.8/cookbook/pdfa/pdfacreation.md b/docs/1.8/cookbook/pdfa/pdfacreation.md
new file mode 100644
index 0000000..947e734
--- /dev/null
+++ b/docs/1.8/cookbook/pdfa/pdfacreation.md
@@ -0,0 +1,71 @@
+Title:     Create a valid PDF/A document
+Notice:    Licensed to the Apache Software Foundation (ASF) under one
+           or more contributor license agreements.  See the NOTICE file
+           distributed with this work for additional information
+           regarding copyright ownership.  The ASF licenses this file
+           to you under the Apache License, Version 2.0 (the
+           "License"); you may not use this file except in compliance
+           with the License.  You may obtain a copy of the License at
+           .
+             http://www.apache.org/licenses/LICENSE-2.0
+           .
+           Unless required by applicable law or agreed to in writing,
+           software distributed under the License is distributed on an
+           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+           KIND, either express or implied.  See the License for the
+           specific language governing permissions and limitations
+           under the License.
+
+## PDF/A Creation
+
+The Apache PDFBox API can be used to create a PDF/A file. PDF/A is a PDF file with some constraints
+to ensure its long-term preservation. These constraints are described in ISO 19005.
+
+This small sample shows what should be added during creation of a PDF file to turn it into a
+valid PDF/A document. The current example creates a valid PDF/A-1b document.
+
+### Load all the fonts used in document
+
+The PDF/A specification requires that the fonts used in the document are embedded in the PDF
+file, so you have to load them. As an example:
+
+	:::java
+	InputStream fontStream = CreatePDFA.class.getResourceAsStream("/org/apache/pdfbox/resources/ttf/ArialMT.ttf");
+	PDFont font = PDTrueTypeFont.loadTTF(doc, fontStream);
+
+### Including XMP metadata block
+
+XMP metadata must be defined in the PDF. At a minimum, the PDF/A identification schema (which
+states the version of the PDF/A specification the document conforms to) must be present.
+These lines create the XMP metadata for a PDF/A-1b document:
+
+	:::java
+	XMPMetadata xmp = new XMPMetadata();
+	XMPSchemaPDFAId pdfaid = new XMPSchemaPDFAId(xmp);
+	xmp.addSchema(pdfaid);
+	pdfaid.setConformance("B");
+	pdfaid.setPart(1);
+	pdfaid.setAbout("");
+	metadata.importXMPMetadata(xmp);
+
+### Including color profile
+
+It is mandatory to include the color profile used by the document. Different profiles can be used. This
+example takes one that ships with PDFBox:
+
+	:::java
+	// create output intent
+	InputStream colorProfile = CreatePDFA.class.getResourceAsStream("/org/apache/pdfbox/resources/pdfa/sRGB Color Space Profile.icm");
+	PDOutputIntent oi = new PDOutputIntent(doc, colorProfile); 
+	oi.setInfo("sRGB IEC61966-2.1"); 
+	oi.setOutputCondition("sRGB IEC61966-2.1"); 
+	oi.setOutputConditionIdentifier("sRGB IEC61966-2.1"); 
+	oi.setRegistryName("http://www.color.org"); 
+	cat.addOutputIntent(oi);
+
+### Complete example
+
+The complete example can be found in pdfbox-example. The source file is
+
+	src/main/java/org/apache/pdfbox/examples/pdfa/CreatePDFA.java
+

http://git-wip-us.apache.org/repos/asf/pdfbox-docs/blob/f93b45c3/docs/1.8/cookbook/pdfa/pdfavalidation.md
----------------------------------------------------------------------
diff --git a/docs/1.8/cookbook/pdfa/pdfavalidation.md b/docs/1.8/cookbook/pdfa/pdfavalidation.md
new file mode 100644
index 0000000..ce55120
--- /dev/null
+++ b/docs/1.8/cookbook/pdfa/pdfavalidation.md
@@ -0,0 +1,81 @@
+Title: Cookbook - PDF/A Validation
+
+## PDF/A Validation
+
+The Apache Preflight library is a Java tool that implements a parser compliant with the ISO 19005-1 specification (also known as PDF/A-1).
+
+### Check Compliance with PDF/A-1b
+
+This small sample shows how to check the compliance of a file with the PDF/A-1b specification.
+
+	:::java
+    ValidationResult result = null;
+
+    FileDataSource fd = new FileDataSource(args[0]);
+    PreflightParser parser = new PreflightParser(fd);
+    try
+    {
+
+        /* Parse the PDF file with PreflightParser that inherits from the NonSequentialParser.
+         * Some additional controls are present to check a set of PDF/A requirements. 
+         * (Stream length consistency, EOL after some Keyword...)
+         */
+        parser.parse();
+
+        /* Once the syntax validation is done,
+         * the parser can provide a PreflightDocument
+         * (that inherits from PDDocument).
+         * This document performs the rest of the PDF/A validation.
+         */
+        PreflightDocument document = parser.getPreflightDocument();
+        document.validate();
+
+        // Get validation result
+        result = document.getResult();
+        document.close();
+
+    }
+    catch (SyntaxValidationException e)
+    {
+        /* the parse method can throw a SyntaxValidationException 
+         * if the PDF file can't be parsed.
+         * In this case, the exception contains an instance of ValidationResult  
+         */
+        result = e.getResult();
+    }
+
+    // display validation result
+    if (result.isValid())
+    {
+        System.out.println("The file " + args[0] + " is a valid PDF/A-1b file");
+    }
+    else
+    {
+        System.out.println("The file " + args[0] + " is not valid, error(s) :");
+        for (ValidationError error : result.getErrorsList())
+        {
+            System.out.println(error.getErrorCode() + " : " + error.getDetails());
+        }
+    }
+
+### Categories of Validation Error
+
+If a validation fails, the ValidationResult object contains all causes of the failure.
+To make failures easier to understand, all error codes have the form X[.Y[.Z]], where:
+
+ - 'X' is the category (e.g. "Font validation error")
+ - 'Y' represents a subsection of the category (e.g. "Font with Glyph error")
+ - 'Z' represents the cause of the error (e.g. "Font with a missing Glyph")
+
+The subsection ('Y') and cause ('Z') may be missing, depending on how precisely the error could be identified.
+
+The table below lists all categories (for detailed causes, see the constants in the PreflightConstants interface):
+
+| Category | Description |
+| -------- | ----------- | 
+| 1[.y[.z]] | Syntax Error |
+| 2[.y[.z]] | Graphic Error |
+| 3[.y[.z]] | Font Error |
+| 4[.y[.z]] | Transparency Error |
+| 5[.y[.z]] | Annotation Error |
+| 6[.y[.z]] | Action Error |
+| 7[.y[.z]] | Metadata Error |
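
As a minimal sketch of how the X[.Y[.Z]] form can be used, the class below maps an error code string (such as the value returned by `error.getErrorCode()`) to its category description from the table above. The class name `ErrorCodeCategory` and the method `describe` are hypothetical helpers written for illustration; they are not part of Preflight.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Apache Preflight.
public class ErrorCodeCategory {

    // Top-level categories, copied from the table above.
    private static final Map<String, String> CATEGORIES = new LinkedHashMap<String, String>();
    static {
        CATEGORIES.put("1", "Syntax Error");
        CATEGORIES.put("2", "Graphic Error");
        CATEGORIES.put("3", "Font Error");
        CATEGORIES.put("4", "Transparency Error");
        CATEGORIES.put("5", "Annotation Error");
        CATEGORIES.put("6", "Action Error");
        CATEGORIES.put("7", "Metadata Error");
    }

    /** Returns the category description for an error code of the form X[.Y[.Z]]. */
    public static String describe(String errorCode) {
        // The category 'X' is everything before the first dot,
        // or the whole code when 'Y' and 'Z' are missing.
        int dot = errorCode.indexOf('.');
        String category = dot < 0 ? errorCode : errorCode.substring(0, dot);
        String description = CATEGORIES.get(category);
        return description != null ? description : "Unknown category";
    }

    public static void main(String[] args) {
        System.out.println(describe("3.1.2"));
        System.out.println(describe("7"));
    }
}
```

Grouping errors by category in this way can make a long validation report easier to scan before looking up the detailed cause.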

http://git-wip-us.apache.org/repos/asf/pdfbox-docs/blob/f93b45c3/docs/1.8/cookbook/textextraction/textextraction.md
----------------------------------------------------------------------
diff --git a/docs/1.8/cookbook/textextraction/textextraction.md b/docs/1.8/cookbook/textextraction/textextraction.md
new file mode 100644
index 0000000..1662d7b
--- /dev/null
+++ b/docs/1.8/cookbook/textextraction/textextraction.md
@@ -0,0 +1,96 @@
+Title: Cookbook - Text Extraction
+
+## Text Extraction
+
+### Extracting Text
+
+See class:org.apache.pdfbox.util.PDFTextStripper  
+See class:org.apache.pdfbox.searchengine.lucene.LucenePDFDocument  
+See command line app:ExtractText  
+
+One of the main features of PDFBox is its ability to quickly and accurately extract text
+from a variety of PDF documents. This functionality is encapsulated in the
+org.apache.pdfbox.util.PDFTextStripper class and can be run from the command line with
+org.apache.pdfbox.ExtractText.
+
+### Lucene Integration
+
+Lucene is an open source text search library from the Apache Software Foundation. In order for
+Lucene to be able to index a PDF document it must first be converted to text. PDFBox provides
+a simple approach for adding PDF documents into a Lucene index.
+
+	:::java
+	Document luceneDocument = LucenePDFDocument.getDocument( ... );
+          
+Now that you have a Lucene Document object, you can add it to the Lucene index just like
+you would if it had been created from a text or HTML file. The LucenePDFDocument automatically
+extracts a variety of metadata fields from the PDF to be added to the index; the javadoc
+shows details on those fields. This approach is very simple and should be sufficient for
+most users. If it is not, you can use some of the advanced text extraction techniques
+described in the next section.
+
+### Advanced Text Extraction
+
+Some applications will have complex text extraction requirements that neither the command
+line application nor the LucenePDFDocument can fulfill.
+It is possible for users to utilize or extend the PDFTextStripper class to meet some of
+these requirements.
+
+#### Limiting The Extracted Text
+
+There are several ways that we can limit the text that is extracted during the extraction
+process. The simplest is to specify the range of pages that you want to be extracted.
+For example, to only extract text from the second and third pages of the PDF document
+you could do this:
+
+	:::java
+    PDFTextStripper stripper = new PDFTextStripper();
+    stripper.setStartPage( 2 );
+    stripper.setEndPage( 3 );
+    stripper.writeText( ... );
+        
+NOTE: The startPage and endPage properties of PDFTextStripper are 1-based and inclusive.
+
+If you wanted to start on page 2 and extract to the end of the document then you would just
+set the startPage property. By default all pages in the PDF document are extracted.
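
The 1-based, inclusive semantics can be illustrated with a small stand-alone sketch. The class `PageRange` below is hypothetical and not part of PDFBox; it only models the rule stated in the note, and using `Integer.MAX_VALUE` to stand for "no end page set" is an assumption of this sketch.

```java
// Hypothetical model of the page range rule; not a PDFBox class.
public class PageRange {
    private final int startPage; // 1-based, inclusive
    private final int endPage;   // 1-based, inclusive

    public PageRange(int startPage, int endPage) {
        this.startPage = startPage;
        this.endPage = endPage;
    }

    /** Range that starts at startPage and runs to the end of the document. */
    public static PageRange from(int startPage) {
        // Assumption of this sketch: an unset end page is modeled as Integer.MAX_VALUE.
        return new PageRange(startPage, Integer.MAX_VALUE);
    }

    /** True if the 1-based page number falls inside the range. */
    public boolean contains(int pageNumber) {
        return pageNumber >= startPage && pageNumber <= endPage;
    }

    public static void main(String[] args) {
        PageRange secondAndThird = new PageRange(2, 3);
        for (int page = 1; page <= 4; page++) {
            System.out.println("page " + page + " extracted: " + secondAndThird.contains(page));
        }
    }
}
```

With a range of 2 to 3, pages 2 and 3 are included and pages 1 and 4 are not, which mirrors the `setStartPage( 2 )` / `setEndPage( 3 )` example above.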
+
+It is also possible to limit the extracted text to be between two bookmarks in the document.
+If you are not familiar with how to use bookmarks in PDFBox then you should review the
+Bookmarks page. Similar to the startPage/endPage properties, PDFTextStripper also has
+startBookmark/endBookmark properties. There are some caveats to be aware of when using this
+feature of the PDFTextStripper. Not all bookmarks point to a page in the current PDF document.
+
+The possible states of a bookmark are:
+
+ - null - The property was not set; this is the default.
+ - Points to page in the PDF - The property was set and points to a valid page in the PDF
+ - Bookmark does not point to anything - The property was set but the bookmark does not point to any page
+ - Bookmark points to external action - The property was set, but it points to a page in a different PDF or performs an action when activated
+
+The table below describes how PDFBox behaves in the various scenarios:
+
+| Start Bookmark | End Bookmark | Result |
+| -------------- | ------------ | ------ |
+| null | null | This is the default; the properties have no effect on the text extraction. |
+| Points to a page in the PDF | null | Text extraction will begin on the page that this bookmark points to and go until the end of the document. |
+| null | Points to a page in the PDF | Text extraction will begin on the first page and stop at the end of the page that this bookmark points to. |
+| Bookmark does not point to anything | null | Because the PDFTextStripper cannot determine a start page based on the bookmark, it will start on the first page and go until the end of the document. |
+| null | Bookmark does not point to anything | Because the PDFTextStripper cannot determine an end page based on the bookmark, it will start on the first page and go until the end of the document. |
+| Bookmark does not point to anything | Bookmark does not point to anything | This is a special case! If the startBookmark and endBookmark are exactly the same then no text will be extracted. If they are different then it is not possible for the PDFTextStripper to determine the pages, so it will include the entire document. |
+| Bookmark points to external action | Bookmark points to external action | If either the startBookmark or the endBookmark refers to an external page or executes an action then an OutlineNotLocalException will be thrown to indicate to the user that the bookmark is not valid. |
+
+NOTE: PDFTextStripper will check both the startPage/endPage and the startBookmark/endBookmark to determine if text should be extracted from the current page.
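
The scenarios in the table can be restated as a small decision function. This is only a sketch of the documented behaviour, not the PDFTextStripper implementation: the class `BookmarkRange`, both enums and the `resolve` method are hypothetical names, and combinations that the table does not list are reported as `UNSPECIFIED` rather than guessed.

```java
// Hypothetical model of the bookmark table; not a PDFBox class.
public class BookmarkRange {

    /** Possible states of the startBookmark/endBookmark properties, as listed above. */
    public enum State { NULL, LOCAL_PAGE, NO_TARGET, EXTERNAL }

    /** Outcome described in the "Result" column of the table. */
    public enum Result {
        WHOLE_DOCUMENT,       // first page to end of document
        FROM_START_BOOKMARK,  // start bookmark page to end of document
        UNTIL_END_BOOKMARK,   // first page to end of the end bookmark page
        NOTHING,              // no text is extracted
        EXCEPTION,            // OutlineNotLocalException is thrown
        UNSPECIFIED           // combination not covered by the table
    }

    public static Result resolve(State start, State end, boolean sameBookmark) {
        // Either bookmark being external is an error.
        if (start == State.EXTERNAL || end == State.EXTERNAL) {
            return Result.EXCEPTION;
        }
        // Special case: both bookmarks point nowhere.
        if (start == State.NO_TARGET && end == State.NO_TARGET) {
            return sameBookmark ? Result.NOTHING : Result.WHOLE_DOCUMENT;
        }
        if (start == State.LOCAL_PAGE && end == State.NULL) {
            return Result.FROM_START_BOOKMARK;
        }
        if (start == State.NULL && end == State.LOCAL_PAGE) {
            return Result.UNTIL_END_BOOKMARK;
        }
        // Remaining rows: one side unset, the other unset or pointing nowhere.
        if (start != State.LOCAL_PAGE && end != State.LOCAL_PAGE) {
            return Result.WHOLE_DOCUMENT;
        }
        return Result.UNSPECIFIED;
    }

    public static void main(String[] args) {
        System.out.println(resolve(State.LOCAL_PAGE, State.NULL, false));
        System.out.println(resolve(State.NO_TARGET, State.NO_TARGET, true));
    }
}
```

Keeping the unlisted combinations separate makes it obvious which cases the table actually documents and which would need to be checked against the library itself.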
+
+#### External Glyph List
+
+Some PDF files need to map between glyph names and Unicode values during text extraction.
+PDFBox comes with an Adobe Glyph List, but you may encounter files with glyph names that
+are not in that map. To use your own glyphlist file, supply the file name via the ``glyphlist_ext``
+JVM system property.
+
+#### Right to Left Text
+
+Extracting text in languages whose text goes from right to left (such as Arabic and Hebrew)
+in PDF files can result in text that is backwards. PDFBox can normalize and reverse the text
+if the ICU4J jar file has been placed on the classpath (it is an optional dependency). 
+Note that you should also enable sorting with either org.apache.pdfbox.util.PDFTextStripper
+or org.apache.pdfbox.ExtractText to ensure accurate output.

