Apache PDFBox - PDFBox - PDF File References

Reply-To: dev@pdfbox.apache.org Delivered-To: mailing list commits@pdfbox.apache.org Received: (qmail 22778 invoked by uid 99); 16 Feb 2010 19:37:58 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 16 Feb 2010 19:37:58 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.4] (HELO eris.apache.org) (140.211.11.4) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 16 Feb 2010 19:37:44 +0000 Received: by eris.apache.org (Postfix, from userid 65534) id 245A52388A56; Tue, 16 Feb 2010 19:37:22 +0000 (UTC) Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: svn commit: r910660 [5/5] - in /pdfbox/site/publish: ./ commandlineutilities/ css/ images/ images/logos/ userguide/ Date: Tue, 16 Feb 2010 19:37:20 -0000 To: commits@pdfbox.apache.org From: jukka@apache.org X-Mailer: svnmailer-1.0.8 Message-Id: <20100216193722.245A52388A56@eris.apache.org> X-Virus-Checked: Checked by ClamAV on apache.org Added: pdfbox/site/publish/userguide/file_references.html URL: http://svn.apache.org/viewvc/pdfbox/site/publish/userguide/file_references.html?rev=910660&view=auto ============================================================================== --- pdfbox/site/publish/userguide/file_references.html (added) +++ pdfbox/site/publish/userguide/file_references.html Tue Feb 16 19:37:14 2010 @@ -0,0 +1,311 @@ + + + + + + + + + + + + + + + + Apache PDFBox - PDFBox - PDF File References + + + + + +


About

+ Welcome +
+ Download +
+ License +
+ Mailing Lists +
+ Issue Tracker +
+ References +
+ ASF Sponsorship Program +
+ ASF Thanks +

Command Line Utilities

+ Index +
+ Decrypt +
+ Encrypt +
+ ExtractText +
+ PDFToImage +
+ PrintPDF +
+ ConvertColorspace +
+ TextToPDF +

Developers Guide

+ Index +
+ Bookmarks +
+ Building PDFBox +
+ FAQ +
+ File References +
+ Fonts +
+ Highlighting +
+ Metadata +
+ Redistribution +
+ .NET Version +
+ Text Extraction +

Project Documentation

+ Project Information +


PDF File Specification

+ See package: org.apache.pdfbox.pdmodel.common.filespecification
+ See example: EmbeddedFiles

+ A PDF can reference external files, either through the file system or through a URL to a remote location. It is also possible to embed a binary file in a PDF document.

+ Two classes can be used to reference a file. PDSimpleFileSpecification is a simple string reference to a file (e.g. "./movies/BigMovie.avi") and does not allow any parameters to be set. PDComplexFileSpecification is more feature-rich and allows advanced settings on the file reference.

+ It is also possible to embed a file directly into a PDF. Instead of setting the file attribute of the PDComplexFileSpecification, set its EmbeddedFile attribute.

File Attachments

+ PDF documents can contain file attachments, which are accessed from the Document->File Attachments menu. PDFBox allows attachments to be added to and extracted from PDF documents. Attachments are part of the name tree that is attached to the document catalog.

+        PDEmbeddedFilesNameTreeNode efTree = new PDEmbeddedFilesNameTreeNode();
+
+        //first create the file specification, which holds the embedded file
+        PDComplexFileSpecification fs = new PDComplexFileSpecification();
+        fs.setFile( "Test.txt" );
+        InputStream is = ...;
+        PDEmbeddedFile ef = new PDEmbeddedFile(doc, is );
+        //set some of the attributes of the embedded file
+        ef.setSubtype( "text/plain" );
+        //"data" is assumed to be the byte array that backs the input stream above
+        ef.setSize( data.length );
+        ef.setCreationDate( new GregorianCalendar() );
+        fs.setEmbeddedFile( ef );
+
+        //now add the entry to the embedded file tree and set in the document.
+        Map efMap = new HashMap();
+        efMap.put( "My first attachment", fs );
+        efTree.setNames( efMap );
+        //attachments are stored as part of the "names" dictionary in the document catalog
+        PDDocumentNameDictionary names = new PDDocumentNameDictionary( doc.getDocumentCatalog() );
+        names.setEmbeddedFiles( efTree );
+        doc.getDocumentCatalog().setNames( names );
+


+ + + Propchange: pdfbox/site/publish/userguide/file_references.html ------------------------------------------------------------------------------ svn:eol-style = native Added: pdfbox/site/publish/userguide/fonts.html URL: http://svn.apache.org/viewvc/pdfbox/site/publish/userguide/fonts.html?rev=910660&view=auto ============================================================================== --- pdfbox/site/publish/userguide/fonts.html (added) +++ pdfbox/site/publish/userguide/fonts.html Tue Feb 16 19:37:14 2010 @@ -0,0 +1,320 @@ + + + + + + + + + + + + + + + + Apache PDFBox - PDFBox - PDF Fonts + + + + + +


Standard 14 Fonts

+ The PDF specification states that a standard set of 14 fonts will always be available when consuming + PDF documents. In PDFBox these are defined as constants in the PDType1Font class. +


Standard Font
PDType1Font.TIMES_ROMAN
PDType1Font.TIMES_BOLD
PDType1Font.TIMES_ITALIC
PDType1Font.TIMES_BOLD_ITALIC
PDType1Font.HELVETICA
PDType1Font.HELVETICA_BOLD
PDType1Font.HELVETICA_OBLIQUE
PDType1Font.HELVETICA_BOLD_OBLIQUE
PDType1Font.COURIER
PDType1Font.COURIER_BOLD
PDType1Font.COURIER_OBLIQUE
PDType1Font.COURIER_BOLD_OBLIQUE
PDType1Font.SYMBOL
PDType1Font.ZAPF_DINGBATS

TrueType Fonts

Embedding TrueType Fonts

+ PDFBox supports embedding TrueType fonts. Loading a new font is easy. +

+      PDDocument doc = PDDocument.load( ... );
+      PDFont font = PDTrueTypeFont.loadTTF( doc, new File( "SpecialFont.ttf" ) );

External TrueType Fonts

+ While it is recommended to embed all fonts for greatest portability, not all PDF producer applications will do this. When displaying a PDF whose fonts are not embedded, it is necessary to find an external font to use. PDFBox will look for a mapping file to use when substituting fonts.
+
+ PDFBox will load Resources/PDFBox_External_Fonts.properties from the classpath to map font names to TTF font files. The UNKNOWN_FONT property in that file tells PDFBox which font to use when no mapping exists.
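The lookup-with-fallback behaviour described above can be sketched with plain java.util.Properties. This is only an illustration of the idea, not PDFBox code: the class name FontMappingSketch, the font names, and the file paths are hypothetical, and the exact key format of the real properties file is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical sketch, not a PDFBox class: font name to TTF lookup with a fallback. */
final class FontMappingSketch {

    /** Returns the mapped TTF path, or the UNKNOWN_FONT entry when no mapping exists. */
    static String resolve(Properties mapping, String fontName) {
        String ttf = mapping.getProperty(fontName);
        return ttf != null ? ttf : mapping.getProperty("UNKNOWN_FONT");
    }

    public static void main(String[] args) throws IOException {
        // Assumed file layout: one "fontName=path" entry per line.
        Properties mapping = new Properties();
        mapping.load(new StringReader(
                "UNKNOWN_FONT=fonts/Fallback.ttf\n"
                + "Arial=fonts/Arial.ttf\n"));

        System.out.println(resolve(mapping, "Arial"));      // mapped entry
        System.out.println(resolve(mapping, "NoSuchFont")); // falls back to UNKNOWN_FONT
    }
}
```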


+ + + Propchange: pdfbox/site/publish/userguide/fonts.html ------------------------------------------------------------------------------ svn:eol-style = native Added: pdfbox/site/publish/userguide/highlighting.html URL: http://svn.apache.org/viewvc/pdfbox/site/publish/userguide/highlighting.html?rev=910660&view=auto ============================================================================== --- pdfbox/site/publish/userguide/highlighting.html (added) +++ pdfbox/site/publish/userguide/highlighting.html Tue Feb 16 19:37:14 2010 @@ -0,0 +1,323 @@ + + + + + + + + + + + + + + + + Apache PDFBox - PDFBox - PDF Highlighting + + + + + +


Highlighting text in a PDF

+ There are cases when you might want to highlight text in a PDF document. For example, if the PDF is the result + of a search request you might want to highlight the word in the resulting PDF document. There are several ways + this can be achieved, each method varying in complexity and flexibility. +

1. Use the 'search' open parameter

+ Acrobat supports passing in various parameters that tell it what to do once the PDF is open. See PDF Open Parameters for documentation on all the open parameters. One of them is the 'search' parameter, which automatically runs the search functionality inside Acrobat once the PDF is open. For example: http://pdfbox.apache.org/userguide/text_extraction.pdf#search="check"
+
+The words must be enclosed in quotes and separated by spaces; for example: #search="pdfbox rocks". This is a great solution because of its simplicity! It does not require PDFBox at all, yet many developers are not aware of it.
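As a minimal sketch, such a link could be assembled in Java as follows. The class and method names (SearchLink, withSearch) are hypothetical and not part of PDFBox, and the sketch writes the quotes literally, as in the examples above, without any URL encoding.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: adds the 'search' open parameter to a PDF link. */
final class SearchLink {

    /** Appends #search="word1 word2" to the given PDF URL. */
    static String withSearch(String pdfUrl, List<String> words) {
        if (words.isEmpty()) {
            return pdfUrl; // nothing to search for, leave the link untouched
        }
        // Words are separated by spaces and the whole list is enclosed in quotes.
        StringBuilder link = new StringBuilder(pdfUrl).append("#search=\"");
        for (int i = 0; i < words.size(); i++) {
            if (i > 0) {
                link.append(' ');
            }
            link.append(words.get(i));
        }
        return link.append('"').toString();
    }

    public static void main(String[] args) {
        System.out.println(withSearch(
                "http://pdfbox.apache.org/userguide/text_extraction.pdf",
                Arrays.asList("pdfbox", "rocks")));
    }
}
```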

2. Generate a highlight XML document

+ Acrobat also allows you to tell it to highlight specific words in the PDF document. It does this by passing an XML document to Acrobat when opening the PDF. See the PDF Highlight File Format for more detailed documentation.
+
+ Basically, the XML document tells Acrobat which characters to highlight in the PDF by using character offsets on a page. As this is just an XML document, there are many ways you could create it, but PDFBox has a utility to make it easier. Take a look at the javadoc for the PDFHighlighter class. It allows you to specify a set of words that you want highlighted and generates the XML document for you.
+
+ PDFBox also ships with a complete web application example that uses this class; take a look at the pdfbox.war directory in your PDFBox installation.
+ You pass the XML to Acrobat through a URL (or command line) parameter like this: http://pdfbox.apache.org/userguide/text_extraction.pdf#xml=http://pdfbox.apache.org/highlight.xml
+The value of the xml parameter must be a full URL to the XML document.
+ http://pdfbox.apache.org/userguide/text_extraction.pdf#xml=highlight.xml will not work
+ http://pdfbox.apache.org/userguide/text_extraction.pdf#xml=http://pdfbox.apache.org/highlight.xml is correct!
+The one drawback to this solution is that you must parse the PDF and then generate an XML document, which is a time-consuming operation.
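The full-URL requirement can be checked before a link is handed out. The following is a sketch under the assumption that "full URL" means an absolute URI with a scheme; HighlightLink and withHighlightXml are hypothetical names, not PDFBox API.

```java
import java.net.URI;

/** Hypothetical helper, not part of PDFBox: adds the 'xml' open parameter to a PDF link. */
final class HighlightLink {

    /** Appends #xml=<xmlUrl>, rejecting relative references such as "highlight.xml". */
    static String withHighlightXml(String pdfUrl, String xmlUrl) {
        // An absolute URI carries a scheme (http:, https:, ...); a bare file name does not.
        if (!URI.create(xmlUrl).isAbsolute()) {
            throw new IllegalArgumentException(
                    "xml parameter must be a full URL: " + xmlUrl);
        }
        return pdfUrl + "#xml=" + xmlUrl;
    }

    public static void main(String[] args) {
        System.out.println(withHighlightXml(
                "http://pdfbox.apache.org/userguide/text_extraction.pdf",
                "http://pdfbox.apache.org/highlight.xml"));
    }
}
```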

3. Alter PDF contents to highlight specific text

+ Using PDFBox it is possible to regenerate the appearance stream to add highlighting to specific areas. While this is possible, it requires creating a new PDF for every search request. Nothing prebuilt in PDFBox does this automatically for you, so it will require a significant coding effort.
+
+ You would need to:

Find all locations of the text, determine x/y coordinates, width/height
Regenerate the PDF appearance stream and draw a highlighted box behind the text. Yellow would be easiest; if you want inverted black/white, you would need to change the color of the text to white and draw a black box.
Stream the PDF back to the user

+ This is the most flexible approach, but it is also the most work to implement and the most resource intensive.


+ + + Propchange: pdfbox/site/publish/userguide/highlighting.html ------------------------------------------------------------------------------ svn:eol-style = native Added: pdfbox/site/publish/userguide/index.html URL: http://svn.apache.org/viewvc/pdfbox/site/publish/userguide/index.html?rev=910660&view=auto ============================================================================== --- pdfbox/site/publish/userguide/index.html (added) +++ pdfbox/site/publish/userguide/index.html Tue Feb 16 19:37:14 2010 @@ -0,0 +1,399 @@ + + + + + + + + + + + + + + + + Apache PDFBox - PDFBox - User Guide + + + + + +


PDFBox User Guide

+ This page will discuss the internals of PDF documents + and how those internals map to PDFBox classes. + Users should reference the javadoc to see what classes and methods are available. The + Adobe PDF Reference + can be used to determine detailed information about fields and their meanings. +

Examples

A variety of examples can be found in the + src/main/java/org/apache/pdfbox/examples folder. + This guide will refer to specific examples as needed. +

PDF File Format Overview

+ A PDF document is a stream of basic object types. The low level objects are represented in PDFBox + in the org.apache.pdfbox.cos package. The basic types in a PDF are: +


PDF Type	Description	Example	PDFBox class
Array	An ordered list of items	[1 2 3]	org.apache.pdfbox.cos.COSArray
Boolean	Standard True/False values	true	org.apache.pdfbox.cos.COSBoolean
Dictionary	A map of name value pairs	<< /Type /XObject /Name (Name) /Size 1 >>	org.apache.pdfbox.cos.COSDictionary
Number	Integer and Floating point numbers	1 2.3	org.apache.pdfbox.cos.COSFloat, org.apache.pdfbox.cos.COSInteger
Name	A predefined value in a PDF document, typically used as a key in a dictionary	/Type	org.apache.pdfbox.cos.COSName
Object	A wrapper to any of the other objects; this can be used to reference an object multiple times. An object is referenced by using two numbers, an object number and a generation number. Initially the generation number will be zero unless the object got replaced later in the stream.	12 0 obj << /Type /XObject >> endobj	org.apache.pdfbox.cos.COSObject
Stream	A stream of data, typically compressed. This is used for page contents, images and embedded font streams.	12 0 obj << /Type /XObject >> stream 030004040404040404 endstream	org.apache.pdfbox.cos.COSStream
String	A sequence of characters	(This is a string)	org.apache.pdfbox.cos.COSString

+ A page in a PDF document is represented with a COSDictionary. The entries that are available for a page are listed in the PDF Reference. An example of a page looks like this:

+<<
+    /Type /Page
+    /MediaBox [0 0 612 915]
+    /Contents 56 0 R
+>>

Some Java code to access fields


COSDictionary page = ...;
+COSArray mediaBox = (COSArray)page.getDictionaryObject( "MediaBox" );
+//the array is [llx lly urx ury]; with a lower-left x of 0, index 2 holds the width
+System.out.println( "Width:" + mediaBox.get( 2 ) );
+

PD Model

The COS Model allows access to all aspects of a PDF document. This type of programming is tedious and error prone, though, because the user must know the names of all the parameters and no helper methods are available. The PD Model was created to alleviate this problem. Each type of object (page, font, image) has a set of defined attributes that can appear in its dictionary. A PD Model class is available for each of these, so that strongly typed methods are available to access the attributes. The same code from above to get the page width can be rewritten to use PD Model classes.

PDPage page = ...;
+PDRectangle mediaBox = page.getMediaBox();
+System.out.println( "Width:" + mediaBox.getWidth() );

PD Model objects sit on top of the COS Model. Typically, a PD Model class only stores a COS object, and all setter/getter methods read or modify data that is stored in that COS object. For example, when you call PDPage.getLastModified(), the method looks up the key "LastModified" in the COSDictionary; if it is found, the value is converted to a java.util.Calendar. When PDPage.setLastModified( Calendar ) is called, the Calendar is converted to a string and stored in the COSDictionary.

Here is a visual depiction of the COS Model and PD Model design.

+ This design presents many advantages and disadvantages.
+
+Advantages:

Simple, easy to use API.
Underlying document automatically gets updated when you update the PD Model
Ability to easily access the COS Model from any PD Model object
Easily add to and update existing PDF documents

Disadvantages:

Object caching is not done in the PD Model classes. For example, each call to PDPage.getMediaBox() returns a new PDRectangle object, but each one contains the same underlying COSArray.
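The wrapper design and the caching caveat can be illustrated with a small stand-alone model. The classes below (Page, Rect) are hypothetical stand-ins for PDPage and PDRectangle, with a plain Map and float array in place of COSDictionary and COSArray; they are a sketch of the pattern, not PDFBox code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the PD-over-COS design; none of these are PDFBox classes. */
final class PdModelSketch {

    /** Typed view over a four-number array [llx lly urx ury]. */
    static final class Rect {
        final float[] array; // shared low-level storage, plays the role of COSArray

        Rect(float[] array) {
            this.array = array;
        }

        float getWidth() {
            return array[2] - array[0];
        }
    }

    /** Typed view over a dictionary, plays the role of PDPage over COSDictionary. */
    static final class Page {
        private final Map<String, Object> dict;

        Page(Map<String, Object> dict) {
            this.dict = dict;
        }

        /** No caching: every call wraps the same stored array in a new Rect. */
        Rect getMediaBox() {
            return new Rect((float[]) dict.get("MediaBox"));
        }
    }

    public static void main(String[] args) {
        Map<String, Object> dict = new HashMap<>();
        dict.put("MediaBox", new float[] {0, 0, 612, 915});
        Page page = new Page(dict);

        Rect a = page.getMediaBox();
        Rect b = page.getMediaBox();
        System.out.println(a.getWidth());       // width computed from the shared array
        System.out.println(a == b);             // distinct wrapper objects
        System.out.println(a.array == b.array); // same underlying storage
    }
}
```

Because both wrappers share one array, a change made through either of them is visible through the other and in the dictionary, which is the "underlying document automatically gets updated" advantage listed above.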


+ + + Propchange: pdfbox/site/publish/userguide/index.html ------------------------------------------------------------------------------ svn:eol-style = native Added: pdfbox/site/publish/userguide/metadata.html URL: http://svn.apache.org/viewvc/pdfbox/site/publish/userguide/metadata.html?rev=910660&view=auto ============================================================================== --- pdfbox/site/publish/userguide/metadata.html (added) +++ pdfbox/site/publish/userguide/metadata.html Tue Feb 16 19:37:14 2010 @@ -0,0 +1,295 @@ + + + + + + + + + + + + + + + + Apache PDFBox - PDFBox - PDF Metadata + + + + + +


Accessing PDF Metadata

+ See class: org.apache.pdfbox.pdmodel.common.PDMetadata
+ See example: AddMetadataFromDocInfo
+ See Adobe Documentation: XMP Specification

+ PDF documents can have XML metadata associated with certain objects within a PDF document. For example, the following PD Model objects + have the ability to contain metadata: +

PDDocumentCatalog
PDPage
PDXObject
PDICCBased
PDStream

The metadata that is stored in PDF objects conforms to the XMP specification; it is recommended that you review that specification. Currently there is no high-level API for managing the XML metadata. PDFBox uses standard Java InputStream/OutputStream to retrieve or set the XML metadata. For example:

+      PDDocument doc = PDDocument.load( ... );
+      PDDocumentCatalog catalog = doc.getDocumentCatalog();
+      PDMetadata metadata = catalog.getMetadata();
+
+      //to read the XML metadata
+      InputStream xmlInputStream = metadata.createInputStream();
+
+      //or to write new XML metadata
+      InputStream newXMPData = ...;
+      PDMetadata newMetadata = new PDMetadata(doc, newXMPData, false );
+      catalog.setMetadata( newMetadata );
+
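Since the metadata arrives as a plain InputStream of XML, it can be handed to any standard XML parser. The following sketch uses the JDK's DOM parser on a made-up, minimal XMP-like packet; the class name XmpStreamSketch and the sample packet are assumptions for illustration, and in real use the stream would come from the metadata object shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

/** Hypothetical sketch: reading XML metadata from an InputStream with the JDK DOM parser. */
final class XmpStreamSketch {

    /** Parses the stream and returns the local name of the root element. */
    static String rootElementName(InputStream xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // XMP relies on XML namespaces
            Document doc = factory.newDocumentBuilder().parse(xml);
            return doc.getDocumentElement().getLocalName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("metadata is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample packet, standing in for the stream read from the PDF.
        String packet = "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
                + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"/>"
                + "</x:xmpmeta>";
        InputStream in = new ByteArrayInputStream(packet.getBytes(StandardCharsets.UTF_8));
        System.out.println(rootElementName(in));
    }
}
```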


+ + + Propchange: pdfbox/site/publish/userguide/metadata.html ------------------------------------------------------------------------------ svn:eol-style = native Added: pdfbox/site/publish/userguide/redistribution.html URL: http://svn.apache.org/viewvc/pdfbox/site/publish/userguide/redistribution.html?rev=910660&view=auto ============================================================================== --- pdfbox/site/publish/userguide/redistribution.html (added) +++ pdfbox/site/publish/userguide/redistribution.html Tue Feb 16 19:37:14 2010 @@ -0,0 +1,315 @@ + + + + + + + + + + + + + + + + Apache PDFBox - PDFBox - Redistribution + + + + + +


Redistributing PDFBox

+ PDFBox makes use of several open source libraries. Some are required only for building PDFBox; others are required for running it. The table below summarizes the licenses that are included with PDFBox and when they are required.


Product (with license link)	Used for	Required for PDFBox redistribution
Adobe AFM	Resource files for extracting font encoding. Bundled inside the PDFBox jar file	Yes
Adobe CMap	Resource files for CJK font mapping. Bundled inside the PDFBox jar file	Yes
Adobe Glyphlist	Mapping for the computation of a Unicode character string from a sequence of glyphs. Bundled inside the PDFBox jar file	Yes
Apache Ant	Tool for building PDFBox	No
bouncycastle	Encryption libraries for encrypting/decrypting PDF documents	Yes
FontBox (incubating)	Font Library	Yes
JempBox (incubating)	Library for working with XMP metadata.	Yes
IKVM	Library for .NET version of PDFBox	Only if using the .NET version (the DLLs in /bin) of PDFBox
junit	Testing framework used in development	No
Apache Lucene	Text search engine library. PDFBox provides simple integration with Lucene.	Optional, only if using Lucene
ICU4J	Normalizing right to left text.	Optional, only if extracting right to left text


+ + + Propchange: pdfbox/site/publish/userguide/redistribution.html ------------------------------------------------------------------------------ svn:eol-style = native Added: pdfbox/site/publish/userguide/text_extraction.html URL: http://svn.apache.org/viewvc/pdfbox/site/publish/userguide/text_extraction.html?rev=910660&view=auto ============================================================================== --- pdfbox/site/publish/userguide/text_extraction.html (added) +++ pdfbox/site/publish/userguide/text_extraction.html Tue Feb 16 19:37:14 2010 @@ -0,0 +1,367 @@ + + + + + + + + + + + + + + + + Apache PDFBox - Java PDF Library, pdftotext, PDF to text, java pdf text extraction + + + + + +


Extracting Text

+ See class: org.apache.pdfbox.util.PDFTextStripper
+ See class: org.apache.pdfbox.searchengine.lucene.LucenePDFDocument
+ See command line app: ExtractText

+ One of the main features of PDFBox is its ability to quickly and accurately extract text from a variety of PDF documents. This functionality is encapsulated in the org.apache.pdfbox.util.PDFTextStripper class and can be easily executed on the command line with org.apache.pdfbox.ExtractText.

Lucene Integration

Lucene is an open source text search library from the Apache Software Foundation. In order for Lucene to be able to index a PDF document, the document must first be converted to text. PDFBox provides a simple approach for adding PDF documents to a Lucene index.

+          Document luceneDocument = LucenePDFDocument.getDocument( ... );
+

+ Now that you have a Lucene Document object, you can add it to the Lucene index just as you would if it had been created from a text or HTML file. The LucenePDFDocument automatically extracts a variety of metadata fields from the PDF to be added to the index; the javadoc shows details on those fields. This approach is very simple and should be sufficient for most users. If it is not, you can use some of the advanced text extraction techniques described in the next section.

Advanced Text Extraction

Some applications will have complex text extraction requirements that neither the command line application nor the LucenePDFDocument can fulfill. Users can use or extend the PDFTextStripper class to meet some of these requirements.

+Limiting The Extracted Text

+ There are several ways that we can limit the text that is extracted during the extraction process. The simplest is to + specify the range of pages that you want to be extracted. For example, to only extract text from the second and third pages + of the PDF document you could do this: +

+            PDFTextStripper stripper = new PDFTextStripper();
+            stripper.setStartPage( 2 );
+            stripper.setEndPage( 3 );
+            stripper.writeText( ... );
+

+The startPage and endPage properties of PDFTextStripper are 1-based and inclusive.

If you want to start on page 2 and extract to the end of the document, just set the startPage property. By default all pages in the PDF document are extracted.
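The page-range rules can be modeled in a few lines. This is a stand-alone sketch of the selection logic, not the PDFTextStripper implementation; the class name PageRangeSketch is hypothetical, and representing "no end page set" as Integer.MAX_VALUE is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of 1-based, inclusive page-range selection; not PDFBox code. */
final class PageRangeSketch {

    static final int DEFAULT_START = 1;
    // Assumption of this sketch: an unset end page is modeled as "no upper limit".
    static final int DEFAULT_END = Integer.MAX_VALUE;

    /** Returns the 1-based page numbers that fall inside [startPage, endPage]. */
    static List<Integer> pagesToExtract(int pageCount, int startPage, int endPage) {
        List<Integer> pages = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            if (page >= startPage && page <= endPage) {
                pages.add(page);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(pagesToExtract(5, 2, 3));                       // second and third page
        System.out.println(pagesToExtract(5, 2, DEFAULT_END));             // page 2 to the end
        System.out.println(pagesToExtract(5, DEFAULT_START, DEFAULT_END)); // all pages
    }
}
```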

It is also possible to limit the extracted text to the range between two bookmarks in the document. If you are not familiar with how to use bookmarks in PDFBox, you should review the Bookmarks page. Similar to the startPage/endPage properties, PDFTextStripper also has startBookmark/endBookmark properties. There are some caveats to be aware of when using this feature of the PDFTextStripper. Not all bookmarks point to a page in the current PDF document. The possible states of a bookmark are:

null - The property was not set, this is the default.
Points to page in the PDF - The property was set and points to a valid page in the PDF
Bookmark does not point to anything - The property was set but the bookmark does not point to any page
Bookmark points to external action - The property was set, but it points to a page in a different PDF or performs an action when activated

The table below will describe how PDFBox behaves in the various scenarios:


Start Bookmark	End Bookmark	Result
null	null	This is the default, the properties have no effect on the text extraction.
Points to page in the PDF	null	Text extraction will begin on the page that this bookmark points to and go until the end of the document.
null	Points to page in the PDF	Text extraction will begin on the first page and stop at the end of the page that this bookmark points to.
Bookmark does not point to anything	null	Because the PDFTextStripper cannot determine a start page based on the bookmark, it will start on the first page and go until the end of the document.
null	Bookmark does not point to anything	Because the PDFTextStripper cannot determine an end page based on the bookmark, it will start on the first page and go until the end of the document.
Bookmark does not point to anything	Bookmark does not point to anything	This is a special case! If the startBookmark and endBookmark are exactly the same then no text will be extracted. If they are different then it is not possible for the PDFTextStripper to determine the pages, so it will include the entire document.
Bookmark points to external action	Bookmark points to external action	If either the startBookmark or the endBookmark refers to an external page or executes an action, an OutlineNotLocalException will be thrown to indicate to the user that the bookmark is not valid.

+PDFTextStripper will check both the startPage/endPage and the startBookmark/endBookmark to determine if text should + be extracted from the current page.
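The scenarios in the table can be restated as a small decision function. This is a model of the documented behaviour only, written with hypothetical types (Kind, Bookmark) rather than PDFBox classes; it uses IllegalStateException where PDFBox is described as throwing OutlineNotLocalException, and combinations the table does not list simply fall through to the same defaults, which is an assumption.

```java
/** Hypothetical model of the bookmark table above; not the PDFTextStripper implementation. */
final class BookmarkRangeSketch {

    enum Kind { PAGE, NOWHERE, EXTERNAL }

    static final class Bookmark {
        final Kind kind;
        final int page; // 1-based, only meaningful when kind == PAGE

        Bookmark(Kind kind, int page) {
            this.kind = kind;
            this.page = page;
        }
    }

    /**
     * Returns {firstPage, lastPage} (1-based, inclusive), or null when no text
     * would be extracted. A null bookmark means the property was not set.
     */
    static int[] resolve(Bookmark start, Bookmark end, int pageCount) {
        if (isExternal(start) || isExternal(end)) {
            // Stand-in for the OutlineNotLocalException described in the table.
            throw new IllegalStateException("bookmark is not local to this document");
        }
        // Special case: the very same bookmark on both ends, pointing nowhere.
        if (start != null && start == end && start.kind == Kind.NOWHERE) {
            return null;
        }
        int first = (start != null && start.kind == Kind.PAGE) ? start.page : 1;
        int last = (end != null && end.kind == Kind.PAGE) ? end.page : pageCount;
        return new int[] {first, last};
    }

    private static boolean isExternal(Bookmark b) {
        return b != null && b.kind == Kind.EXTERNAL;
    }

    public static void main(String[] args) {
        int[] range = resolve(new Bookmark(Kind.PAGE, 4), null, 10);
        System.out.println(range[0] + ".." + range[1]);
    }
}
```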

External Glyph List

Some PDF files need to map between glyph names and Unicode values during text extraction. PDFBox comes with an Adobe Glyph List, but you may encounter files with glyph names that are not in that map. To use your own glyphlist file, supply the file name to the glyphlist_ext JVM property.

Right to Left Text

Extracting text in languages whose text goes from right to left (such as Arabic and Hebrew) in PDF files can result in text that is backwards. PDFBox can normalize and reverse the text if the ICU4J jar file has been placed on the classpath (it is an optional dependency). Note that you should also enable sorting with either org.apache.pdfbox.util.PDFTextStripper or org.apache.pdfbox.ExtractText to ensure accurate output.


+ + + Propchange: pdfbox/site/publish/userguide/text_extraction.html ------------------------------------------------------------------------------ svn:eol-style = native