incubator-cvs mailing list archives

From Apache Wiki <>
Subject [Incubator Wiki] Update of "PDFBoxProposal" by BenLitchfield
Date Wed, 14 Nov 2007 21:59:41 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The following page has been changed by BenLitchfield:

  === Abstract ===
+ PDFBox is a Java library for extracting content from and manipulating PDF documents.
  === Proposal ===
+ PDFBox has basic content extraction and manipulation features.  The community has shown
a need for advanced data extraction and for a high-level API for PDF creation.
  === Background ===
+ The PDFBox project was started in 2002 by Ben Litchfield, its original author, and
currently lives on SourceForge.  Its initial purpose was to extract text content to be
indexed by the Lucene search engine.  In addition to text extraction, it also supports a
low-level API for PDF creation and manipulation.
+ In 2006, discussions began with the FOP team to collaborate on a single PDF library within
the Apache organization.  New projects have expressed interest in advancing the functionality
of PDFBox.
+ Recently, Tika also expressed interest in advancing the content extraction capabilities
of PDFBox.
  === Rationale ===
+ PDF is a common document format, used on the internet and across industries as a
way of sharing documents.  Several Apache projects utilize PDF technologies, but there is not
a single independent PDF library within the Apache organization.
+ The Apache FOP project has many features that overlap those of PDFBox, so the two projects
currently duplicate effort.  Bringing PDFBox into Apache and combining our efforts will result in
a more robust PDF library that will be able to support many more use cases for working with
PDF technologies.
  === Initial Goals ===
+ The initial goals are:
+   * Advanced text extraction techniques
+   * Increasing community involvement
+   * Cooperation with existing Apache projects such as FOP
+   * Increasing support for PDF document features
+   * Adding a high-level API for document creation
+   * Adding a streaming API for document creation
  == Current Status ==
  === Meritocracy ===
+ Not all initial committers are familiar with the meritocracy principles of Apache.  Those
who are not are expected to learn the meritocracy rules, which will be followed throughout
the life of the project.
  === Community ===
+ PDFBox has existed for several years on SourceForge and has an active community that
continues to grow.  There are hundreds of existing projects that utilize the current version
of PDFBox.
  === Core Developers ===
+ Ben Litchfield is the main developer on this project, although it is expected that developers
from a variety of existing Apache projects will become part of the team.
  === Alignment ===
