incubator-cvs mailing list archives

From Apache Wiki <>
Subject [Incubator Wiki] Update of "OpenNLPProposal" by JasonBaldridge
Date Wed, 03 Nov 2010 17:31:18 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "OpenNLPProposal" page has been changed by JasonBaldridge.


  OpenNLP is a Java machine learning toolkit for natural language processing (NLP).
  == Proposal ==
  OpenNLP is a machine learning based toolkit for the processing of natural language text.
 It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech
tagging, named entity extraction, chunking, parsing, and coreference resolution.  These tasks
are usually required to build more advanced text processing services.
  The goal of the OpenNLP project will be to create a mature toolkit for the abovementioned
tasks.  An additional goal is to provide a large number of pre-built models for a variety
of languages, as well as the annotated text resources that those models are derived from.
  == Background ==
+ OpenNLP was started in 2000 by Jason Baldridge and Gann Bierner while they were graduate
students in the Division of Informatics at the University of Edinburgh. The initial codebase
for OpenNLP came out of the Grok natural language parsing toolkit which was used heavily in
both Baldridge's and Bierner's dissertations. The first paper that used Grok, and especially
the components that would become OpenNLP, is [[Hockenmaier, Bierner and Baldridge (2000)|]]
(later updated as the journal article [[Hockenmaier, Bierner, and Baldridge (2004)|]]).
+ In 2000, Grok was split into two projects: OpenNLP tools for the core natural language processing
infrastructure and the Grok/OpenCCG library for parsing with categorial grammar.
Both projects have evolved independently since then and have mostly independent active developer
and user communities. OpenCCG is primarily used in the academic community, while OpenNLP has
considerable use in both academia and industry. As an indication of the academic impact of
OpenNLP, a search on Google Scholar (done in March 2010) returned about 650 publications citing
the package. These results include the OpenNLP website, a few non-publications, and some
self-citations. Based on a scan of these results, we estimate that about 500 actual publications
have used OpenNLP in their work, and there are an additional 50 or so quasi-publications like
surveys and instruction manuals.
+ The activity level of the OpenNLP project has risen and fallen over the past 10+ years,
with a large uptick in the last two years especially. Most recently, due both to the availability
of new documentation and the release of version 1.5, there have been many more downloads
and page views for the OpenNLP project. In fact, September 2010 had the most downloads (1,561)
and project web hits (226,391) of any month since the project’s beginning in 2000, and October
is keeping pace with those figures so far. As a result, OpenNLP has gone from ranking between
2000th and 4000th (January through May 2010) to being ranked 570, 314,
181 and 439 for July, August, September, and October respectively. Full details are available
on the SourceForge statistics page for OpenNLP.  (There are 240,000 projects hosted on SourceForge,
though this figure includes many projects that never actually get started: it seems
that about 7-10% of these are stable, active projects based on a review done in 2007.)
  == Rationale ==
  == Initial Goals ==
+ The initial goals of the proposed project are:
- The initial goals of the proposed project are:
   * Bring the scattered community together and make the development process transparent for
   * Write user documentation about all major components
   * Automated build including train and evaluate regression tests
   * Create a new website
  == Current Status ==
  === Meritocracy ===
  Some of the initial committers are familiar with Apache's idea of meritocracy; others aren't.
 We will bring everybody to the same level as part of the incubation process.
  === Community ===
- OpenNLP already has a considerable user base, both in industry and academia.  
+ OpenNLP already has a considerable user base, both in industry and academia.   Core Developers
- Core Developers
  === Alignment ===
  OpenNLP has tie-ins with several existing Apache projects.  We have been distributing wrappers
for UIMA for some time now (two UIMA committers also contribute to OpenNLP).  We expect this
collaboration to strengthen further after our move to Apache.
  Another obvious connection exists to some of the projects under the Lucene umbrella.  On
the one hand, projects like Solr may benefit from the OpenNLP analysis capabilities to create
specialized search for particular domains.  On the other, OpenNLP may benefit from the machine
learning code that is being developed in Mahout, and maybe get some people from that community
to lend a hand.
  == Known Risks ==
  === Orphaned products ===
- The project has been around for quite a number of years already, it has a well-established
user community and a diverse set of committers.  
+ The project has been around for quite a number of years already, it has a well-established
user community and a diverse set of committers.   Inexperience with Open Source
- Inexperience with Open Source
  === Homogenous Developers ===
  The current group of developers is very diverse: no two developers work for the same organization.
  === Reliance on Salaried Developers ===
- Most of the developers are not paid to work on OpenNLP, so there is little reliance on salaried
+ Most of the developers are not paid to work on OpenNLP, so there is little reliance on salaried
developers. Relationships with Other Apache Products
- Relationships with Other Apache Products
  === An Excessive Fascination with the Apache Brand ===
  == Documentation ==
  == Initial Source ==
  The source code is maintained in two CVS repositories on SourceForge.
- OpenNLP Maxent:
+ OpenNLP Maxent:
- OpenNLP Tools and OpenNLP UIMA:
+ OpenNLP Tools and OpenNLP UIMA:
  == Source and Intellectual Property Submission Plan ==
  The OpenNLP source code is already open source under the AL 2.0.
  == External Dependencies ==
- ||'''Library'''||||'''License'''||||'''Description'''||
- ||JWNL||||BSD||||Java Wordnet Library||
- ||JUnit||||CPL||||Unit Testing Framework||
- ||UIMA||||AL 2.0||||Unstructured Information Management Architecture||
+ ||'''Library''' ||||<style="text-align: center;">'''License''' ||||<style="text-align: center;">'''Description''' ||
+ ||JWNL ||||<style="text-align: center;">BSD ||||<style="text-align: center;">Java Wordnet Library ||
+ ||JUnit ||||<style="text-align: center;">CPL ||||<style="text-align: center;">Unit Testing Framework ||
+ ||UIMA ||||<style="text-align: center;">AL 2.0 ||||<style="text-align: center;">Unstructured Information Management Architecture ||
  == Cryptography ==
  OpenNLP neither provides nor uses any cryptography.
  == Required Resources ==
  === Mailing lists ===
   * opennlp-dev
   * opennlp-private
   * opennlp-user
   * opennlp-commits
  === Subversion Directory ===
  === Issue Tracking ===
  === Other Resources ===
+ == Initial Committers ==
+ ||'''Name''' ||||<style="text-align: center;">'''Email''' ||||<style="text-align: center;">'''CLA''' ||
+ ||Thilo Goetz ||||<style="text-align: center;"> ||||<style="text-align: center;">yes ||
+ ||Grant Ingersoll ||||<style="text-align: center;"> ||||<style="text-align: center;">yes ||
+ ||Jörn Kottmann ||||<style="text-align: center;"> ||||<style="text-align: center;">yes ||
+ ||Thomas Morton ||||<style="text-align: center;"> ||||<style="text-align: center;">no ||
+ ||William Colen ||||<style="text-align: center;"> ||||<style="text-align: center;">no ||
+ ||Jason Baldridge||||<style="text-align: center;">||||<style="text-align:
- == Initial Committers ==
- ||'''Name'''||||'''Email'''||||'''CLA'''||
- ||Thilo Goetz||||||||yes||
- ||Grant Ingersoll||||||||yes||
- ||Jörn Kottmann||||||||yes||
- ||Thomas Morton||||||||no||
- ||William Colen||||||||no||
  == Affiliations ==
+ ||'''Name''' ||||<style="text-align: center;">'''Affiliation''' ||
+ ||Thilo Goetz ||||<style="text-align: center;">IBM ||
+ ||Grant Ingersoll ||||<style="text-align: center;">Lucid Imagination ||
+ ||Jörn Kottmann ||||<style="text-align: center;">Infopaq International A/S ||
+ ||Thomas Morton ||||<style="text-align: center;">Comcast Corporation ||
+ ||William Colen ||||<style="text-align: center;">São Paulo University ||
+ ||Jason Baldridge||||<style="text-align: center;">The University of Texas at Austin||
- ||'''Name'''||||'''Affiliation'''||
- ||Thilo Goetz||||IBM||
- ||Grant Ingersoll||||Lucid Imagination||
- ||Jörn Kottmann||||Infopaq International A/S||
- ||Thomas Morton||||Comcast Corporation||
- ||William Colen||||São Paulo University||
  == Sponsors ==
  === Champion ===
  Grant Ingersoll
  === Nominated Mentors ===
  Grant Ingersoll
  === Sponsoring Entity ===
  The Apache Incubator
