incubator-cvs mailing list archives

From Apache Wiki <>
Subject [Incubator Wiki] Update of "cTAKESProposal" by PeiChen
Date Wed, 30 May 2012 20:20:11 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "cTAKESProposal" page has been changed by PeiChen:

New page:
= cTAKES Proposal =
The following is a proposal for a new top-level project within the ASF.

== Abstract ==
cTAKES (clinical Text Analysis and Knowledge Extraction System) is a natural language processing
tool for extracting information from the clinical free text of electronic medical records.

== Proposal ==
cTAKES (clinical Text Analysis and Knowledge Extraction System)

== Background ==
cTAKES comprises a collection of components and tooling written in Java specifically trained
for the clinical domain, and creates rich linguistic and semantic annotations that can be
utilized by clinical decision support systems & clinical research.
Development of cTAKES was started in 2006 by a team of physicians, computer scientists and
software engineers at the Mayo Clinic. The development team was led by Dr. Guergana Savova
& Dr. Christopher Chute. cTAKES is released open source under an Apache v2.0 license.
This system was deployed at Mayo and is currently an integral part of their clinical data
management infrastructure and has processed in excess of 80 million clinical notes.
Currently, the core development team is co-located at Mayo Clinic and Children's Hospital
Boston following Dr. Savova's move to Children's Hospital Boston in early 2010. Additional
collaborations with external groups at the University of Colorado, Brandeis University, the University
of Pittsburgh, and the University of California at San Diego continue to extend the capabilities of
cTAKES into areas such as temporal reasoning, clinical question answering, and coreference
resolution for the clinical domain.  In 2010, cTAKES was adopted by the I2B2 program and is
a central component of SHARP Area 4.  The current cTAKES components include:
 * Sentence boundary detector
 * Rule-based tokenizer to separate punctuation from words
 * Normalizer
 * Context dependent tokenizer
 * Part-of-speech tagger
 * Phrasal chunker
 * Dictionary lookup annotator and normalization to an ontology
 * Context annotator
 * Negation detector
 * Dependency parser
 * Constituency parser
 * Semantic Role Labeler
 * Coreference resolver
 * Module for the identification of patient smoking status
 * Drug mention annotator

== Rationale ==
We believe there is a clear gap between the cutting-edge technologies developed in research
labs and those used in clinical practice.  We believe that moving cTAKES development to the Apache
development community will lead to faster innovation, better integration with other open source
software, and broader adoption of cTAKES within clinical institutions, and will ultimately improve our
healthcare system.  We believe that having cTAKES at Apache will encourage the development of a basic
set of open source components that will jumpstart clinical NLP developers' efforts.

== Initial Goals ==
The initial goals of the proposed project are: 
 * Bring the community together at the ASF and make the development process transparent for
 * Write user documentation about all major components 
 * Automate builds and set up continuous integration
 * Automate regression tests 
 * Produce an Incubating release 

== Current Status ==
=== Meritocracy ===
Some of the initial committers are familiar with Apache's idea of meritocracy; others are not.
We will bring everybody to the same level as part of the incubation process.
=== Community ===
cTAKES already has a considerable user base, both in industry and academia. 
=== Core Developers ===
See the initial committer list.

=== Alignment ===
cTAKES has tie-ins with several existing Apache projects. We have been building our components
using the UIMA framework. We are also reusing existing Apache projects such as Lucene, Solr, and
Maven. We expect these collaborations to strengthen further after our move to Apache, and we plan to
experiment with other projects under the Lucene umbrella such as Hadoop and Mahout.
Another obvious connection exists to some of the projects under the OpenNLP umbrella.

== Known Risks ==
=== Orphaned products ===
The project has been around for a number of years already; it has a well-established
user community and a diverse set of committers, so the risk of the code being orphaned is low.
=== Inexperience with Open Source ===
cTAKES has been an open source project for many years. Many of the developers are already
familiar with both open source in general and the ASF in particular. 
=== Homogeneous Developers ===
The current group of developers is diverse, spanning multiple countries and institutions.
=== Reliance on Salaried Developers ===
Most of the developers are not paid to work specifically on cTAKES, so there is little reliance
on salaried developers. 

=== Relationships with Other Apache Products ===
NLP is often used in search and other algorithms that work with unstructured data, so cTAKES
is likely to be useful to the Lucene and Solr communities. It also aligns nicely with both
Mahout and UIMA as well as OpenNLP.
=== An Excessive Fascination with the Apache Brand ===
We think the project aligns nicely with the goals of the ASF to disseminate source code to
the public free of charge. Clinical NLP has long been the subject of cutting edge research,
but is often lacking in community and shared knowledge. We believe that by bringing cTAKES
to the ASF, the Apache brand will help deliver clinical NLP capabilities to a much larger
audience; likewise, a cutting-edge project like cTAKES can further the ASF brand by providing
users with tried and true, as well as new, natural language processing capabilities.
== Documentation ==
== Initial Source ==
The source code is maintained in SVN on SourceForge: 

== Source and Intellectual Property Submission Plan ==
The cTAKES source code is already open source under the AL 2.0. 
== External Dependencies ==
||'''Library''' ||||<style="text-align: center;">'''License''' ||||<style="text-align: center;">'''Description''' ||
||JWNL ||||<style="text-align: center;">BSD ||||<style="text-align: center;">Java Wordnet Library ||
||JUnit ||||<style="text-align: center;">CPL ||||<style="text-align: center;">Unit Testing Framework ||
||UIMA ||||<style="text-align: center;">AL 2.0 ||||<style="text-align: center;">Unstructured Information Management Architecture ||

== Cryptography ==
cTAKES neither provides nor uses any cryptography. 
== Required Resources ==
=== Mailing lists ===
 * ctakes-dev 
 * ctakes-private 
 * ctakes-user 
 * ctakes-commits 

=== Subversion Directory ===
=== Issue Tracking ===
=== Other Resources ===
== Initial Committers ==
||'''Name''' ||||<style="text-align: center;">'''Email''' ||||<style="text-align: center;">'''CLA''' ||
||Thilo Goetz ||||<style="text-align: center;"> ||||<style="text-align: center;">yes ||
||Grant Ingersoll ||||<style="text-align: center;"> ||||<style="text-align: center;">yes ||
||Jörn Kottmann ||||<style="text-align: center;"> ||||<style="text-align: center;">yes ||
||Thomas Morton ||||<style="text-align: center;"> ||||<style="text-align: center;">no ||
||William Silva ||||<style="text-align: center;"> ||||<style="text-align: center;">yes ||
||Jason Baldridge ||||<style="text-align: center;"> ||||<style="text-align: center;">yes ||
||James Kosin ||||<style="text-align: center;"> ||||<style="text-align: center;">yes ||

== Affiliations ==

== Sponsors ==
=== Champion ===
Jörn Kottmann
=== Nominated Mentors ===
Marshall Schor
Benson Margulies  
Jörn Kottmann
Grant Ingersoll 

=== Sponsoring Entity ===
The Apache Incubator
