ctakes-user mailing list archives

From "Savova, Guergana" <Guergana.Sav...@childrens.harvard.edu>
Subject RE: cTakes chunking problem.
Date Fri, 31 Oct 2014 14:38:59 GMT
The models for the POS tagger and constituency parser use the OpenNLP implementations; however,
no stock OpenNLP models are used in cTAKES. The cTAKES models are trained on a combination of the Penn
Treebank, GENIA, and clinical data (the clinical portion is about 500K words). Our experiments show
maximized performance across the three corpora when the combined data is used.

The Penn Treebank annotation guidelines were extended to the clinical domain to capture the specifics
of clinical language. That work was done in collaboration with the LDC. The extended guidelines
are available here:

Hope this helps!

From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
Sent: Friday, October 31, 2014 10:31 AM
To: user@ctakes.apache.org
Subject: RE: cTakes chunking problem.

There was some domain-specific data already used in creating the POS and chunking models.

For info on the chunker, see

Tokenization is rule-based within Apache cTAKES.
The default tokenizer is described here:
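
As a rough illustration of what rule-based tokenization means here, the sketch below splits on whitespace and then peels off trailing punctuation as separate tokens. This is a minimal, hypothetical example only; the actual cTAKES tokenizer rules are far more extensive (handling dates, dosages, abbreviations, hyphenation, and so on) and this code is not taken from the cTAKES source.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of rule-based tokenization (NOT the real cTAKES rules):
// split on whitespace, then split sentence-final punctuation into its
// own token so "pain." becomes ["pain", "."].
public class SimpleTokenizer {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            // Find where the trailing run of punctuation begins.
            int end = piece.length();
            while (end > 0 && ".,;:?!".indexOf(piece.charAt(end - 1)) >= 0) {
                end--;
            }
            if (end > 0) {
                tokens.add(piece.substring(0, end));
            }
            // Emit each trailing punctuation character as its own token.
            for (int i = end; i < piece.length(); i++) {
                tokens.add(String.valueOf(piece.charAt(i)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Patient, denies, chest, pain, .]
        System.out.println(tokenize("Patient denies chest pain."));
    }
}
```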

-- James

From: Bala Krishnan [balkiprasanna1984@gmail.com]
Sent: Friday, October 31, 2014 2:25 AM
To: user@ctakes.apache.org
Subject: cTakes chunking problem.

I just have a couple of clarifications. cTAKES uses various open-source NLP libraries for
sentence detection, tokenization, POS tagging, and chunking. Can anyone tell me which trained models
are used for POS tagging and chunking? Are they based on the GENIA corpus? I tried using the GENIA tagger, but
it gives me different results from cTAKES. Can anyone suggest some ideas for incorporating
domain-specific corpora for tagging and chunking in cTAKES?

