incubator-ctakes-user mailing list archives

From "Miller, Timothy" <>
Subject Re: How to turn off the tokenizer and sentence boundary module in clinical pipeline?
Date Wed, 20 Mar 2013 11:16:51 GMT
What do you mean by "default tokens and sentence boundaries"? Does your input data have
gold-standard (human-annotated) information about these, or just spaces and newlines?

Many downstream components use the token and sentence types created by the first two components,
so if you want to do dictionary lookup you will need those types present somehow. If you have
gold standard information to use then the typical approach is to write a CollectionReader
that can take in your gold standard data as well as the text and create the Token and Sentence
annotations. Then you could create a pipeline that is a subset of the AggregatePlaintextUMLSProcessor
without those two components.
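
As a rough illustration of the first half of that approach, here is a minimal sketch in plain Java of reading gold-standard spans before they are turned into annotations. The standoff format ("<type> <begin> <end>" per line) and the names GoldSpanParser and GoldSpan are assumptions made up for this example, not part of cTAKES or UIMA. In a real CollectionReader, the getNext(CAS) method would set the document text and then create one Sentence or Token annotation per parsed span and add it to the indexes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for gold-standard spans stored as "<type> <begin> <end>" lines. */
final class GoldSpanParser {

    /** One gold-standard span: a type label plus half-open offsets [begin, end). */
    static final class GoldSpan {
        final String type; // "Sentence" or "Token" in this assumed format
        final int begin;
        final int end;

        GoldSpan(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Parses the standoff lines and checks every span against the document text. */
    static List<GoldSpan> parse(String standoff, String text) {
        List<GoldSpan> spans = new ArrayList<>();
        for (String line : standoff.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length != 3) {
                throw new IllegalArgumentException("Malformed line: " + trimmed);
            }
            int begin = Integer.parseInt(fields[1]);
            int end = Integer.parseInt(fields[2]);
            if (begin < 0 || begin >= end || end > text.length()) {
                throw new IllegalArgumentException("Offsets out of range: " + trimmed);
            }
            spans.add(new GoldSpan(fields[0], begin, end));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "No fever.";
        String standoff = "Sentence 0 9\nToken 0 2\nToken 3 8\nToken 8 9\n";
        for (GoldSpan span : parse(standoff, text)) {
            // In a CollectionReader, this is where an annotation would be created.
            System.out.println(span.type + " [" + span.begin + "," + span.end + ") "
                    + text.substring(span.begin, span.end));
        }
    }
}
```

Validating the offsets up front means a mismatch between the gold-standard file and the text shows up at read time rather than as a misaligned annotation further down the pipeline.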

If you don't have gold standard tokens and sentences, but you think cTAKES is not performing
correctly on your data, then the best recourse is to try to create your own tokenizer and
sentence detector. If your format is simple enough to handle with a rule-based approach, then
this may be preferable: the current models are trained to handle data without any particular
predictable formatting.
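
To make the rule-based option concrete, here is a minimal sketch in plain Java. It assumes two deliberately simple rules that may not fit your data: each non-blank line is one sentence, and each run of non-whitespace characters is one token. The names OffsetSegmenter and Span are hypothetical and not part of cTAKES. Because every span is recorded as character offsets into the untouched input, the output stays aligned with the original text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: rule-based tokens and sentences that keep character offsets. */
final class OffsetSegmenter {

    /** A half-open character span [begin, end) into the original text. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        String coveredText(String text) {
            return text.substring(begin, end);
        }
    }

    private static final Pattern TOKEN = Pattern.compile("\\S+");
    private static final Pattern LINE = Pattern.compile("[^\\r\\n]+");

    /** Assumed rule: every maximal run of non-whitespace characters is one token. */
    static List<Span> tokens(String text) {
        List<Span> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            result.add(new Span(m.start(), m.end()));
        }
        return result;
    }

    /** Assumed rule: every non-blank line is one sentence, trimmed of surrounding whitespace. */
    static List<Span> sentences(String text) {
        List<Span> result = new ArrayList<>();
        Matcher m = LINE.matcher(text);
        while (m.find()) {
            int begin = m.start();
            int end = m.end();
            while (begin < end && Character.isWhitespace(text.charAt(begin))) {
                begin++;
            }
            while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
                end--;
            }
            if (begin < end) {
                result.add(new Span(begin, end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "BP 120/80 today.\nNo chest pain.";
        for (Span s : sentences(text)) {
            System.out.println("Sentence [" + s.begin + "," + s.end + ") " + s.coveredText(text));
        }
        for (Span t : tokens(text)) {
            System.out.println("Token [" + t.begin + "," + t.end + ") " + t.coveredText(text));
        }
    }
}
```

Inside a pipeline, the same spans would be turned into the token and sentence annotations that the downstream components expect, in place of the two default components.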

Hope this helps,

On Mar 19, 2013, at 6:26 PM, Yonghui Wu wrote:

Hi All,

Currently, I'm using apache-ctakes-3.0.0-incubating
with the clinical pipeline AggregatePlaintextUMLSProcessor.xml.

Is there any way to turn off the tokenization and sentence boundary detection and force the
pipeline to use the default tokens and sentence boundaries, so that we can align the cTAKES
output with the original text?

