ctakes-dev mailing list archives

From "Savova, Guergana" <Guergana.Sav...@childrens.harvard.edu>
Subject RE: sentence detector newline behavior
Date Tue, 21 May 2013 14:10:21 GMT
Clinical narratives contain many sections that are enumerations, where a newline character
must be treated as a sentence break. One example is Current Medications, in which each
line contains a medication and its signature.

The format of the MIMIC notes is a bit strange: there are many newline characters in the
middle of sentences. This is imposed by the native application the notes were created in
(I cannot remember the name of the app), which has a character window and inserts a newline
at the end of that window. I believe we have a pre-processing script that deals with this
issue.
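
For illustration only, here is a rough sketch of what such a pre-processing step might look
like. This is not the script mentioned above. The function name `unwrap_hard_wraps` and the
`width` parameter are hypothetical, and the heuristic is an assumption: a newline is treated
as a wrap only if the first word of the next line would not have fit inside the character
window. Short enumeration lines, such as medication lists, are left alone.

```python
def unwrap_hard_wraps(text: str, width: int) -> str:
    """Join lines that were probably broken by a fixed-width window.

    Hypothetical heuristic: a line break is a wrap if the previous
    physical line plus a space plus the next line's first word would
    have exceeded `width` characters.
    """
    merged = []
    prev_len = 0  # length of the previous physical line, 0 if none or blank
    for line in text.split("\n"):
        stripped = line.strip()
        words = stripped.split()
        first = words[0] if words else ""
        if merged and prev_len and first and prev_len + 1 + len(first) > width:
            # The first word could not have fit: treat the newline as a wrap.
            merged[-1] = merged[-1] + " " + stripped
        else:
            # Otherwise keep the newline as a real line break.
            merged.append(line.rstrip())
        prev_len = len(line.rstrip())
    return "\n".join(merged)


note = (
    "The heart size is normal and\n"
    "the lungs are clear.\n"
    "Current Medications:\n"
    "aspirin 81 mg daily\n"
    "lisinopril 10 mg daily"
)
print(unwrap_hard_wraps(note, width=30))
```

The window width has to be known or estimated per source, and the heuristic will misjudge
enumeration lines that happen to nearly fill the window.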

-----Original Message-----
From: Steven Bethard [mailto:steven.bethard@Colorado.EDU] 
Sent: Tuesday, May 21, 2013 9:59 AM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

On May 21, 2013, at 6:07 AM, "Miller, Timothy" <Timothy.Miller@childrens.harvard.edu> wrote:
> The sentence detector always ends a sentence where there are newlines.
> This is a problem for some notes (e.g. MIMIC radiology notes) where a 
> line can wrap in the middle of a sentence at specified character 
> offsets. In the comments for SentenceDetector, it seems to be split up 
> very logically in that it first runs the opennlp sentence detector, 
> then breaks any detected sentence wherever there is a newline. Questions:
> 1) Would it be good to add a boolean parameter for breaking on newlines?
> 2) If that section was removed/avoided, does the opennlp sentence 
> detector give good results given our model? Or is the model trained on 
> text that always breaks at carriage returns?

For what it's worth, in the ClearTK wrapper for the OpenNLP sentence detector, we only add
extra sentences when there are *multiple* newlines in a row, i.e. "\\s*\\n\\s*\\n\\s*".

And it certainly seems like a good idea to me to have some way of disabling the "every newline
is the end of a sentence" behavior. That seems like a particularly bad default behavior for
most real text.
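
As a minimal sketch of the two behaviors discussed in this thread, the snippet below
post-processes an already detected sentence and either breaks on every newline or only on
runs of multiple newlines. It is not the cTAKES SentenceDetector or the ClearTK wrapper code:
the OpenNLP step is omitted, and the function name `split_sentence` and the flag
`break_on_every_newline` are hypothetical. The blank-line pattern is the same regular
expression quoted above, written as a Python raw string instead of a Java string literal.

```python
import re

# Two or more newlines in a row, with optional surrounding whitespace.
MULTIPLE_NEWLINES = re.compile(r"\s*\n\s*\n\s*")
# Any single newline, with optional surrounding whitespace.
ANY_NEWLINE = re.compile(r"\s*\n\s*")


def split_sentence(sentence: str, break_on_every_newline: bool = False) -> list[str]:
    """Split one detected sentence further at newline boundaries.

    With break_on_every_newline=True, every newline ends a sentence.
    With the default False, only multiple newlines in a row do.
    """
    pattern = ANY_NEWLINE if break_on_every_newline else MULTIPLE_NEWLINES
    return [part for part in pattern.split(sentence) if part.strip()]


text = "The lungs are\nclear.\n\nNo effusion"
print(split_sentence(text))
print(split_sentence(text, break_on_every_newline=True))
```

With the flag off, the wrapped line "The lungs are\nclear." stays in one piece. With it on,
it is cut in two, which is the behavior that causes trouble on the MIMIC radiology notes.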

