ctakes-dev mailing list archives

From Steven Bethard <steven.beth...@Colorado.EDU>
Subject Re: sentence detector newline behavior
Date Tue, 21 May 2013 13:58:00 GMT
On May 21, 2013, at 6:07 AM, "Miller, Timothy" <Timothy.Miller@childrens.harvard.edu>
wrote:
> The sentence detector always ends a sentence where there are newlines.
> This is a problem for some notes (e.g. MIMIC radiology notes) where a
> line can wrap in the middle of a sentence at specified character
> offsets. In the comments for SentenceDetector, it seems to be split up
> very logically in that it first runs the opennlp sentence detector, then
> breaks any detected sentence wherever there is a newline. Questions:
> 1) Would it be good to add a boolean parameter for breaking on newlines?
> 2) If that section was removed/avoided, does the opennlp sentence
> detector give good results given our model? Or is the model trained on
> text that always breaks at carriage returns?

For what it's worth, in the ClearTK wrapper for the OpenNLP sentence detector, we only add
extra sentences when there are *multiple* newlines in a row, i.e. "\\s*\\n\\s*\\n\\s*".

And it certainly seems like a good idea to me to have some way of disabling the "every newline
is the end of a sentence" behavior. That seems like a particularly bad default behavior for
most real text.
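
A minimal sketch of the difference, assuming plain java.util.regex and no cTAKES,
ClearTK, or OpenNLP code at all. The class name NewlineSplitSketch, the method
split, and the flag breakOnEveryNewline are hypothetical names used only for
illustration. The single-newline pattern is likewise an assumption; only the
multiple-newline pattern is the one quoted above. In a real pipeline this
splitting would be applied to the output of the sentence detector rather than
to the raw note.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class NewlineSplitSketch {
    // Pattern quoted in the message: at least two newlines, optionally padded by whitespace.
    static final Pattern MULTIPLE_NEWLINES = Pattern.compile("\\s*\\n\\s*\\n\\s*");

    // Assumed pattern for the "every newline ends a sentence" behavior.
    static final Pattern SINGLE_NEWLINE = Pattern.compile("\\s*\\n\\s*");

    // Splits text into segments; the boolean flag is a hypothetical
    // stand-in for the configuration parameter proposed in the thread.
    static List<String> split(String text, boolean breakOnEveryNewline) {
        Pattern boundary = breakOnEveryNewline ? SINGLE_NEWLINE : MULTIPLE_NEWLINES;
        List<String> segments = new ArrayList<>();
        for (String piece : boundary.split(text)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                segments.add(trimmed);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // A sentence hard-wrapped mid-line, followed by a blank line and a second sentence.
        String note = "The lungs are clear without\nfocal consolidation.\n\nNo pleural effusion.";
        System.out.println(split(note, true).size());   // wrapped sentence is cut in two
        System.out.println(split(note, false).size());  // wrapped sentence stays whole
    }
}
```

With the flag set to false, the hard-wrapped sentence keeps its internal newline
and stays in one segment; only the blank line produces a break.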

Steve