ctakes-dev mailing list archives

From Steven Bethard <steven.beth...@Colorado.EDU>
Subject Re: sentence detector newline behavior
Date Tue, 21 May 2013 18:00:30 GMT
So perhaps we could re-train it to disambiguate newline characters as well?

Steve

On May 21, 2013, at 11:33 AM, "Savova, Guergana" <Guergana.Savova@childrens.harvard.edu> wrote:

> The model is trained to disambiguate punctuation characters which in most cases is the period.
> --Guergana
> 
> -----Original Message-----
> From: Steven Bethard [mailto:steven.bethard@Colorado.EDU] 
> Sent: Tuesday, May 21, 2013 12:07 PM
> To: dev@ctakes.apache.org
> Subject: Re: sentence detector newline behavior
> 
> On May 21, 2013, at 9:53 AM, "Savova, Guergana" <Guergana.Savova@childrens.harvard.edu> wrote:
>> The OpenNLP sentence segmenter is trained on clinical data (cannot remember exactly
>> how many sentences were in the training corpus). This is the model distributed with cTAKES.
>> The only hard rule is the new line.
> 
> If it's trained on clinical data, why does it need a hard rule for that? Why isn't the
> model able to learn when to break on a newline or not?
> 
> Steve
> 
>> --Guergana
>> 
>> -----Original Message-----
>> From: Steven Bethard [mailto:steven.bethard@Colorado.EDU]
>> Sent: Tuesday, May 21, 2013 11:38 AM
>> To: dev@ctakes.apache.org
>> Subject: Re: sentence detector newline behavior
>> 
>> On May 21, 2013, at 9:02 AM, Tim Miller <timothy.miller@childrens.harvard.edu> wrote:
>>> I think the whole reason to use a machine learning approach for 
>>> sentence detection should be to help weigh evidence with these cases 
>>> where hard rules cause problems, mainly 1) when a period does not end 
>>> a sentence, but also 2) where a newline does and does not mean end of sentence.
>> 
>> Perhaps we should consider re-training the OpenNLP sentence segmenter on some clinical
>> data? Presumably we can get sentences from the TreeBank annotations.
>> 
>> I don't know much about the OpenNLP sentence segmenter though. Does it only classify
>> on periods? We'd want to classify all periods and newlines. And we'd want to add features
>> that capture patterns like "XXX: YYY".
>> 
>> Steve
>> 
>>> It is of course bad that in your example if you don't put a sentence 
>>> break you will think that "extravascular findings" is negated. But it 
>>> is also bad if you put a sentence break immediately after the word 
>>> "and" at the end of a line and then you find that your language model 
>>> thinks that "and <eos>" is a good bigram.
>>> 
>>> I will create a jira for the parameter thing, and try to implement it 
>>> and see if it gets ok results with the existing model.
>>> Tim
>>> 
>>> On 05/21/2013 10:11 AM, Masanz, James J. wrote:
>>>> +1 for adding a boolean parameter, or perhaps instead a list of 
>>>> section IDs
>>>> 
>>>> The sentence detector model was trained on data that always breaks at carriage returns.
>>>> 
>>>> It is important for text that is a list something like this:
>>>> 
>>>> Heart Rate: normal
>>>> ENT: negative
>>>> EXTRAVASCULAR FINDINGS: Severe prostatic enlargement.
>>>> 
>>>> And without breaking on the line ending, the word negative would 
>>>> negate extravascular findings
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: dev-return-1605-Masanz.James=mayo.edu@ctakes.apache.org
>>>> [mailto:dev-return-1605-Masanz.James=mayo.edu@ctakes.apache.org] On 
>>>> Behalf Of Miller, Timothy
>>>> Sent: Tuesday, May 21, 2013 7:07 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: sentence detector newline behavior
>>>> 
>>>> The sentence detector always ends a sentence where there are newlines.
>>>> This is a problem for some notes (e.g. MIMIC radiology notes) where 
>>>> a line can wrap in the middle of a sentence at specified character 
>>>> offsets. In the comments for SentenceDetector, it seems to be split 
>>>> up very logically in that it first runs the opennlp sentence 
>>>> detector, then breaks any detected sentence wherever there is a newline.
>>>> Questions:
>>>> 1) Would it be good to add a boolean parameter for breaking on newlines?
>>>> 2) If that section was removed/avoided, does the opennlp sentence 
>>>> detector give good results given our model? Or is the model trained 
>>>> on text that always breaks at carriage returns?
>>>> 
>>>> Tim
>>> 
>> 
> 
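
For readers following the thread, the two-stage behavior described in the original message (run the model, then break every detected sentence at newlines) and the proposed boolean parameter could be sketched roughly as below. This is only an illustration in Python, not the cTAKES SentenceDetector code: `naive_model_split`, `detect_sentences`, and `break_on_newlines` are hypothetical names, and the regex splitter is a crude stand-in for the trained OpenNLP model.

```python
import re
from typing import Callable, List


def naive_model_split(text: str) -> List[str]:
    """Hypothetical stand-in for the trained model: split after ., ! or ?
    when followed by whitespace. The real model is statistical."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def detect_sentences(
    text: str,
    break_on_newlines: bool = True,
    model_split: Callable[[str], List[str]] = naive_model_split,
) -> List[str]:
    """Stage 1: let the model propose sentences.
    Stage 2: optionally break each proposed sentence at every newline."""
    sentences: List[str] = []
    for candidate in model_split(text):
        if break_on_newlines:
            # Hard rule: every line ending also ends a sentence.
            pieces = candidate.splitlines()
        else:
            # Treat newlines inside a sentence as ordinary whitespace.
            pieces = [" ".join(candidate.split())]
        sentences.extend(p.strip() for p in pieces if p.strip())
    return sentences


# List-style note from the thread: the hard rule keeps "negative" away
# from "EXTRAVASCULAR FINDINGS".
list_note = (
    "Heart Rate: normal\n"
    "ENT: negative\n"
    "EXTRAVASCULAR FINDINGS: Severe prostatic enlargement."
)

# Hard-wrapped note (made-up text): the hard rule cuts a sentence after "and".
wrapped_note = "The lungs are clear and\nthe heart size is normal. No effusion."

print(detect_sentences(list_note, break_on_newlines=True))
print(detect_sentences(wrapped_note, break_on_newlines=False))
```

With the flag on, the list-style note yields one sentence per line; with it off, the wrapped note keeps "and" and the following line in the same sentence. Neither setting handles both kinds of note, which is the case made in the thread for letting the model classify newlines itself.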

