ctakes-dev mailing list archives

From "Chen, Pei" <Pei.C...@childrens.harvard.edu>
Subject RE: sentence detector newline behavior
Date Tue, 21 May 2013 17:12:57 GMT
I presume the combination turned out to perform the best in the past...? (based on James and
Guergana's enum/medication examples)
Having a flag to turn off the hard newline rule seems reasonable if it works.
My 1/2 cent...
(short of having to preprocess MIMIC Radiology formatted notes or retraining?)
--Pei

> -----Original Message-----
> From: Steven Bethard [mailto:steven.bethard@Colorado.EDU]
> Sent: Tuesday, May 21, 2013 12:07 PM
> To: dev@ctakes.apache.org
> Subject: Re: sentence detector newline behavior
> 
> On May 21, 2013, at 9:53 AM, "Savova, Guergana"
> <Guergana.Savova@childrens.harvard.edu> wrote:
> > The OpenNLP sentence segmenter is trained on clinical data (cannot
> > remember exactly how many sentences were in the training corpus). This is
> > the model distributed with cTAKES. The only hard rule is the new line.
> 
> If it's trained on clinical data, why does it need a hard rule for that? Why isn't
> the model able to learn when to break on a newline or not?
> 
> Steve
> 
> > --Guergana
> >
> > -----Original Message-----
> > From: Steven Bethard [mailto:steven.bethard@Colorado.EDU]
> > Sent: Tuesday, May 21, 2013 11:38 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: sentence detector newline behavior
> >
> > On May 21, 2013, at 9:02 AM, Tim Miller
> > <timothy.miller@childrens.harvard.edu> wrote:
> >> I think the whole reason to use a machine learning approach for
> >> sentence detection should be to help weigh evidence with these cases
> >> where hard rules cause problems, mainly 1) when a period does not end
> >> a sentence, but also 2) where a newline does and does not mean end of
> >> sentence.
> >
> > Perhaps we should consider re-training the OpenNLP sentence segmenter
> > on some clinical data? Presumably we can get sentences from the TreeBank
> > annotations.
> >
> > I don't know much about the OpenNLP sentence segmenter though. Does
> > it only classify on periods? We'd want to classify all periods and newlines. And
> > we'd want to add features that capture patterns like "XXX: YYY".
> >
> > Steve
> >
> >> It
> >> is of course bad that in your example if you don't put a sentence
> >> break you will think that "extravascular findings" is negated. But it
> >> is also bad if you put a sentence break immediately after the word
> >> "and" at the end of a line and then you find that your language model
> >> thinks that "and <eos>" is a good bigram.
> >>
> >> I will create a jira for the parameter thing, and try to implement it
> >> and see if it gets ok results with the existing model.
> >> Tim
> >>
> >> On 05/21/2013 10:11 AM, Masanz, James J. wrote:
> >>> +1 for adding a boolean parameter, or perhaps instead a list of
> >>> section IDs
> >>>
> >>> The sentence detector model was trained on data that always breaks at
> >>> carriage returns.
> >>>
> >>> It is important for text that is a list, something like this:
> >>>
> >>> Heart Rate: normal
> >>> ENT: negative
> >>> EXTRAVASCULAR FINDINGS: Severe prostatic enlargement.
> >>>
> >>> Without a break at the line ending, the word "negative" would
> >>> negate "extravascular findings".
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: dev-return-1605-Masanz.James=mayo.edu@ctakes.apache.org
> >>> [mailto:dev-return-1605-Masanz.James=mayo.edu@ctakes.apache.org] On
> >>> Behalf Of Miller, Timothy
> >>> Sent: Tuesday, May 21, 2013 7:07 AM
> >>> To: dev@ctakes.apache.org
> >>> Subject: sentence detector newline behavior
> >>>
> >>> The sentence detector always ends a sentence where there are
> >>> newlines. This is a problem for some notes (e.g. MIMIC radiology
> >>> notes) where a line can wrap in the middle of a sentence at specified
> >>> character offsets. In the comments for SentenceDetector, it seems to
> >>> be split up very logically in that it first runs the opennlp sentence
> >>> detector, then breaks any detected sentence wherever there is a
> >>> newline. Questions:
> >>> 1) Would it be good to add a boolean parameter for breaking on
> >>> newlines?
> >>> 2) If that section was removed/avoided, does the opennlp sentence
> >>> detector give good results given our model? Or is the model trained
> >>> on text that always breaks at carriage returns?
> >>>
> >>> Tim
> >>
> >
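
Editorial note: the two-step behavior described in the original post (run the statistical detector, then break every detected sentence at newlines) and the proposed boolean flag can be sketched roughly as below. This is a minimal Python illustration, not the cTAKES SentenceDetector code. The names `naive_detector`, `detect_sentences`, and `break_on_newlines` are hypothetical, and the punctuation-based detector is only a stand-in for the trained OpenNLP model.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def naive_detector(text: str) -> List[Span]:
    """Hypothetical stand-in for the trained model: end a sentence after
    '.', '!' or '?' when followed by whitespace or end of text."""
    spans = []
    start = 0
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        spans.append((start, match.end()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def trim(text: str, begin: int, end: int) -> Span:
    """Shrink a span so it neither starts nor ends with whitespace."""
    while begin < end and text[begin].isspace():
        begin += 1
    while end > begin and text[end - 1].isspace():
        end -= 1
    return begin, end


def detect_sentences(
    text: str,
    detector: Callable[[str], List[Span]] = naive_detector,
    break_on_newlines: bool = True,
) -> List[Span]:
    """Run the detector, then optionally apply the hard newline rule."""
    results = []
    for begin, end in detector(text):
        pieces = []
        piece_start = begin
        if break_on_newlines:
            # Hard rule: every line ending inside a sentence ends that sentence.
            for match in re.finditer(r"\r\n|\r|\n", text[begin:end]):
                pieces.append((piece_start, begin + match.start()))
                piece_start = begin + match.end()
        pieces.append((piece_start, end))
        for piece_begin, piece_end in pieces:
            piece_begin, piece_end = trim(text, piece_begin, piece_end)
            if piece_begin < piece_end:  # skip whitespace-only pieces
                results.append((piece_begin, piece_end))
    return results


if __name__ == "__main__":
    wrapped = "There is no evidence of\npneumothorax. Lungs are clear."
    for flag in (True, False):
        spans = detect_sentences(wrapped, break_on_newlines=flag)
        print(flag, [wrapped[b:e] for b, e in spans])
```

With the flag on, a hard-wrapped note is cut mid-sentence (the problem raised for MIMIC radiology notes). With the flag off, a list such as "Heart Rate: normal / ENT: negative / EXTRAVASCULAR FINDINGS: ..." collapses into a single sentence under this stand-in detector, which is the negation-scope problem James describes. Whether the real model handles either case well without the hard rule is exactly the open question in the thread.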

