ctakes-dev mailing list archives

From "Masanz, James J." <Masanz.Ja...@mayo.edu>
Subject RE: sentence detector newline behavior
Date Wed, 22 Jan 2014 20:05:48 GMT

I know there are notes where there are multiple sentences on a line, but no typical
sentence-ending punctuation at the end of the line (or no punctuation at all at the end of the line).
And in those sections, negation can be important.  So simply using Lines would not suffice
in those cases because it would run together sentences where there is more than one on a
line. And using sentences alone (as found by OpenNLP 1.5) would not suffice because it would
run together sentences from different lines.
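
A minimal sketch of the combination described above, in Python for brevity. It assumes a naive punctuation-based splitter as a stand-in for a trained detector such as OpenNLP's; `split_sentences` and `split_lines_then_sentences` are hypothetical names, not cTAKES or OpenNLP APIs.

```python
import re

# Naive stand-in for a trained sentence detector: it only splits after
# '.', '!' or '?' followed by whitespace. A real model would do better.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Sentences alone: newlines are treated like any other whitespace."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def split_lines_then_sentences(note):
    """Force a break at every newline, then split each line into sentences."""
    segments = []
    for line in note.splitlines():
        segments.extend(split_sentences(line))
    return segments


note = "No fever. Denies chest pain\nCough present"

# Lines alone run together the two sentences on the first line.
print(note.splitlines())
# Sentences alone run together text from different lines.
print(split_sentences(note))
# Both together keep all three statements apart.
print(split_lines_then_sentences(note))
```

With this toy note, only the combined approach yields three separate segments, which is what a negation scope would need.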

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu] 
Sent: Wednesday, January 22, 2014 1:33 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Just whistling in the wind here ...

Perhaps before any changes are made to universally toggle cTakes in one direction or the other,
we can take a poll of when & where cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring
CR/LF) as opposed to a Line (CR/LF delimited PLUS -sentence-)

If some capabilities like negation detection require -lines- then would it make more sense
to have Sentence ignore -newline- and negation detection itself split the Sentence into line
items?  If an annotator is interested in list items, each of which may be on a distinct -line-,
then it can split up the Sentence as needed.  I think that James hints that cTakes code already
does this in some places.  

If a good deal of functionality requires -newline- delimited types, would it make sense to
introduce a type Line?  If something uses a structured list it could iterate through Line
types, while something using pure text could iterate through Sentence types.  This facilitates
section-by-section different behavior, does not require any decision on global defaults, and
makes data selection for training Sentence a non-issue wrt line breaks.  However, it adds to
the system and would require a per-use choice by developers OR a toggle by users
(back to the default decision).   Perhaps this has already been tried?
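
The Line-versus-Sentence idea above can be sketched roughly as follows. This is an illustration in Python, not cTAKES code: `Span`, `line_spans`, and `lines_within` are hypothetical names, and in cTAKES the equivalent would presumably be annotation types in the type system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    kind: str   # "Line" or "Sentence" (hypothetical type names)
    begin: int  # character offset, inclusive
    end: int    # character offset, exclusive


def line_spans(text):
    """One Line span per non-blank newline-delimited line."""
    spans, offset = [], 0
    for raw in text.splitlines(keepends=True):
        content = raw.rstrip("\r\n")
        if content.strip():
            spans.append(Span("Line", offset, offset + len(content)))
        offset += len(raw)
    return spans


def lines_within(sentence, lines):
    """Lines overlapping a Sentence, clipped to the Sentence boundaries."""
    clipped = []
    for line in lines:
        begin = max(line.begin, sentence.begin)
        end = min(line.end, sentence.end)
        if begin < end:
            clipped.append(Span("Line", begin, end))
    return clipped


text = "Medications:\naspirin\nno warfarin"
# Pretend the sentence detector ignored newlines and returned one Sentence.
sentence = Span("Sentence", 0, len(text))
for span in lines_within(sentence, line_spans(text)):
    print(repr(text[span.begin:span.end]))
```

An annotator that cares about list items would iterate the clipped Line spans, while one that works on running text would use the Sentence span directly.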


-----Original Message-----
From: Masanz, James J. [mailto:Masanz.James@mayo.edu] 
Sent: Wednesday, January 22, 2014 1:06 PM
To: 'dev@ctakes.apache.org'
Subject: RE: sentence detector newline behavior

The only rule I know of is that cTAKES (prior to ytex integration) always forces a sentence
break at a newline.
This was because the clinical notes cTAKES originally processed never had newlines in the middle
of a sentence, but did need sentence breaks to occur at end of sentence for good negation
detection on those notes.
I think Guergana earlier mentioned other EMRs also have this need, but it seems to not be

From others' posts, it seems that we could use an option in cTAKES to turn off this forcing
of sentence breaks at newlines (or depending on how you look at it, an option to turn on the
forcing of sentence breaks if we change the default behavior)

I think we (cTAKES) need to decide the following:
 - do we want to do this for entire notes, or would it be worth it to have it be on a section-by-section basis
 - what do we make the default behavior - to force or not to force newlines to be sentence breaks
 - what data (that contains newlines) will we use for training the sentence detector

Regardless of those answers, I think OpenNLP support for including newlines in training data
would be valuable for those others who have sentences that span lines.  And having an option
on OpenNLP to always break at newline would be useful for at least some cTAKES users (and
we could remove the cTAKES code that does that)
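
One way to picture the option discussed above, including a section-by-section setting, is the following Python sketch. Everything in it is an assumption made for illustration: the section names, the `BREAK_AT_NEWLINE` table, and the naive punctuation splitter standing in for the trained detector. It returns plain strings and does not track offsets.

```python
import re

# Naive stand-in for a trained detector: split after '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def detect_sentences(text, break_at_newline):
    """Split text into sentences, optionally forcing a break at each newline."""
    if break_at_newline:
        chunks = text.splitlines()
    else:
        # Treat newlines as ordinary whitespace so sentences may span lines.
        chunks = [" ".join(text.split())]
    sentences = []
    for chunk in chunks:
        sentences.extend(s for s in _SENTENCE_END.split(chunk.strip()) if s)
    return sentences


# Hypothetical per-section settings; the section names are illustrative only.
BREAK_AT_NEWLINE = {"MEDICATIONS": True, "HISTORY": False}
# Default mirrors the forced-break behavior described in this message.
DEFAULT_BREAK_AT_NEWLINE = True


def detect_in_section(section, text):
    flag = BREAK_AT_NEWLINE.get(section, DEFAULT_BREAK_AT_NEWLINE)
    return detect_sentences(text, flag)


print(detect_in_section("MEDICATIONS", "aspirin 81 mg\nno warfarin"))
print(detect_in_section("HISTORY", "Patient reports a cough\nfor two weeks. No fever."))
```

Flipping `DEFAULT_BREAK_AT_NEWLINE` corresponds to the second question in the list above, what the default should be when a section has no explicit setting.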

-- James

-----Original Message-----
From: dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org]
On Behalf Of Jörn Kottmann
Sent: Tuesday, January 21, 2014 4:29 AM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

Yes, exactly, OPENNLP-602 is about training a sentence detector model which can use a new
line as an end-of-sentence character.

In case you have certain rules to split sentences we should have a look at them. The Sentence
Detector could be extended to support a user-provided rule-based splitter. If there is an
interest in that we could probably get it into 1.6.0 as well.


On 01/20/2014 10:02 PM, Chen, Pei wrote:
> I presume Joern was suggesting that if he supports new lines in the opennlp SentenceDetector
> (either part of the trained models or post processing with some rules?) cTAKES will be able
> to use it out of the box and we should be able to remove any additional custom logic that we
> currently have - which seems like a good idea.
> [but when to use within cTAKES individual components such as negation 
> might be another discussion?] --Pei
>> On Jan 20, 2014, at 12:46 PM, "vijay garla" <vngarla@gmail.com> wrote:
>> The sentence detection opennlp model used by ctakes does not split 
>> sentences at newlines - there is additional logic in the cTAKES 
>> sentence splitter that does this (and an alternative impl that 
>> doesn't is in the ytex branch). Afaik no retraining / change to the 
>> feature representation is necessary.
>> Vj
>>> On Monday, January 20, 2014, Jörn Kottmann <kottmann@gmail.com> wrote:
>>> Hi all,
>>> currently I have quite a bit of time to work on OpenNLP, and would 
>>> like to help you out with this issue.
>>> Here is the follow up issue for this change:
>>> https://issues.apache.org/jira/browse/OPENNLP-602
>>> I am still trying to figure out what would be the best option to 
>>> implement this.
>>> In the training data a user could just use a special tag to identify 
>>> the chars.
>>> Instead of <NEWLINE> it might be better to use <CR> and <LF> to
>>> encode these two chars in the training data. Any thoughts?
>>> I am planning to release this as part of OpenNLP 1.6.0.
>>> Thanks,
>>> Jörn
>>>> On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
>>>>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>>>>> That's awesome! It might be worth trying at least. How does the 
>>>>> training process change? Previously the training data would be one 
>>>>> sentence per line, but with newlines as possible mid-sentence 
>>>>> characters that could be trouble, is there a new representation 
>>>>> for training data? Or would we have to use the training api?
>>>> Good point, yes that will be a problem with the default training 
>>>> format, but it shouldn't be hard to solve. In the format itself we 
>>>> could define a new line tag e.g.
>>>> <NEWLINE> to mark new lines.
>>>> As a hack to make it work with 1.5.3 you could instead use a 
>>>> special char as a replacement for the new line char.
>>>> When you pass the text down to the sentence detector a simple 
>>>> string replace could be used to convert all new line chars to the 
>>>> special new line marker char.
>>>> If things work out for you performance wise as well we will just 
>>>> integrate it properly into OpenNLP for the next release.
>>>> Could you produce a sentence detector training file with a new line 
>>>> marker char?
>>>> You should try to pick a char you can also pass in on a terminal 
>>>> otherwise you have to use the API to train the model. The built-in 
>>>> cross validation could be used to evaluate the performance.
>>>> Jörn
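
The 1.5.3 workaround quoted above (swap each newline for a marker character before the text reaches the sentence detector) might look like the sketch below. It is written in Python only to show the string handling; the marker character and both function names are assumptions, and the marker must not occur in the notes themselves.

```python
# Arbitrary marker choice (pilcrow); assumed to be absent from the notes.
NEWLINE_MARKER = "\u00b6"


def encode_newlines(text):
    """Replace line breaks with the marker so a multi-line sentence
    fits on one line of a one-sentence-per-line training file."""
    if NEWLINE_MARKER in text:
        raise ValueError("marker character already occurs in the text")
    # CRLF collapses to a single marker, so offsets shift for CRLF input.
    return text.replace("\r\n", NEWLINE_MARKER).replace("\n", NEWLINE_MARKER)


def decode_newlines(text):
    """Restore newlines (as LF) after sentence detection."""
    return text.replace(NEWLINE_MARKER, "\n")


sentence = "Patient reports a cough\nfor two weeks."
encoded = encode_newlines(sentence)
print(encoded)
print(decode_newlines(encoded) == sentence)
```

The same `encode_newlines` step would be applied both when writing the training file and when passing text to the trained model, so the model sees the marker consistently.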
