Mailing-List: contact dev-help@ctakes.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@ctakes.apache.org
Received-SPF: pass (nike.apache.org: domain of
 Sean.Finan@childrens.harvard.edu designates 134.174.13.92 as permitted
 sender)
From: "Finan, Sean" <Sean.Finan@childrens.harvard.edu>
To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
Subject: RE: sentence detector newline behavior
Thread-Topic: sentence detector newline behavior
Thread-Index: 
 Ac5WG7UABEU57a2lTXuxbf1ulBTIZAA6h/4AL8HGn4AACRgpgP//41cjgAE1IICAAhGhAIAAQt9g///e6QCAAFLhUIAAm45g
Date: Wed, 22 Jan 2014 20:47:09 +0000
Message-ID: <393252F14C42F946952F1ED75D316CAD3865BE97@CHEXMBX2A.CHBOSTON.ORG>
References: <E084D8EFE2B03A408B324458C5212E9421157B32@CHEXMBX3A.CHBOSTON.ORG>
 <996FC801C05DF64A84246A106FACACD010AAD8@MSGPEXCHA08A.mfad.mfroot.org>
 <519B8C79.7060607@childrens.harvard.edu>
 <82291210-B468-49DF-BDC0-BAB09C84CAAE@colorado.edu>
 <E5A9FA5ABBF1CA4085D4F0794852A51E2102B090@CHEXMBX3A.CHBOSTON.ORG>
 <01F1B83B-C2EE-45B5-A47B-8BCE096CD419@colorado.edu>
 <E5A9FA5ABBF1CA4085D4F0794852A51E2102B14D@CHEXMBX3A.CHBOSTON.ORG>
 <AF4BFC93-4C26-43AF-9D8C-57D670D90F94@colorado.edu>
 <519C8D92.7080407@gmail.com>
 <E084D8EFE2B03A408B324458C5212E94211587FD@CHEXMBX3A.CHBOSTON.ORG>
 <519CB3F4.20404@gmail.com> <52DD23AF.3090105@gmail.com>
 <CADGOtThHQw25_KKda6aDLT-+Ruiz5a60VW9AQb3_W4scbV_K1Q@mail.gmail.com>
 <F7475864-AF69-46D6-A104-CEE1F7B2A346@childrens.harvard.edu>
 <52DE4BFD.803@gmail.com> <d2fb82$86rb39@ironport10.mayo.edu>
 <393252F14C42F946952F1ED75D316CAD3865AE43@CHEXMBX2A.CHBOSTON.ORG>
 <d2fb82$86ssqn@ironport10.mayo.edu>
 <393252F14C42F946952F1ED75D316CAD3865AE84@CHEXMBX2A.CHBOSTON.ORG>
In-Reply-To: <393252F14C42F946952F1ED75D316CAD3865AE84@CHEXMBX2A.CHBOSTON.ORG>
Accept-Language: en-US
Content-Language: en-US
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

On my end it looks like my email was reformatted and some of my -newline- removed in those last examples ...

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, January 22, 2014 3:42 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Thanks James

> but then no typical sentence ending punctuation at the end of the line

Gotcha.

> So simply using Lines would not suffice in those cases because it
> would run together sentences where there are more than one on a line

I was actually thinking about something like a Line using -sentence breaks- in addition to -newline-.  In other words, a Sentence being what cTakes detects by ignoring CR/LF, and Lines being those Sentences subdivided by -newline-.  Perhaps "Line" is a horrible moniker.   Regardless, it doesn't solve the problem of inappropriately missing punctuation.  I was focused a little more on the difference between persistent auto- line wrapping and structured information like lists, where the first benefits from Sentence and the second from Line.

"The Patient has
 been prescribed two
 medications."

"Prescriptions:
  Advil
  Tylenol
  No Aspirin"


However, when it comes to the problem that you mention, there is no benefit to a Line.

"The patient has been seen six times in the past week.  Pain has been persistent for ten days Advil and Tylenol have been prescribed"
-- 2 sentences, 3 lines


"The patient has been seen six times in the past week.
Pain has been persistent for ten days
Advil and Tylenol have been prescribed"
-- 2 sentences, 3 lines

"The patient has been seen six times in
 the past week.  Pain has been persistent  for ten days  Advil and Tylenol have been prescribed"
-- 2 sentences, 5 lines

Nothing can really be done for the last bit where punctuation is missing.
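
To make the Sentence/Line distinction above concrete, here is a minimal sketch (not cTAKES code). It assumes some detector has already produced a sentence span while ignoring CR/LF, and it subdivides that span at newlines to get the "Line" pieces. The names `Span`, `trim` and `split_sentence_into_lines` are hypothetical and exist only for this illustration.

```python
from typing import List, Tuple

# (begin, end) character offsets into the note text, end exclusive.
Span = Tuple[int, int]


def trim(text: str, begin: int, end: int) -> Span:
    """Shrink a span so it neither starts nor ends with whitespace."""
    while begin < end and text[begin].isspace():
        begin += 1
    while end > begin and text[end - 1].isspace():
        end -= 1
    return (begin, end)


def split_sentence_into_lines(text: str, sentence: Span) -> List[Span]:
    """Subdivide one sentence span at CR/LF, dropping whitespace-only pieces."""
    begin, end = sentence
    lines: List[Span] = []
    start = begin
    for i in range(begin, end):
        if text[i] in "\r\n":
            if text[start:i].strip():
                lines.append(trim(text, start, i))
            start = i + 1
    if text[start:end].strip():
        lines.append(trim(text, start, end))
    return lines


if __name__ == "__main__":
    note = "Prescriptions:\n  Advil\n  Tylenol\n  No Aspirin"
    for b, e in split_sentence_into_lines(note, (0, len(note))):
        print(repr(note[b:e]))
```

A consumer of wrapped prose would keep using the whole sentence span, while a consumer of list-like sections would iterate over the pieces. As noted above, neither view helps when end-of-sentence punctuation is simply missing.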


-----Original Message-----
From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
Sent: Wednesday, January 22, 2014 3:07 PM
To: 'dev@ctakes.apache.org'
Subject: RE: sentence detector newline behavior


I know there are notes where there are multiple sentences on a line, but then no typical sentence ending punctuation at the end of the line (or no punctuation at all at the end of the line). And in those sections, negation can be important.  So simply using Lines would not suffice in those cases because it would run together sentences where there are more than one on a line. And using sentences alone (as found by OpenNLP 1.5) would not suffice because it would run together sentences from different lines.

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, January 22, 2014 1:33 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Just whistling in the wind here ...

Perhaps before any changes are made to universally toggle cTakes in one direction or the other, we can take a poll of when & where cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed to a Line (CR/LF delimited PLUS -sentence-)

If some capabilities like negation detection require -lines- then would it make more sense to have Sentence ignore -newline- and negation detection itself split the Sentence into line items?  If an annotator is interested in list items, each of which may be on a distinct -line-, then it can split up the Sentence as needed.  I think that James hints that cTakes code already does this in some places.

If a good deal of functionality requires -newline- delimited types, would it make sense to introduce a type Line?  If something uses a structured list it could iterate through Line types, while something using pure text could iterate through Sentence types.  This facilitates section-by-section different behavior, does not require any decision on global defaults, and makes data selection for training Sentence a non-issue wrt line breaks.  However, it adds to the system and would require a per-use choice decision by developers OR a toggle by users (back to the default decision).   Perhaps this has already been tried?

Sean


-----Original Message-----
From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
Sent: Wednesday, January 22, 2014 1:06 PM
To: 'dev@ctakes.apache.org'
Subject: RE: sentence detector newline behavior

The only rule I know of is that cTAKES (prior to ytex integration) always forces a sentence break at a newline.
This was because the clinical notes cTAKES originally processed never had newlines in the middle of a sentence, but did need sentence breaks to occur at end of sentence for good negation detection on those notes.
I think Guergana earlier mentioned other EMRs also have this need, but it seems to not be ubiquitous.

From others' posts, it seems that we could use an option in cTAKES to turn off this forcing of sentence breaks at newlines (or depending on how you look at it, an option to turn on the forcing of sentence breaks if we change the default behavior)

I think we (cTAKES) need to decide the following:
 - do we want to do this for entire notes, or would it be worth it to have it be on a section-by-section basis.
 - what do we make the default behavior - to force or not to force newlines to be sentence breaks
 - what data (that contains newlines) will we use for training the sentence detector

Regardless of those answers, I think OpenNLP support for including newlines in training data would be valuable for those others who have sentences that span lines.  And having an option on OpenNLP to always break at newline would be useful for at least some cTAKES users (and we could remove the cTAKES code that does that)

-- James
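
As a rough illustration of the first two decisions listed above (whole-note versus section-by-section, and which default to pick), the following sketch post-processes the spans from a detector that ignores newlines. Everything here is an assumption made for the example: `DEFAULT_FORCE_BREAK`, `SECTION_OVERRIDES`, the section names and `sentence_spans` are hypothetical and do not correspond to any existing cTAKES or OpenNLP setting.

```python
import re
from typing import Dict, Iterable, List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive

# Hypothetical configuration: a global default plus per-section overrides.
DEFAULT_FORCE_BREAK = True
SECTION_OVERRIDES: Dict[str, bool] = {"HISTORY": False}

# A maximal run of characters containing neither CR nor LF.
_LINE = re.compile(r"[^\r\n]+")


def sentence_spans(text: str, detected: Iterable[Span], section: str) -> List[Span]:
    """Return detector spans, optionally forcing a break at every newline."""
    force = SECTION_OVERRIDES.get(section, DEFAULT_FORCE_BREAK)
    if not force:
        return list(detected)
    result: List[Span] = []
    for begin, end in detected:
        for match in _LINE.finditer(text, begin, end):
            piece = match.group()
            stripped = piece.strip()
            if not stripped:
                continue  # skip whitespace-only lines
            lead = len(piece) - len(piece.lstrip())
            start = match.start() + lead
            result.append((start, start + len(stripped)))
    return result


if __name__ == "__main__":
    note = "Pain persists.\nNo fever\nNo chills"
    detected = [(0, 14), (15, len(note))]
    for name in ("MEDICATIONS", "HISTORY"):
        print(name, [note[b:e] for b, e in sentence_spans(note, detected, name)])
```

With the default, "No fever" and "No chills" become separate spans, which is the case that matters for negation detection; the overridden section keeps the detector's spans untouched.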

-----Original Message-----
From: dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Jörn Kottmann
Sent: Tuesday, January 21, 2014 4:29 AM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

Yes, exactly, OPENNLP-602 is about training a sentence detector model which can use a new line as an end-of-sentence character.

In case you have certain rules to split sentences we should have a look at them. The Sentence Detector could be extended to support a user-provided rule-based splitter. If there is an interest in that we could probably get it into 1.6.0 as well.

Jörn

On 01/20/2014 10:02 PM, Chen, Pei wrote:
> I presume Joern was suggesting that if he supports new lines in the opennlp SentenceDetector (either part of the trained models or post processing with some rules?) cTAKES will be able to use it out of the box and we should be able to remove any additional custom logic that we currently have, which seems like a good idea.
>
> [but when to use within cTAKES individual components such as negation
> might be another discussion?] --Pei
>
>> On Jan 20, 2014, at 12:46 PM, "vijay garla" <vngarla@gmail.com> wrote:
>>
>> The sentence detection opennlp model used by ctakes does not split
>> sentences at newlines - there is additional logic in the cTAKES
>> sentence splitter that does this (and an alternative impl that
>> doesn't is in the ytex branch). Afaik no retraining / change to the
>> feature representation is necessary.
>>
>> Vj
>>
>>> On Monday, January 20, 2014, Jörn Kottmann <kottmann@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> currently I have quite a bit of time to work on OpenNLP, and would
>>> like to help you out with this issue.
>>>
>>> Here is the follow up issue for this change:
>>> https://issues.apache.org/jira/browse/OPENNLP-602
>>>
>>> I am still trying to figure out what would be the best option to
>>> implement this.
>>> In the training data a user could just use a special tag to identify
>>> the chars.
>>>
>>> Instead of <NEWLINE> it might be better to use <CR> and <LF> to
>>> encode these two chars in the training data. Any thoughts?
>>>
>>> I am planning to release this as part of OpenNLP 1.6.0.
>>>
>>> Thanks,
>>> Jörn
>>>
>>>> On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
>>>>
>>>>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>>>>>
>>>>> That's awesome! It might be worth trying at least. How does the
>>>>> training process change? Previously the training data would be one
>>>>> sentence per line, but with newlines as possible mid-sentence
>>>>> characters that could be trouble, is there a new representation
>>>>> for training data? Or would we have to use the training api?
>>>> Good point, yes that will be a problem with the default training
>>>> format, but it shouldn't be hard to solve. In the format itself we
>>>> could define a new line tag e.g.
>>>> <NEWLINE> to mark new lines.
>>>> As a hack to make it work with 1.5.3 you could instead use a
>>>> special char as a replacement for the new line char.
>>>> When you pass the text down to the sentence detector a simple
>>>> string replace could be used to convert all new line chars to the
>>>> special new line marker char.
>>>>
>>>> If things work out for you performance wise as well we will just
>>>> integrate it properly into OpenNLP for the next release.
>>>>
>>>> Could you produce a sentence detector training file with a new line
>>>> marker char?
>>>>
>>>> You should try to pick a char you can also pass in on a terminal
>>>> otherwise you have to use the API to train the model. The built-in
>>>> cross validation could be used to evaluate the performance.
>>>>
>>>> Jörn
>>>
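
For reference, the 1.5.3 workaround quoted above (swap each newline for a special marker character before the text reaches the sentence detector) could look roughly like the sketch below. The marker characters and the names `encode_newlines` / `decode_newlines` are assumptions made for this example; the thread does not name a specific character, and nothing here is OpenNLP API. The swap is one character for one character, so offsets reported on the encoded text still line up with the original.

```python
# Hypothetical marker characters; any character that never occurs in the
# notes (and that can be typed on a terminal) would do.
LF_MARKER = "\u00b6"  # pilcrow sign stands in for "\n"
CR_MARKER = "\u00a4"  # currency sign stands in for "\r"


def encode_newlines(text: str) -> str:
    """Replace CR and LF with marker characters, one for one."""
    for marker in (LF_MARKER, CR_MARKER):
        if marker in text:
            raise ValueError(f"marker {marker!r} already occurs in the text")
    return text.replace("\n", LF_MARKER).replace("\r", CR_MARKER)


def decode_newlines(text: str) -> str:
    """Undo encode_newlines."""
    return text.replace(LF_MARKER, "\n").replace(CR_MARKER, "\r")


if __name__ == "__main__":
    sentence = "The patient has\nbeen seen six times."
    # One training sentence per physical line, mid-sentence newline encoded.
    print(encode_newlines(sentence))
    assert decode_newlines(encode_newlines(sentence)) == sentence
```

The same encoding would be applied both when writing the one-sentence-per-line training file and when passing note text to the trained model.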