Mailing-List: contact dev-help@ctakes.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@ctakes.apache.org
Received-SPF: pass (athena.apache.org: domain of kottmann@gmail.com designates
 74.125.83.46 as permitted sender)
Message-ID: <52E2D79C.60101@gmail.com>
Date: Fri, 24 Jan 2014 22:14:04 +0100
From: =?ISO-8859-1?Q?J=F6rn_Kottmann?= <kottmann@gmail.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0
MIME-Version: 1.0
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior
References: <E084D8EFE2B03A408B324458C5212E9421157B32@CHEXMBX3A.CHBOSTON.ORG>
 <E5A9FA5ABBF1CA4085D4F0794852A51E2102B14D@CHEXMBX3A.CHBOSTON.ORG>
 <AF4BFC93-4C26-43AF-9D8C-57D670D90F94@colorado.edu>
 <519C8D92.7080407@gmail.com>
 <E084D8EFE2B03A408B324458C5212E94211587FD@CHEXMBX3A.CHBOSTON.ORG>
 <519CB3F4.20404@gmail.com> <52DD23AF.3090105@gmail.com>
 <CADGOtThHQw25_KKda6aDLT-+Ruiz5a60VW9AQb3_W4scbV_K1Q@mail.gmail.com>
 <F7475864-AF69-46D6-A104-CEE1F7B2A346@childrens.harvard.edu>
 <52DE4BFD.803@gmail.com> <d2fb82$86rb39@ironport10.mayo.edu>
 <393252F14C42F946952F1ED75D316CAD3865AE43@CHEXMBX2A.CHBOSTON.ORG>
 <d2fb82$86ssqn@ironport10.mayo.edu>
 <393252F14C42F946952F1ED75D316CAD3865AE84@CHEXMBX2A.CHBOSTON.ORG>
 <393252F14C42F946952F1ED75D316CAD3865BE97@CHEXMBX2A.CHBOSTON.ORG>
 <CADGOtTiSN0XQoBmYwQ7KSbGBSTvbghSrNJm7eXMggraB6fSd_g@mail.gmail.com>
 <CAOf_dRkSOBWnnHGXXr4h_=r+Jsbsb9J4PsOQUuoDq7eAAGpn_A@mail.gmail.com>
 <52E1844D.3010507@childrens.harvard.edu>
In-Reply-To: <52E1844D.3010507@childrens.harvard.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit

The changes are now committed.

To train a model that can recognize newlines, the newlines in the training
data must be encoded with the <CR> or <LF> tags (or both).
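
A minimal sketch of that encoding step, in case it helps. This is not code
from OpenNLP: the class and method names are made up for illustration, and it
assumes the tags are written literally as <CR> and <LF> in a
one-sentence-per-line training file.

```java
public class NewlineTagEncoder {

    // Replace raw carriage returns and line feeds with the <CR> and <LF> tags,
    // so that a sentence spanning several lines fits on one training line.
    public static String encode(String sentence) {
        return sentence.replace("\r", "<CR>").replace("\n", "<LF>");
    }

    public static void main(String[] args) {
        // Hypothetical sample sentence that wraps across three lines.
        String sentence = "The patient has\nbeen prescribed two\nmedications.";
        System.out.println(encode(sentence));
    }
}
```

For the sample sentence in main this prints
"The patient has<LF>been prescribed two<LF>medications." on a single line.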

The same tags are used to pass the end-of-sentence (eos) characters to the
command line trainer. For example:
SentenceDetectorCrossValidator  -lang en -data /home/xyz/eos-cr.all 
-encoding ISO-8859-15 -eosChars .!?:<LF>
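
Depending on the shell, the -eosChars value may need quoting (for example
'.!?:<LF>' in single quotes), since an unquoted < or > is normally read as a
redirection. To illustrate how such a value maps back to real characters,
here is a small sketch. It is only an assumption about the mapping, not how
the OpenNLP command line tool parses the option, and EosCharsDecoder is a
hypothetical name.

```java
public class EosCharsDecoder {

    // Turn an -eosChars style argument into the actual characters:
    // the <CR> and <LF> tags become '\r' and '\n', everything else is kept.
    public static char[] decode(String eosChars) {
        return eosChars.replace("<CR>", "\r").replace("<LF>", "\n").toCharArray();
    }

    public static void main(String[] args) {
        char[] eos = decode(".!?:<LF>");
        // Four punctuation characters plus the line feed.
        System.out.println(eos.length);
    }
}
```

With the value from the example above, the decoded array holds five
characters: '.', '!', '?', ':' and the line feed.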

Tim, it would be nice if you could test this with your annotations.

Jörn

On 01/23/2014 10:06 PM, Tim Miller wrote:
> Just an FYI, a while back I did some of these annotations myself on 
> MIMIC to get around this issue. I replaced the newline character with 
> a special (non-English) character, then pre-processed ctakes input to 
> replace newlines with that character, then did sentence detection, 
> then added the newlines back in. I would be happy to share these 
> annotations and my code modifications.
> Tim
>
>
> On 01/23/2014 04:01 PM, Karthik Sarma wrote:
>> We could possibly add some additional datasets for training. MIMIC data
>> does come to mind -- I can't remember off the top of my head if the 
>> MIMIC
>> dataset has sentences spanning lines or not.
>>
>>
>>
>>
>>
>> -- 
>> Karthik Sarma
>> UCLA Medical Scientist Training Program Class of 20??
>> Member, UCLA Medical Imaging & Informatics Lab
>> Member, CA Delegation to the House of Delegates of the American Medical
>> Association
>> ksarma@ksarma.com
>> gchat: ksarma@gmail.com
>> linkedin: www.linkedin.com/in/ksarma
>>
>>
>> On Thu, Jan 23, 2014 at 4:22 AM, vijay garla <vngarla@gmail.com> wrote:
>>
>>> Just to clarify - with the YTEX branch there are 2 sentence splitters:
>>> the original ctakes sentence splitter that splits on newlines, and the
>>> ytex sentence splitter that doesn't.  The changes to other components in
>>> the ytex branch (dependency parser, assertion) work with both sentence
>>> splitters.
>>>
>>> I think it would be great if the intelligence regarding how to split 
>>> was in
>>> the opennlp model, but this requires training data.  I don't know 
>>> what the
>>> training data is, or if the training data has sentences that cross 
>>> newline
>>> boundaries (if not, won't buy us anything).
>>>
>>> vijay
>>>
>>>
>>>
>>>
>>> On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean <
>>> Sean.Finan@childrens.harvard.edu> wrote:
>>>
>>>> On my end it looks like my email was reformatted and some of my
>>>> -newline- characters were removed in those last examples ...
>>>>
>>>> -----Original Message-----
>>>> From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
>>>> Sent: Wednesday, January 22, 2014 3:42 PM
>>>> To: dev@ctakes.apache.org
>>>> Subject: RE: sentence detector newline behavior
>>>>
>>>> Thanks James
>>>>
>>>>> but then no typical sentence ending punctuation at the end of the 
>>>>> line
>>>> Gotcha.
>>>>
>>>>> So simply using Lines would not suffice in those cases because it
>>>>> would run together sentences where there are more than one on a line
>>>> I was actually thinking about something like a Line using -sentence
>>>> breaks- in addition to -newline-.  In other words, a Sentence being 
>>>> what
>>>> cTakes detects by ignoring CR/LF, and Lines being those Sentences
>>>> subdivided by -newline-.  Perhaps "Line" is a horrible moniker.
>>>> Regardless, it doesn't solve the problem of inappropriately missing
>>>> punctuation.  I was focused a little more on the difference between
>>>> persistent auto- line wrapping and structured information like lists,
>>> where
>>>> the first benefits from Sentence and the second from Line.
>>>>
>>>> "The Patient has
>>>>   been prescribed two
>>>>   medications."
>>>>
>>>> "Prescriptions:
>>>>    Advil
>>>>    Tylenol
>>>>    No Aspirin"
>>>>
>>>>
>>>> However, when it comes to the problem that you mention, there is no
>>>> benefit to a Line.
>>>>
>>>> "The patient has been seen six times in the past week.  Pain has been
>>>> persistent for ten days Advil and Tylenol have been prescribed"
>>>> -- 2 sentences, 3 lines
>>>>
>>>>
>>>> "The patient has been seen six times in the past week.
>>>> Pain has been persistent for ten days
>>>> Advil and Tylenol have been prescribed"
>>>> -- 2 sentences, 3 lines
>>>>
>>>> "The patient has been seen six times in
>>>>   the past week.  Pain has been persistent  for ten days Advil and
>>> Tylenol
>>>> have been prescribed"
>>>> -- 2 sentences, 5 lines
>>>>
>>>> Nothing can really be done for the last bit where punctuation is 
>>>> missing.
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
>>>> Sent: Wednesday, January 22, 2014 3:07 PM
>>>> To: 'dev@ctakes.apache.org'
>>>> Subject: RE: sentence detector newline behavior
>>>>
>>>>
>>>> I know there are notes where there are multiple sentences on a 
>>>> line, but
>>>> then no typical sentence ending punctuation at the end of the line 
>>>> (or no
>>>> punctuation at all at the end of the line). And in those sections,
>>> negation
>>>> can be important.  So simply using Lines would not suffice in those 
>>>> cases
>>>> because it would run together sentences where there are more than 
>>>> one on
>>> a
>>>> line. And using sentences alone (as found by OpenNLP 1.5) would not
>>> suffice
>>>> because it would run together sentences from different lines.
>>>>
>>>> -----Original Message-----
>>>> From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
>>>> Sent: Wednesday, January 22, 2014 1:33 PM
>>>> To: dev@ctakes.apache.org
>>>> Subject: RE: sentence detector newline behavior
>>>>
>>>> Just whistling in the wind here ...
>>>>
>>>> Perhaps before any changes are made to universally toggle cTakes in 
>>>> one
>>>> direction or the other, we can take a poll of when & where
>>>> cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed
>>> to a
>>>> Line (CR/LF delimited PLUS -sentence-)
>>>>
>>>> If some capabilities like negation detection require -lines- then 
>>>> would
>>> it
>>>> make more sense to have Sentence ignore -newline- and negation 
>>>> detection
>>>> itself split the Sentence into line items?  If an annotator is 
>>>> interested
>>>> in list items, each of which may be on a distinct -line-, then it can
>>> split
>>>> up the Sentence as needed.  I think that James hints that cTakes code
>>>> already does this in some places.
>>>>
>>>> If a good deal of functionality requires -newline- delimited types, 
>>>> would
>>>> it make sense to introduce a type Line?  If something uses a 
>>>> structured
>>>> list it could iterate through Line types, while something using 
>>>> pure text
>>>> could iterate through Sentence types.  This facilitates
>>> section-by-section
>>>> different behavior, does not require any decision on global 
>>>> defaults, and
>>>> makes data selection for training Sentence a non-issue wrt line breaks.
>>>>   However, it adds to the system and would require a per-use choice
>>> decision
>>>> by developers OR a toggle by users (back to the default decision).
>>>> Perhaps this has already been tried?
>>>>
>>>> Sean
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
>>>> Sent: Wednesday, January 22, 2014 1:06 PM
>>>> To: 'dev@ctakes.apache.org'
>>>> Subject: RE: sentence detector newline behavior
>>>>
>>>> The only rule I know of is that cTAKES (prior to ytex integration) 
>>>> always
>>>> forces a sentence break at a newline.
>>>> This was because the clinical notes cTAKES originally processed never had
>>>> newlines in the middle of a sentence, but did need sentence breaks to occur
>>>> at end of sentence for good negation detection on those notes.
>>>> I think Guergana earlier mentioned other EMRs also have this need, 
>>>> but it
>>>> seems to not be ubiquitous.
>>>>
>>>>  From others' posts, it seems that we could use an option in cTAKES to
>>> turn
>>>> off this forcing of sentence breaks at newlines (or depending on 
>>>> how you
>>>> look at it, an option to turn on the forcing of sentence breaks if we
>>>> change the default behavior)
>>>>
>>>> I think we (cTAKES) need to decide the following:
>>>>   - do we want to do this for entire notes, or would it be worth it to
>>>> have it be on a section-by-section basis.
>>>>   - what do we make the default behavior - to force or not to force
>>>> newlines to be sentence breaks
>>>>   - what data (that contains newlines) will we use for training the
>>>> sentence detector
>>>>
>>>> Regardless of those answers, I think OpenNLP support for including
>>>> newlines in training data would be valuable for those others who have
>>>> sentences that span lines.  And having an option on OpenNLP to always
>>> break
>>>> at newline would be useful for at least some cTAKES users (and we 
>>>> could
>>>> remove the cTAKES code that does that)
>>>>
>>>> -- James
>>>>
>>>> -----Original Message-----
>>>> From: dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org [mailto:
>>>> dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of
>>>> Jörn Kottmann
>>>> Sent: Tuesday, January 21, 2014 4:29 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: sentence detector newline behavior
>>>>
>>>> Yes, exactly, OPENNLP-602 is about training a sentence detector model
>>>> which can use a new line as an end-of-sentence character.
>>>>
>>>> In case you have certain rules to split sentences we should have a 
>>>> look
>>> at
>>>> them. The Sentence Detector could be extended to support a user 
>>>> provided
>>>> rule based splitter. If there is an interest in that we could probably
>>> get
>>>> it into 1.6.0 as well.
>>>>
>>>> Jörn
>>>>
>>>> On 01/20/2014 10:02 PM, Chen, Pei wrote:
>>>>> I presume Joern was suggesting that if he supports new lines in the
>>>>> opennlp SentenceDetector (either as part of the trained models or post
>>>>> processing with some rules?) cTAKES will be able to use it out of the
>>>>> box and we should be able to remove any additional custom logic that we
>>>>> currently have - which seems like a good idea.
>>>>> [but when to use within cTAKES individual components such as negation
>>>>> might be another discussion?] --Pei
>>>>>
>>>>>> On Jan 20, 2014, at 12:46 PM, "vijay garla" <vngarla@gmail.com>
>>> wrote:
>>>>>> The sentence detection opennlp model used by ctakes does not split
>>>>>> sentences at newlines - there is additional logic in the ctakes
>>>>>> sentence splitter that does this (and an alternative impl that
>>>>>> doesn't is in the ytex branch). Afaik no retraining / change to the
>>>>>> feature representation is necessary.
>>>>>>
>>>>>> Vj
>>>>>>
>>>>>>> On Monday, January 20, 2014, Jörn Kottmann <kottmann@gmail.com>
>>> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> currently I have quite a bit of time to work on OpenNLP, and would
>>>>>>> like to help you out with this issue.
>>>>>>>
>>>>>>> Here is the follow up issue for this change:
>>>>>>> https://issues.apache.org/jira/browse/OPENNLP-602
>>>>>>>
>>>>>>> I am still trying to figure out what would be the best option to
>>>>>>> implement this.
>>>>>>> In the training data a user could just use a special tag to 
>>>>>>> identify
>>>>>>> the chars.
>>>>>>>
>>>>>>> Instead of <NEWLINE> it might be better to use <CR> and <LF> to
>>>>>>> encode these two chars in the training data. Any thoughts?
>>>>>>>
>>>>>>> I am planning to release this as part of OpenNLP 1.6.0.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jörn
>>>>>>>
>>>>>>>> On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
>>>>>>>>
>>>>>>>>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>>>>>>>>>
>>>>>>>>> That's awesome! It might be worth trying at least. How does the
>>>>>>>>> training process change? Previously the training data would be 
>>>>>>>>> one
>>>>>>>>> sentence per line, but with newlines as possible mid-sentence
>>>>>>>>> characters that could be trouble, is there a new representation
>>>>>>>>> for training data? Or would we have to use the training api?
>>>>>>>> Good point, yes that will be a problem with the default training
>>>>>>>> format, but it shouldn't be hard to solve. In the format itself we
>>>>>>>> could define a new line tag e.g.
>>>>>>>> <NEWLINE> to mark new lines.
>>>>>>>> As a hack to make it work with 1.5.3 you could instead use a
>>>>>>>> special char as a replacement for the new line char.
>>>>>>>> When you pass the text down to the sentence detector a simple
>>>>>>>> string replace could be used to convert all new line chars to the
>>>>>>>> special new line marker char.
>>>>>>>>
>>>>>>>> If things work out for you performance wise as well we will just
>>>>>>>> integrate it properly into OpenNLP for the next release.
>>>>>>>>
>>>>>>>> Could you produce a sentence detector training file with a new 
>>>>>>>> line
>>>>>>>> marker char?
>>>>>>>>
>>>>>>>> You should try to pick a char you can also pass in on a terminal,
>>>>>>>> otherwise you have to use the API to train the model. The built-in
>>>>>>>> cross validation could be used to evaluate the performance.
>>>>>>>>
>>>>>>>> Jörn
>>>>
>