From: "Finan, Sean"
To: "dev@ctakes.apache.org"
Subject: RE: sentence detector newline behavior
Date: Wed, 22 Jan 2014 20:47:09 +0000
Message-ID: <393252F14C42F946952F1ED75D316CAD3865BE97@CHEXMBX2A.CHBOSTON.ORG>
In-Reply-To: <393252F14C42F946952F1ED75D316CAD3865AE84@CHEXMBX2A.CHBOSTON.ORG>
Content-Type: text/plain; charset="iso-8859-1"
X-Virus-Checked:
Checked by ClamAV on apache.org

On my end it looks like my email was reformatted and some of my -newline- removed in those last examples ...

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, January 22, 2014 3:42 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Thanks James

> but then no typical sentence ending punctuation at the end of the line

Gotcha.

> So simply using Lines would not suffice in those cases because it
> would run together sentences where there are more than one on a line

I was actually thinking about something like a Line using -sentence breaks- in addition to -newline-. In other words, a Sentence being what cTakes detects by ignoring CR/LF, and Lines being those Sentences subdivided by -newline-. Perhaps "Line" is a horrible moniker. Regardless, it doesn't solve the problem of inappropriately missing punctuation. I was focused a little more on the difference between persistent auto-line wrapping and structured information like lists, where the first benefits from Sentence and the second from Line.

"The Patient has been prescribed two medications."
"Prescriptions: Advil Tylenol No Aspirin"

However, when it comes to the problem that you mention, there is no benefit to a Line.

"The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed"
-- 2 sentences, 3 lines

"The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed"
-- 2 sentences, 3 lines

"The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed"
-- 2 sentences, 5 lines

Nothing can really be done for the last bit where punctuation is missing.

-----Original Message-----
From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
Sent: Wednesday, January 22, 2014 3:07 PM
To: 'dev@ctakes.apache.org'
Subject: RE: sentence detector newline behavior

I know there are notes where there are multiple sentences on a line, but then no typical sentence ending punctuation at the end of the line (or no punctuation at all at the end of the line). And in those sections, negation can be important. So simply using Lines would not suffice in those cases because it would run together sentences where there are more than one on a line. And using sentences alone (as found by OpenNLP 1.5) would not suffice because it would run together sentences from different lines.

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, January 22, 2014 1:33 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Just whistling in the wind here ...

Perhaps before any changes are made to universally toggle cTakes in one direction or the other, we can take a poll of when & where cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed to a Line (CR/LF delimited PLUS -sentence-)

If some capabilities like negation detection require -lines- then would it make more sense to have Sentence ignore -newline- and negation detection itself split the Sentence into line items? If an annotator is interested in list items, each of which may be on a distinct -line-, then it can split up the Sentence as needed. I think that James hints that cTakes code already does this in some places.

If a good deal of functionality requires -newline- delimited types, would it make sense to introduce a type Line? If something uses a structured list it could iterate through Line types, while something using pure text could iterate through Sentence types.
This facilitates section-by-section different behavior, does not require any decision on global defaults, and makes data selection for training Sentence a nonesuch wrt line breaks. However, it adds to the system and would require a per-use choice decision by developers OR a toggle by users (back to the default decision). Perhaps this has already been tried?

Sean

-----Original Message-----
From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
Sent: Wednesday, January 22, 2014 1:06 PM
To: 'dev@ctakes.apache.org'
Subject: RE: sentence detector newline behavior

The only rule I know of is that cTAKES (prior to ytex integration) always forces a sentence break at a newline.

This was because the clinical notes cTAKES originally processed never had newlines in the middle of a sentence, but did need sentence breaks to occur at end of sentence for good negation detection on those notes.

I think Guergana earlier mentioned other EMRs also have this need, but it seems to not be ubiquitous.

From others' posts, it seems that we could use an option in cTAKES to turn off this forcing of sentence breaks at newlines (or depending on how you look at it, an option to turn on the forcing of sentence breaks if we change the default behavior)

I think we (cTAKES) need to decide the following:
- do we want to do this for entire notes, or would it be worth it to have it be on a section-by-section basis.
- what do we make the default behavior - to force or not to force newlines to be sentence breaks
- what data (that contains newlines) will we use for training the sentence detector

Regardless of those answers, I think OpenNLP support for including newlines in training data would be valuable for those others who have sentences that span lines.
And having an option on OpenNLP to always break at newline would be useful for at least some cTAKES users (and we could remove the cTAKES code that does that)

-- James

-----Original Message-----
From: dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Jörn Kottmann
Sent: Tuesday, January 21, 2014 4:29 AM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

Yes, exactly, OPENNLP-602 is about training a sentence detector model which can use a new line as an end-of-sentence character.

In case you have certain rules to split sentences we should have a look at them. The Sentence Detector could be extended to support a user provided rule based splitter. If there is an interest in that we could probably get it into 1.6.0 as well.

Jörn

On 01/20/2014 10:02 PM, Chen, Pei wrote:
> I presume Joern was suggesting that if he supports new lines in the opennlp SentenceDetector (either part of the trained models or post processing with some rules?) cTAKES will be able to use it out of the box and we should be able remove any additional custom logic that we currently have- which seems like a good idea.
>
> [but when to use within cTAKES individual components such as negation
> might be another discussion?] --Pei
>
>> On Jan 20, 2014, at 12:46 PM, "vijay garla" wrote:
>>
>> The sentence detection opennlp model used by ctakes does not split
>> sentences at newlines - there is additional logic in the cTAKES
>> sentence splitter that does this (and an alternative impl that
>> doesn't is in the ytex branch). Afaik no retraining / change to the
>> feature representation is necessary.
>>
>> Vj
>>
>>> On Monday, January 20, 2014, Jörn Kottmann wrote:
>>>
>>> Hi all,
>>>
>>> currently I have quite a bit of time to work on OpenNLP, and would
>>> like to help you out with this issue.
>>>
>>> Here is the follow up issue for this change:
>>> https://issues.apache.org/jira/browse/OPENNLP-602
>>>
>>> I am still trying to figure out what would be the best option to
>>> implement this.
>>> In the training data a user could just use a special tag to identify
>>> the chars.
>>>
>>> Instead of it might be better to use and to
>>> encode these two chars in the training data. Any thoughts?
>>>
>>> I am planning to release this as part of OpenNLP 1.6.0.
>>>
>>> Thanks,
>>> Jörn
>>>
>>>> On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
>>>>
>>>>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>>>>>
>>>>> That's awesome! It might be worth trying at least. How does the
>>>>> training process change? Previously the training data would be one
>>>>> sentence per line, but with newlines as possible mid-sentence
>>>>> characters that could be trouble, is there a new representation
>>>>> for training data? Or would we have to use the training api?
>>>> Good point, yes that will be a problem with the default training
>>>> format, but it shouldn't be hard to solve. In the format itself we
>>>> could define a new line tag e.g.
>>>> to mark new lines.
>>>> as a hack to make it work with 1.5.3 you could instead use a
>>>> special char as a replacement for the new line char.
>>>> When you pass the text down to the sentence detector a simple
>>>> string replace could be used to convert all new line chars to the
>>>> special new line marker char.
>>>>
>>>> If things work out for you performance wise as well we will just
>>>> integrate it properly into OpenNLP for the next release.
>>>>
>>>> Could you produce a sentence detector training file with a new line
>>>> marker char?
>>>>
>>>> You should try to pick a char you can also pass in on a terminal
>>>> otherwise you have to use the API to train the model. The built-in
>>>> cross validation could be used to evaluate the performance.
>>>>
>>>> Jörn
>>>
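
The string-replace hack described in the 05/22/2013 message above can be sketched roughly as follows. This is only an illustration, written in Java on the assumption that it would sit next to cTAKES or OpenNLP code. The class name, the pilcrow marker char and the toyDetect method are all hypothetical: toyDetect is a stand-in for a model trained on marker-encoded data and does not call any OpenNLP API. The point it shows is that a one-char-for-one-char replacement keeps every offset valid against the original note.

```java
import java.util.ArrayList;
import java.util.List;

class NewlineMarkerSketch {
    // Hypothetical marker; pick any char that never occurs in your notes.
    static final char NEWLINE_MARKER = '\u00B6';

    // Swap every CR and LF for the marker. Each replacement is one char for
    // one char, so offsets in the marked text still index the original text.
    static String markNewlines(String text) {
        return text.replace('\r', NEWLINE_MARKER).replace('\n', NEWLINE_MARKER);
    }

    // Toy stand-in for a trained detector (NOT OpenNLP): ends a sentence
    // after '.' or at the marker, and returns [begin, end) offsets.
    static List<int[]> toyDetect(String marked) {
        List<int[]> spans = new ArrayList<>();
        int begin = 0;
        for (int i = 0; i < marked.length(); i++) {
            char c = marked.charAt(i);
            if (c == '.' || c == NEWLINE_MARKER) {
                int end = (c == '.') ? i + 1 : i;
                addTrimmed(marked, begin, end, spans);
                begin = i + 1;
            }
        }
        addTrimmed(marked, begin, marked.length(), spans);
        return spans;
    }

    // Drop leading/trailing whitespace and markers; skip empty spans.
    private static void addTrimmed(String s, int begin, int end, List<int[]> out) {
        while (begin < end && isSkippable(s.charAt(begin))) begin++;
        while (end > begin && isSkippable(s.charAt(end - 1))) end--;
        if (begin < end) out.add(new int[] {begin, end});
    }

    private static boolean isSkippable(char c) {
        return Character.isWhitespace(c) || c == NEWLINE_MARKER;
    }

    public static void main(String[] args) {
        String note = "Seen six times. Pain persistent\r\nAdvil prescribed";
        // Detect on the marked copy, then cut the spans out of the original.
        for (int[] s : toyDetect(markNewlines(note))) {
            System.out.println(note.substring(s[0], s[1]));
        }
    }
}
```

A CR/LF pair becomes two markers here rather than one, which is what keeps the lengths equal; whether a real model should see one marker or two per line break is a training-data decision the thread leaves open.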