Date: Wed, 22 Jan 2014 18:04:54 +0000
From: "Masanz, James J."
Subject: RE: sentence detector newline behavior
To: dev@ctakes.apache.org

The only rule I know of is that cTAKES (prior to ytex integration) always forces a sentence break at a newline.

This was because the clinical notes cTAKES originally processed never had newlines in the middle of a sentence, but did need sentence breaks to occur at end of sentence for good negation detection on those notes.

I think Guergana earlier mentioned other EMRs also have this need, but it seems to not be ubiquitous.

From others' posts, it seems that we could use an option in cTAKES to turn off this forcing of sentence breaks at newlines (or, depending on how you look at it, an option to turn on the forcing of sentence breaks if we change the default behavior).

I think we (cTAKES) need to decide the following:
- do we want to do this for entire notes, or would it be worth it to have it be on a section-by-section basis?
- what do we make the default behavior: to force or not to force newlines to be sentence breaks?
- what data (that contains newlines) will we use for training the sentence detector?

Regardless of those answers, I think OpenNLP support for including newlines in training data would be valuable for those others who have sentences that span lines. And having an option on OpenNLP to always break at newline would be useful for at least some cTAKES users (and we could remove the cTAKES code that does that).

-- James

-----Original Message-----
From: dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Jörn Kottmann
Sent: Tuesday, January 21, 2014 4:29 AM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

Yes, exactly, OPENNLP-602 is about training a sentence detector model
which can use a new line as an end-of-sentence character.

In case you have certain rules to split sentences we should have a look
at them. The Sentence Detector could be extended to support a
user-provided rule-based splitter. If there is an interest in that we
could probably get it into 1.6.0 as well.

Jörn

On 01/20/2014 10:02 PM, Chen, Pei wrote:
> I presume Joern was suggesting that if he supports new lines in the opennlp SentenceDetector (either part of the trained models or post processing with some rules?) cTAKES will be able to use it out of the box and we should be able to remove any additional custom logic that we currently have, which seems like a good idea.
>
> [but when to use within cTAKES individual components such as negation might be another discussion?]
> --Pei
>
>> On Jan 20, 2014, at 12:46 PM, "vijay garla" wrote:
>>
>> The sentence detection opennlp model used by ctakes does not split
>> sentences at newlines - there is additional logic in the cTAKES sentence
>> splitter that does this (and an alternative impl that doesn't is in the
>> ytex branch).
>> Afaik no retraining / change to the feature representation is
>> necessary.
>>
>> Vj
>>
>>> On Monday, January 20, 2014, Jörn Kottmann wrote:
>>>
>>> Hi all,
>>>
>>> currently I have quite a bit of time to work on OpenNLP, and would like to
>>> help you
>>> out with this issue.
>>>
>>> Here is the follow up issue for this change:
>>> https://issues.apache.org/jira/browse/OPENNLP-602
>>>
>>> I am still trying to figure out what would be the best option to implement
>>> this.
>>> In the training data a user could just use a special tag to identify the
>>> chars.
>>>
>>> Instead of it might be better to use and to encode
>>> these two chars
>>> in the training data. Any thoughts?
>>>
>>> I am planning to release this as part of OpenNLP 1.6.0.
>>>
>>> Thanks,
>>> Jörn
>>>
>>>> On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
>>>>
>>>>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>>>>>
>>>>> That's awesome! It might be worth trying at least. How does the training
>>>>> process change? Previously the training data would be one sentence per
>>>>> line, but with newlines as possible mid-sentence characters that could
>>>>> be trouble, is there a new representation for training data? Or would we
>>>>> have to use the training api?
>>>> Good point, yes that will be a problem with the default training format,
>>>> but it shouldn't be hard
>>>> to solve. In the format itself we could define a new line tag e.g.
>>>> to mark new lines.
>>>> as a hack to make it work with 1.5.3 you could instead use a special char
>>>> as a replacement
>>>> for the new line char.
>>>> When you pass the text down to the sentence detector a simple string
>>>> replace could be used to
>>>> convert all new line chars to the special new line marker char.
>>>>
>>>> If things work out for you performance wise as well we will just
>>>> integrate it properly into OpenNLP
>>>> for the next release.
>>>>
>>>> Could you produce a sentence detector training file with a new line
>>>> marker char?
>>>>
>>>> You should try to pick a char you can also pass in on a terminal,
>>>> otherwise you have to use the
>>>> API to train the model. The built-in cross validation could be used to
>>>> evaluate the performance.
>>>>
>>>> Jörn
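
Two of the ideas in this thread are small enough to sketch: the "special marker char" replacement suggested as a workaround for OpenNLP 1.5.3, and an on/off option for forcing sentence breaks at newlines after detection. The Python below is only an illustration under stated assumptions. It is not cTAKES or OpenNLP code, every function name is hypothetical, the "|" marker is an arbitrary choice, and it assumes text with "\n" line endings and sentence spans given as (begin, end) character offsets from some detector.

```python
# Illustrative sketch only: names and the marker choice are assumptions,
# not part of cTAKES or OpenNLP.

NEWLINE_MARKER = "|"  # hypothetical marker; any char absent from the data works


def encode_newlines(text, marker=NEWLINE_MARKER):
    """Replace each newline with a single marker char (offsets stay the same)."""
    if len(marker) != 1:
        raise ValueError("marker must be a single character")
    if marker in text:
        raise ValueError("marker already occurs in the text; pick another one")
    return text.replace("\n", marker)


def decode_newlines(text, marker=NEWLINE_MARKER):
    """Undo encode_newlines."""
    return text.replace(marker, "\n")


def to_training_lines(text, gold_spans, marker=NEWLINE_MARKER):
    """Emit one sentence per line, with in-sentence newlines shown as the marker."""
    return [encode_newlines(text[begin:end], marker) for begin, end in gold_spans]


def split_spans_at_newlines(text, spans, force_breaks=True):
    """Optionally cut every detected sentence span at each newline it contains.

    With force_breaks=False the detector's spans are returned unchanged.
    Pieces that contain only whitespace are dropped.
    """
    if not force_breaks:
        return list(spans)
    result = []
    for begin, end in spans:
        start = begin
        for i in range(begin, end):
            if text[i] == "\n":
                if text[start:i].strip():
                    result.append((start, i))
                start = i + 1
        if text[start:end].strip():
            result.append((start, end))
    return result
```

Because the replacement is one character for one character, offsets returned for the encoded text can be applied directly to the original note. Whether the break forcing should be switched per note or per section is left open here, as it is in the discussion above.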