From: Jörn Kottmann
Date: Fri, 24 Jan 2014 22:14:04 +0100
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior
Message-ID: <52E2D79C.60101@gmail.com>
In-Reply-To: <52E1844D.3010507@childrens.harvard.edu>

The changes are now committed.

To train a model which can recognize new lines the new lines must be
encoded with the or tags (or both). The same tags are used to pass in
the eos chars to the command line trainer.

For example:
SentenceDetectorCrossValidator -lang en -data /home/xyz/eos-cr.all -encoding ISO-8859-15 -eosChars .!?:

Tim, it would be nice if you could test this with your annotations.

Jörn

On 01/23/2014 10:06 PM, Tim Miller wrote:
> Just an FYI, a while back I did some of these annotations myself on
> MIMIC to get around this issue. I replaced the newline character with
> a special (non-English) character, then pre-processed ctakes input to
> replace newlines with that character, then did sentence detection,
> then added the newlines back in.
> I would be happy to share these annotations and my code modifications.
> Tim
>
> On 01/23/2014 04:01 PM, Karthik Sarma wrote:
>> We could possibly add some additional datasets for training. MIMIC data
>> does come to mind -- I can't remember off the top of my head if the
>> MIMIC dataset has sentences spanning lines or not.
>>
>> --
>> Karthik Sarma
>> UCLA Medical Scientist Training Program Class of 20??
>> Member, UCLA Medical Imaging & Informatics Lab
>> Member, CA Delegation to the House of Delegates of the American Medical
>> Association
>> ksarma@ksarma.com
>> gchat: ksarma@gmail.com
>> linkedin: www.linkedin.com/in/ksarma
>>
>> On Thu, Jan 23, 2014 at 4:22 AM, vijay garla wrote:
>>
>>> Just to clarify - with the YTEX branch there are 2 sentence splitters -
>>> the original ctakes sentence splitter that splits on newlines, and the
>>> ytex sentence splitter that doesn't. The changes to other components in
>>> the ytex branch (dependency parser, assertion) work with both sentence
>>> splitters.
>>>
>>> I think it would be great if the intelligence regarding how to split
>>> was in the opennlp model, but this requires training data. I don't know
>>> what the training data is, or if the training data has sentences that
>>> cross newline boundaries (if not, it won't buy us anything).
>>>
>>> vijay
>>>
>>> On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean
>>> <Sean.Finan@childrens.harvard.edu> wrote:
>>>
>>>> On my end it looks like my email was reformatted and some of my
>>>> -newline- removed in those last examples ...
>>>>
>>>> -----Original Message-----
>>>> From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
>>>> Sent: Wednesday, January 22, 2014 3:42 PM
>>>> To: dev@ctakes.apache.org
>>>> Subject: RE: sentence detector newline behavior
>>>>
>>>> Thanks James
>>>>
>>>>> but then no typical sentence ending punctuation at the end of the
>>>>> line
>>>> Gotcha.
>>>>
>>>>> So simply using Lines would not suffice in those cases because it
>>>>> would run together sentences where there are more than one on a line
>>>> I was actually thinking about something like a Line using -sentence
>>>> breaks- in addition to -newline-. In other words, a Sentence being
>>>> what cTakes detects by ignoring CR/LF, and Lines being those Sentences
>>>> subdivided by -newline-. Perhaps "Line" is a horrible moniker.
>>>> Regardless, it doesn't solve the problem of inappropriately missing
>>>> punctuation. I was focused a little more on the difference between
>>>> persistent auto-line wrapping and structured information like lists,
>>>> where the first benefits from Sentence and the second from Line.
>>>>
>>>> "The Patient has
>>>> been prescribed two
>>>> medications."
>>>>
>>>> "Prescriptions:
>>>> Advil
>>>> Tylenol
>>>> No Aspirin"
>>>>
>>>> However, when it comes to the problem that you mention, there is no
>>>> benefit to a Line.
>>>>
>>>> "The patient has been seen six times in the past week. Pain has been
>>>> persistent for ten days Advil and Tylenol have been prescribed"
>>>> -- 2 sentences, 3 lines
>>>>
>>>> "The patient has been seen six times in the past week.
>>>> Pain has been persistent for ten days
>>>> Advil and Tylenol have been prescribed"
>>>> -- 2 sentences, 3 lines
>>>>
>>>> "The patient has been seen six times in
>>>> the past week. Pain has been persistent for ten days Advil and Tylenol
>>>> have been prescribed"
>>>> -- 2 sentences, 5 lines
>>>>
>>>> Nothing can really be done for the last bit where punctuation is
>>>> missing.
>>>>
>>>> -----Original Message-----
>>>> From: Masanz, James J.
>>>> [mailto:Masanz.James@mayo.edu]
>>>> Sent: Wednesday, January 22, 2014 3:07 PM
>>>> To: 'dev@ctakes.apache.org'
>>>> Subject: RE: sentence detector newline behavior
>>>>
>>>> I know there are notes where there are multiple sentences on a line,
>>>> but then no typical sentence ending punctuation at the end of the line
>>>> (or no punctuation at all at the end of the line). And in those
>>>> sections, negation can be important. So simply using Lines would not
>>>> suffice in those cases because it would run together sentences where
>>>> there are more than one on a line. And using sentences alone (as found
>>>> by OpenNLP 1.5) would not suffice because it would run together
>>>> sentences from different lines.
>>>>
>>>> -----Original Message-----
>>>> From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
>>>> Sent: Wednesday, January 22, 2014 1:33 PM
>>>> To: dev@ctakes.apache.org
>>>> Subject: RE: sentence detector newline behavior
>>>>
>>>> Just whistling in the wind here ...
>>>>
>>>> Perhaps before any changes are made to universally toggle cTakes in
>>>> one direction or the other, we can take a poll of when & where
>>>> cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed
>>>> to a Line (CR/LF delimited PLUS -sentence-)
>>>>
>>>> If some capabilities like negation detection require -lines- then
>>>> would it make more sense to have Sentence ignore -newline- and
>>>> negation detection itself split the Sentence into line items? If an
>>>> annotator is interested in list items, each of which may be on a
>>>> distinct -line-, then it can split up the Sentence as needed. I think
>>>> that James hints that cTakes code already does this in some places.
>>>>
>>>> If a good deal of functionality requires -newline- delimited types,
>>>> would it make sense to introduce a type Line?
>>>> If something uses a structured list it could iterate through Line
>>>> types, while something using pure text could iterate through Sentence
>>>> types. This facilitates section-by-section different behavior, does
>>>> not require any decision on global defaults, and makes data selection
>>>> for training Sentence a non-issue wrt line breaks. However, it adds to
>>>> the system and would require a per-use choice decision by developers
>>>> OR a toggle by users (back to the default decision).
>>>> Perhaps this has already been tried?
>>>>
>>>> Sean
>>>>
>>>> -----Original Message-----
>>>> From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
>>>> Sent: Wednesday, January 22, 2014 1:06 PM
>>>> To: 'dev@ctakes.apache.org'
>>>> Subject: RE: sentence detector newline behavior
>>>>
>>>> The only rule I know of is that cTAKES (prior to ytex integration)
>>>> always forces a sentence break at a newline.
>>>> This was because the clinical notes cTAKES originally processed never
>>>> had newlines in the middle of a sentence, but did need sentence breaks
>>>> to occur at end of sentence for good negation detection on those
>>>> notes.
>>>> I think Guergana earlier mentioned other EMRs also have this need, but
>>>> it seems to not be ubiquitous.
>>>>
>>>> From others' posts, it seems that we could use an option in cTAKES to
>>>> turn off this forcing of sentence breaks at newlines (or depending on
>>>> how you look at it, an option to turn on the forcing of sentence
>>>> breaks if we change the default behavior)
>>>>
>>>> I think we (cTAKES) need to decide the following:
>>>> - do we want to do this for entire notes, or would it be worth it to
>>>> have it be on a section-by-section basis.
>>>> - what do we make the default behavior - to force or not to force
>>>> newlines to be sentence breaks
>>>> - what data (that contains newlines) will we use for training the
>>>> sentence detector
>>>>
>>>> Regardless of those answers, I think OpenNLP support for including
>>>> newlines in training data would be valuable for those others who have
>>>> sentences that span lines. And having an option on OpenNLP to always
>>>> break at newline would be useful for at least some cTAKES users (and
>>>> we could remove the cTAKES code that does that)
>>>>
>>>> -- James
>>>>
>>>> -----Original Message-----
>>>> From: dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org [mailto:
>>>> dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of
>>>> Jörn Kottmann
>>>> Sent: Tuesday, January 21, 2014 4:29 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: sentence detector newline behavior
>>>>
>>>> Yes, exactly, OPENNLP-602 is about training a sentence detector model
>>>> which can use a new line as an end-of-sentence character.
>>>>
>>>> In case you have certain rules to split sentences we should have a
>>>> look at them. The Sentence Detector could be extended to support a
>>>> user provided rule based splitter. If there is an interest in that we
>>>> could probably get it into 1.6.0 as well.
>>>>
>>>> Jörn
>>>>
>>>> On 01/20/2014 10:02 PM, Chen, Pei wrote:
>>>>> I presume Joern was suggesting that if he supports new lines in the
>>>>> opennlp SentenceDetector (either part of the trained models or post
>>>>> processing with some rules?) cTAKES will be able to use it out of
>>>>> the box and we should be able to remove any additional custom logic
>>>>> that we currently have - which seems like a good idea.
>>>>> [but when to use within cTAKES individual components such as negation
>>>>> might be another discussion?]
>>>>> --Pei
>>>>>
>>>>>> On Jan 20, 2014, at 12:46 PM, "vijay garla" wrote:
>>>>>> The sentence detection opennlp model used by ctakes does not split
>>>>>> sentences at newlines - there is additional logic in the ctakes
>>>>>> sentence splitter that does this (and an alternative impl that
>>>>>> doesn't is in the ytex branch). Afaik no retraining / change to the
>>>>>> feature representation is necessary.
>>>>>>
>>>>>> Vj
>>>>>>
>>>>>>> On Monday, January 20, 2014, Jörn Kottmann wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> currently I have quite a bit of time to work on OpenNLP, and would
>>>>>>> like to help you out with this issue.
>>>>>>>
>>>>>>> Here is the follow up issue for this change:
>>>>>>> https://issues.apache.org/jira/browse/OPENNLP-602
>>>>>>>
>>>>>>> I am still trying to figure out what would be the best option to
>>>>>>> implement this.
>>>>>>> In the training data a user could just use a special tag to
>>>>>>> identify the chars.
>>>>>>>
>>>>>>> Instead of it might be better to use and to
>>>>>>> encode these two chars in the training data. Any thoughts?
>>>>>>>
>>>>>>> I am planning to release this as part of OpenNLP 1.6.0.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jörn
>>>>>>>
>>>>>>>> On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
>>>>>>>>
>>>>>>>>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>>>>>>>>>
>>>>>>>>> That's awesome! It might be worth trying at least. How does the
>>>>>>>>> training process change? Previously the training data would be
>>>>>>>>> one sentence per line, but with newlines as possible mid-sentence
>>>>>>>>> characters that could be trouble, is there a new representation
>>>>>>>>> for training data? Or would we have to use the training api?
>>>>>>>> Good point, yes that will be a problem with the default training
>>>>>>>> format, but it shouldn't be hard to solve. In the format itself we
>>>>>>>> could define a new line tag e.g.
>>>>>>>> to mark new lines.
>>>>>>>> As a hack to make it work with 1.5.3 you could instead use a
>>>>>>>> special char as a replacement for the new line char.
>>>>>>>> When you pass the text down to the sentence detector a simple
>>>>>>>> string replace could be used to convert all new line chars to the
>>>>>>>> special new line marker char.
>>>>>>>>
>>>>>>>> If things work out for you performance wise as well we will just
>>>>>>>> integrate it properly into OpenNLP for the next release.
>>>>>>>>
>>>>>>>> Could you produce a sentence detector training file with a new
>>>>>>>> line marker char?
>>>>>>>>
>>>>>>>> You should try to pick a char you can also pass in on a terminal,
>>>>>>>> otherwise you have to use the API to train the model. The built-in
>>>>>>>> cross validation could be used to evaluate the performance.
>>>>>>>>
>>>>>>>> Jörn
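
The marker-character workaround discussed in this thread (replace each
newline with a special character, run sentence detection, then put the
newlines back) could be sketched roughly as follows. This is only an
illustration under stated assumptions, not the code used in cTAKES or
OpenNLP: NEWLINE_MARKER, encode_newlines, decode_newlines,
detect_sentences and toy_detector are hypothetical names, the pilcrow is
an arbitrary choice of marker, and the regex-based toy_detector merely
stands in for a trained sentence detector model.

```python
import re
from typing import Callable, List

# Hypothetical marker: any single character that never occurs in the notes
# and that can be typed on a terminal. The pilcrow is an assumption.
NEWLINE_MARKER = "\u00b6"


def encode_newlines(text: str, marker: str = NEWLINE_MARKER) -> str:
    """Replace every line break with the marker before sentence detection."""
    if marker in text:
        raise ValueError("marker already occurs in the text; pick another character")
    # Normalise CRLF and lone CR to LF first, so each break yields one marker.
    # The CR/LF distinction is therefore not preserved on the way back.
    return text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", marker)


def decode_newlines(text: str, marker: str = NEWLINE_MARKER) -> str:
    """Turn markers back into newlines after sentence detection."""
    return text.replace(marker, "\n")


def toy_detector(text: str, marker: str = NEWLINE_MARKER) -> List[str]:
    """Stand-in for a trained model: break only after '.', '!' or '?'.

    A marker on its own is not treated as a sentence end, so sentences
    that span several lines stay in one piece.
    """
    boundary = r"(?<=[.!?])[\s" + re.escape(marker) + r"]+"
    return [s for s in re.split(boundary, text) if s]


def detect_sentences(
    text: str, detector: Callable[[str], List[str]] = toy_detector
) -> List[str]:
    """Encode newlines, run the detector, then restore the newlines."""
    encoded = encode_newlines(text)
    return [decode_newlines(sentence) for sentence in detector(encoded)]


if __name__ == "__main__":
    note = "The Patient has\nbeen prescribed two\nmedications. Pain has been persistent."
    for sentence in detect_sentences(note):
        print(repr(sentence))
```

With a real model the detector argument would wrap whatever sentence
detector is in use; the encode and decode steps stay the same.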