Date: Mon, 27 Jan 2014 19:53:15 +0000
From: "Masanz, James J."
Subject: RE: sentence detector newline behavior
To: "'dev@ctakes.apache.org'"

I didn't write the cTAKES sentence detector so I can't answer definitively but I do know it was originally written using what is now a pretty old version of OpenNLP and needed some things you couldn't get from the out-of-the-box OpenNLP at the time. From what I remember the things specific to it were
- the list of end of sentence candidate characters
- and the handling of newlines

-- James

-----Original Message-----
From: Tim Miller [mailto:timothy.miller@childrens.harvard.edu]
Sent: Monday, January 27, 2014 1:45 PM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

On 01/27/2014 02:35 PM, Masanz, James J. wrote:
> Tim, is the training data something you can share publicly? Or privately? I can't publicly share the data that has been used to train the sentence detector, I can only share the models that get built. And you can't build a model from an existing model + more data, you need all the training data together.
It is from the MIMIC corpus which I definitely can't share publicly, but
it's worth looking into whether I could share it privately with another
person who has a signed data use agreement.

> Regarding how quickly we can get this out there, I can train a new sentence detector in a day or two. But that's just the first step - to really incorporate this, I would suggest this be a point release. We would need a release manager for that. Right now I don't have time for that. I haven't heard a consensus saying whether this should be the new behavior.

Yeah I suppose this is subject to the scale of the changes we make.

> From what I remember we are going to need code changes to make optional the code that splits at line breaks, or was your test replacing the existing cTAKES sentence detector and just using OpenNLP directly.

That is a good point, and something I was wondering about. Having now
looked at both the ctakes and opennlp code for the sentence splitter it
seems like there is a lot of overlap. I would've thought it was just a
matter of converting annotations into our type system. So I'm curious if
there is some justification for why there seems to be duplication (or if
I'm hallucinating it).

Tim

>
> -- James
>
> -----Original Message-----
> From: Tim Miller [mailto:timothy.miller@childrens.harvard.edu]
> Sent: Monday, January 27, 2014 8:52 AM
> To: dev@ctakes.apache.org
> Subject: Re: sentence detector newline behavior
>
> OK, with the most recent version I am able to replicate the performance
> I was getting before. Thanks a lot Jörn!
>
> Assuming this is in the next incremental release of opennlp, how quickly
> can we get a re-trained model into cTAKES? I heard from a researcher at
> AMIA who tried cTAKES and because of this bug in the way we handle
> sentences was trying to find an outside sentence detector as a
> preprocess to cTAKES, and frankly that is insane. We should be able to
> get something this simple right.
> And I think this is the kind of thing
> that can leave new users scratching their heads and doubting our overall
> competence.
>
> James, I believe you are usually the one who rebuilds the models? What
> would be the best way to incorporate the data I have that has some
> instances of non-sentence terminating newlines?
>
> Tim
>
>
> On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
>> On 01/26/2014 11:29 PM, Miller, Timothy wrote:
>>> Yes, this fixes the whitespace sentence issue but the evaluation issue
>>> remains. I believe the problem is in SentenceSampleStream, where in the
>>> following block the whitespace trim happens before the newline escape
>>> tag is replaced with the \n character. So test sentences that ended with
>>> that tag will be one character longer than they should be.
>>>
>>>>> sentence = sentence.trim();
>>>>> sentence = replaceNewLineEscapeTags(sentence);
>>>>> sentencesString.append(sentence);
>>>>> int end = sentencesString.length();
>>>>> sentenceSpans.add(new Span(begin, end));
>>>>> sentencesString.append(' ');
>> Yes, that must be the issue. During training the new line is included
>> in the span, and during
>> detection the white space remover creates a span without the new line
>> char.
>>
>> I suggest that the evaluator just ignores white space differences
>> between sentences. My test case then
>> has the expected performance numbers.
>>
>> What do you think?
>>
>> Anyway, I committed the change. Please give it a try.
>>
>> Jörn
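
Editorial illustration of the span mismatch discussed above. This is a minimal sketch, not the OpenNLP code or the committed fix. The class NewlineSpanSketch, its method names, and the "<LF>" tag spelling are assumptions made for the example. It shows that trimming before the escape tag is replaced leaves a trailing newline in the sentence (one character longer), and how a comparison that ignores whitespace at span edges, along the lines Jörn suggests for the evaluator, would treat the two spans as equal.

```java
import java.util.Arrays;

// Hypothetical sketch class; none of these names come from OpenNLP or cTAKES.
final class NewlineSpanSketch {

    // Assumed spelling of the line-feed escape tag, used only for illustration.
    static final String LF_TAG = "<LF>";

    // Stand-in for replaceNewLineEscapeTags: swaps the assumed tag for a real newline.
    public static String replaceTags(String s) {
        return s.replace(LF_TAG, "\n");
    }

    // Order in the quoted snippet: trim first, then replace.
    // A trailing tag is not whitespace, so it survives the trim and becomes '\n'.
    public static String trimThenReplace(String raw) {
        return replaceTags(raw.trim());
    }

    // Alternative order: replace first, then trim, which strips the trailing '\n'.
    public static String replaceThenTrim(String raw) {
        return replaceTags(raw).trim();
    }

    // Shrinks the half-open span [begin, end) so it excludes whitespace at both edges.
    public static int[] trimSpan(CharSequence text, int begin, int end) {
        while (begin < end && Character.isWhitespace(text.charAt(begin))) {
            begin++;
        }
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {begin, end};
    }

    // Compares two spans over the same text, ignoring whitespace at their edges.
    public static boolean sameIgnoringWhitespace(CharSequence text, int[] a, int[] b) {
        return Arrays.equals(trimSpan(text, a[0], a[1]), trimSpan(text, b[0], b[1]));
    }

    public static void main(String[] args) {
        String raw = "Patient is stable" + LF_TAG;
        int extra = trimThenReplace(raw).length() - replaceThenTrim(raw).length();
        System.out.println("extra characters with trim-then-replace: " + extra);

        // Reference span includes the newline, detected span stops before it.
        String text = "Patient is stable\nNext";
        int[] reference = {0, 18};
        int[] detected = {0, 17};
        System.out.println("exact match: " + Arrays.equals(reference, detected));
        System.out.println("match ignoring whitespace: "
                + sameIgnoringWhitespace(text, reference, detected));
    }
}
```

Normalizing spans at comparison time leaves the training data untouched, while swapping the trim and replace order would instead change which characters end up inside the training spans. Which of the two the thread's commit actually did is not stated here.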