Date: Mon, 27 Jan 2014 18:03:21 -0500
Subject: Re: sentence detector newline behavior
From: vijay garla
To: "dev@ctakes.apache.org"

For clarity, I'd like to stress that the opennlp sentence model distributed
with ctakes today does 'work' with sentences that span newlines - as I
understand it, this model ignores newline tokens (or newlines are not
provided as features to that model). I believe the improvements Tim and
others are suggesting are for a new sentence model + feature representation
that takes advantage of newlines as features.

Whatever we do, I believe we need backwards compatibility - those who are
using the current sentence model may need to continue using it. To that end:

* If we upgrade to the newest version of opennlp, will the old model work
  (and produce the same results)?
* If a contributor trains a new model that uses a different feature
  representation, I believe that should go into a new Sentence Detector
  AnalysisEngine (or the same AE but with different configuration
  parameters), so users have a choice between the old and the new.

-vj

On Mon, Jan 27, 2014 at 1:09 PM, digital paula wrote:

> Tim,
>
> I just had to chime in on a comment you made. My deadline has been
> extended a bit on my pressing issue but I do intend to get back to testing
> per VJ's fix or maybe another fix is in the works based on latest
> emails...I need to read them again since a lot has been stated on the issue.
>
> Okay, as a new user (working w/cTAKES since October) I have never thought
> what you had stated:
>
> "And I think this is the kind of thing that can leave new users
> scratching their heads and doubting our overall competence."
>
> Yeah, the sentence-spanning-newline issue was a problem so I just brought
> attention to it by my post of inquiry earlier this month on VJ's fix from
> last month and worked around it with treating narrative as one string.
>
> Anyone who's looked at the code would appreciate and acknowledge that
> cTAKES is a powerful and complex application. I'm overall impressed with
> it and I intend to continue to use it, improve it, and grow with it. I've
> been delving deeper into cTAKES on the machine learning aspect...I'm
> struggling a bit with it and if anything I scratch my head and doubt my
> competence. ;-)
>
> Regards,
> Paula
>
> > Date: Mon, 27 Jan 2014 09:52:00 -0500
> > From: timothy.miller@childrens.harvard.edu
> > To: dev@ctakes.apache.org
> > Subject: Re: sentence detector newline behavior
> >
> > OK, with the most recent version I am able to replicate the performance
> > I was getting before. Thanks a lot Jörn!
> >
> > Assuming this is in the next incremental release of opennlp, how quickly
> > can we get a re-trained model into cTAKES? I heard from a researcher at
> > AMIA who tried cTAKES and because of this bug in the way we handle
> > sentences was trying to find an outside sentence detector as a
> > preprocess to cTAKES, and frankly that is insane. We should be able to
> > get something this simple right.
> > And I think this is the kind of thing
> > that can leave new users scratching their heads and doubting our overall
> > competence.
> >
> > James, I believe you are usually the one who rebuilds the models? What
> > would be the best way to incorporate the data I have that has some
> > instances of non-sentence terminating newlines?
> >
> > Tim
> >
> > On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
> > > On 01/26/2014 11:29 PM, Miller, Timothy wrote:
> > >> Yes, this fixes the whitespace sentence issue but the evaluation issue
> > >> remains. I believe the problem is in SentenceSampleStream, where in the
> > >> following block the whitespace trim happens before the newline escape
> > >> tag is replaced with the \n character. So test sentences that ended
> > >> with that tag will be one character longer than they should be.
> > >>
> > >>> > sentence = sentence.trim();
> > >>> > sentence = replaceNewLineEscapeTags(sentence);
> > >>> > sentencesString.append(sentence);
> > >>> > int end = sentencesString.length();
> > >>> > sentenceSpans.add(new Span(begin, end));
> > >>> > sentencesString.append(' ');
> > >
> > > Yes, that must be the issue. During training the new line is included
> > > in the span, and during
> > > detection the white space remover creates a span without the new line
> > > char.
> > >
> > > I suggest that the evaluator just ignores white space differences
> > > between sentences. My test case then
> > > has the expected performance numbers.
> > >
> > > What do you think?
> > >
> > > Anyway, I committed the change. Please give it a try.
> > >
> > > Jörn
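
The off-by-one discussed in the quoted thread can be reproduced in isolation. What follows is a minimal sketch, not the OpenNLP source: the class TrimOrderDemo, the literal tag "<LF>", and every method below are assumptions made for illustration (the tag's actual spelling is not visible in the quoted message). It compares the trim-then-replace order from the quoted block with the swapped order, and shows one possible reading of the suggestion that the evaluator ignore white space differences between sentence spans.

```java
public class TrimOrderDemo {

    // Assumption: "<LF>" stands in for the newline escape tag, whose
    // spelling is not visible in the quoted message.
    public static final String NEWLINE_TAG = "<LF>";

    // Hypothetical stand-in for the replaceNewLineEscapeTags call quoted
    // in the thread; the real method may handle more tags.
    public static String replaceNewLineEscapeTags(String s) {
        return s.replace(NEWLINE_TAG, "\n");
    }

    // Order quoted in the thread: trim first, then replace the tag.
    // trim() cannot remove the tag, so the resulting '\n' stays in the span.
    public static int spanLengthTrimFirst(String line) {
        String sentence = line.trim();
        sentence = replaceNewLineEscapeTags(sentence);
        return sentence.length();
    }

    // Swapped order: replace the tag first, then trim.
    // The trailing '\n' is whitespace at this point and is removed.
    public static int spanLengthReplaceFirst(String line) {
        String sentence = replaceNewLineEscapeTags(line);
        sentence = sentence.trim();
        return sentence.length();
    }

    // Move both ends of a span inward past any whitespace.
    public static int[] shrink(String text, int begin, int end) {
        while (begin < end && Character.isWhitespace(text.charAt(begin))) {
            begin++;
        }
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {begin, end};
    }

    // One possible reading of "ignore white space differences": two spans
    // match if they are identical after shrinking past edge whitespace.
    public static boolean sameIgnoringEdgeWhitespace(
            String text, int b1, int e1, int b2, int e2) {
        int[] a = shrink(text, b1, e1);
        int[] b = shrink(text, b2, e2);
        return a[0] == b[0] && a[1] == b[1];
    }

    public static void main(String[] args) {
        String line = "Patient seen today." + NEWLINE_TAG;
        // 20: the 19 visible characters plus the '\n' that survives.
        System.out.println(spanLengthTrimFirst(line));
        // 19: the '\n' is trimmed away.
        System.out.println(spanLengthReplaceFirst(line));

        String text = "Patient seen today.\n Next.";
        // true: spans [0,20) and [0,19) differ only by the trailing '\n'.
        System.out.println(sameIgnoringEdgeWhitespace(text, 0, 20, 0, 19));
    }
}
```

In this sketch either change removes the mismatch: swapping the order fixes the reference span, while the whitespace-tolerant comparison leaves the spans alone and relaxes the match. The thread only states that the evaluator was changed; how that change is implemented is not shown there.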