Date: Sun, 26 Jan 2014 15:58:07 +0100
From: Jörn Kottmann
To: "dev@ctakes.apache.org"
Subject: Re: sentence detector newline behavior

On 01/25/2014 10:03 PM, Miller, Timothy wrote:
> On 01/25/2014 12:24 PM, Jörn Kottmann wrote:
>> The code which computes the spans tries to remove white space from them.
>> Removing the white space from a whitespace-only sentence is causing
>> the exception you are seeing. Which response would you expect from
>> the sentence detector? Should a whitespace-only sentence be returned?
> I would say no.
>
>> In case a sentence is terminated by a new line: should the new line
>> char be included in the sentence span or not?
> I would also say no.
>
>
> I made a quick patch for this issue -- now it runs but scores really
> poorly compared to my model file (30 vs 75 or so). I suspect something
> is wrong with the evaluation, the spans being slightly off somehow.

The evaluation should ignore white spaces. I have now committed my fix; it
would be nice if you could test it. There might still be something wrong. In
my test data I replaced all question marks with white spaces, and the result
is slightly worse than with the original data.

Jörn
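
To illustrate the behavior agreed on in this thread (no whitespace-only sentences, and no trailing new line inside a sentence span), here is a minimal sketch in Java. It is not the committed fix and not the OpenNLP or cTAKES code: the class `SentenceSpans` and its methods `trimSpan` and `splitOnNewlines` are hypothetical names made up for this example, and splitting at every new line is a simplifying assumption standing in for the real sentence detector.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not the OpenNLP or cTAKES implementation.
final class SentenceSpans {

    /**
     * Trims leading and trailing whitespace from the half-open range [start, end).
     * Returns {start, end} of the trimmed range, or null if only whitespace remains.
     */
    static int[] trimSpan(CharSequence text, int start, int end) {
        while (start < end && Character.isWhitespace(text.charAt(start))) {
            start++;
        }
        while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        // An empty range means the candidate sentence was whitespace only.
        return start < end ? new int[] {start, end} : null;
    }

    /**
     * Assumption for this sketch: every new line ends a candidate sentence.
     * Whitespace-only candidates are skipped instead of raising an error,
     * and the new line character itself never ends up inside a span.
     */
    static List<int[]> splitOnNewlines(CharSequence text) {
        List<int[]> spans = new ArrayList<>();
        int segmentStart = 0;
        for (int i = 0; i <= text.length(); i++) {
            if (i == text.length() || text.charAt(i) == '\n') {
                int[] span = trimSpan(text, segmentStart, i);
                if (span != null) {
                    spans.add(span);
                }
                segmentStart = i + 1;
            }
        }
        return spans;
    }
}
```

With the input "First line.\n   \nSecond line.  \n" this sketch yields two spans, [0, 11) and [16, 28); the whitespace-only middle line produces no span. An evaluation that compares spans after this kind of trimming would not penalize differences that consist only of surrounding white space.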