Date: Wed, 22 May 2013 14:03:00 +0200
From: Jörn Kottmann
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

On 05/22/2013 01:17 PM, Miller, Timothy wrote:
> That's awesome! It might be worth trying at least. How does the training
> process change? Previously the training data would be one sentence per
> line, but with newlines as possible mid-sentence characters that could
> be trouble, is there a new representation for training data? Or would we
> have to use the training api?

Good point, yes, that will be a problem with the default training format, but it shouldn't be hard to solve. In the format itself we could define a new-line tag to mark new lines.

As a hack to make it work with 1.5.3, you could instead use a special char as a replacement for the new line char. When you pass the text down to the sentence detector, a simple string replace could be used to convert all new line chars to the special new-line marker char.

If things work out for you performance-wise as well, we will just integrate it properly into OpenNLP for the next release.
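A minimal sketch of the string-replace hack described above, in plain Java. The class name `NewlineMarker`, its two methods, and the choice of marker char are hypothetical and not part of OpenNLP; the sketch only assumes that the same marker is used for the training data and for the text passed to the sentence detector.

```java
// Sketch only: NewlineMarker and its methods are hypothetical helpers,
// not part of OpenNLP or cTAKES.
final class NewlineMarker {

    private NewlineMarker() {
    }

    // Replace every '\n' with the marker char. The replacement is one char
    // for one char, so character offsets in the result match the original.
    static String encode(String text, char marker) {
        if (marker == '\n') {
            throw new IllegalArgumentException("marker must differ from the new line char");
        }
        if (text.indexOf(marker) >= 0) {
            // A marker that already occurs in the text could not be told
            // apart from a replaced new line when decoding.
            throw new IllegalArgumentException("marker char already occurs in the text");
        }
        return text.replace('\n', marker);
    }

    // Inverse of encode: turn marker chars back into new line chars.
    static String decode(String text, char marker) {
        return text.replace(marker, '\n');
    }
}
```

With `'~'` as an example marker, `NewlineMarker.encode("a\nb", '~')` gives `"a~b"`. Because the encoded text has the same length as the original, sentence offsets found on the encoded text can be applied to the original text directly. Carriage returns (`'\r'`) are left untouched in this sketch.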
Could you produce a sentence detector training file with a new-line marker char? You should try to pick a char you can also pass in on a terminal; otherwise you have to use the API to train the model. The built-in cross validation could be used to evaluate the performance.

Jörn