From: "Miller, Timothy"
To: "dev@ctakes.apache.org"
Subject: sentence detector model
Date: Sat, 27 Sep 2014 13:56:52 +0000

I have been working on the sentence detector newline issue, training a model to
probabilistically split sentences on newlines rather than forcing sentence breaks. I have checked in a model to the repo under ctakes-core-res. I also attached a patch to ctakes-core to the jira issue: https://issues.apache.org/jira/browse/CTAKES-41 for people to test.

The status of my testing is that it doesn't seem to break on notes where ctakes worked well before (those where newlines are always sentence breaks), and is a slight improvement on notes where newlines may or may not be sentence breaks. Once the change is checked in we can continue improving the model by adding more data and features, but the first hurdle I'd like to get past is making sure it runs well enough on the type of data that the old model worked well on. Let me know if you have any questions.

Thanks
Tim
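
For readers unfamiliar with the idea, here is a minimal sketch of what "probabilistically splitting on newlines" can look like. It is not the cTAKES model or the patch attached to CTAKES-41: the features, the hand-set weights, and every function name below are hypothetical, chosen only to show a newline being scored instead of always being treated as a sentence break.

```python
import math

# Hypothetical hand-set weights; a real model would learn these from annotated notes.
WEIGHTS = {
    "bias": -0.5,
    "prev_ends_with_punct": 2.5,  # text before the newline ends in . ? ! or :
    "next_starts_upper": 1.0,     # text after the newline starts with a capital
    "next_starts_lower": -2.0,    # text after the newline starts in lower case
}


def newline_features(prev_line, next_line):
    """Describe the context around one newline as a small feature dict."""
    prev_line = prev_line.rstrip()
    next_line = next_line.lstrip()
    return {
        "bias": 1.0,
        "prev_ends_with_punct": 1.0 if prev_line.endswith((".", "?", "!", ":")) else 0.0,
        "next_starts_upper": 1.0 if next_line[:1].isupper() else 0.0,
        "next_starts_lower": 1.0 if next_line[:1].islower() else 0.0,
    }


def break_probability(prev_line, next_line):
    """Logistic score: estimated probability that this newline ends a sentence."""
    feats = newline_features(prev_line, next_line)
    score = sum(WEIGHTS[name] * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-score))


def split_on_newlines(text, threshold=0.5):
    """Join lines across newlines unless the break probability reaches the threshold."""
    lines = [ln for ln in text.split("\n") if ln.strip()]
    if not lines:
        return []
    sentences = [lines[0].strip()]
    for prev_line, next_line in zip(lines, lines[1:]):
        if break_probability(prev_line, next_line) >= threshold:
            sentences.append(next_line.strip())   # newline treated as a sentence break
        else:
            sentences[-1] += " " + next_line.strip()  # newline treated as a line wrap
    return sentences
```

With these toy weights, a note whose lines wrap mid-sentence (for example "Patient reports chest pain" followed by "radiating to the left arm.") is joined back into one sentence, while a line ending in a period followed by a capitalized line is still split. A trained model would replace the hand-set weights and would likely use a richer feature set.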