Date: Tue, 5 Feb 2013 20:05:13 +0000 (UTC)
From: "James Joseph Masanz (JIRA)"
To: ctakes-notifications@incubator.apache.org
Reply-To: ctakes-dev@incubator.apache.org
Subject: [jira] [Commented] (CTAKES-145) inconsistent handling of upper ascii

[ https://issues.apache.org/jira/browse/CTAKES-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13571677#comment-13571677 ]

James Joseph Masanz commented on CTAKES-145:
--------------------------------------------

cTAKES pipelines written to accept CDA input (CDA is a specific XML format) create a plaintext view and replace any non-(basic) ASCII character with a blank. All the main processing is then done on that plaintext view.
cTAKES pipelines written to accept plaintext do not replace upper ASCII characters (like the degree symbol used here: °C). I created the JIRA issue this morning to track this.

I propose having cTAKES, by default, accept UTF-8, not just (basic) ASCII, even when input is CDA. Moving beyond a single-byte character set should not affect any of the offset processing cTAKES does.

One consideration is that none of the training data used for the sentence detector, part-of-speech tagger, or chunker included such characters.

What other considerations can people think of? Any objections?

-- James

> inconsistent handling of upper ascii
> -------------------------------------
>
> Key: CTAKES-145
> URL: https://issues.apache.org/jira/browse/CTAKES-145
> Project: cTAKES
> Issue Type: Task
> Components: ctakes-preprocessor
> Affects Versions: future enhancement
> Reporter: James Joseph Masanz
> Priority: Minor
>
> Currently cTAKES handles characters above ASCII 127 differently depending on whether the pipeline processes CDA (Clinical Document Architecture XML) or expects plain text.
> The CDA pipelines, as an early step, create a plaintext view in which each upper ASCII character is replaced by a blank.
> The plaintext pipelines do not do anything special for upper ASCII characters.
> Example input text for plaintext, to show this behavior:
> His name is Gërman. Temp is 98 °C taken on the forehead
> Need to decide whether this inconsistent behavior is OK or whether we should change one or the other to make them consistent.
> See ClinicalNotePreProcessor.java
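
For anyone following the thread, here is a minimal Java sketch of the two behaviors discussed above, applied to the example sentence from the issue. It is not the cTAKES code: the class name UpperAsciiDemo and the method blankNonAscii are hypothetical, and the actual logic in ClinicalNotePreProcessor.java may differ. It only illustrates that a one-for-one blank replacement keeps character offsets stable, and that Java String offsets count chars rather than UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

public class UpperAsciiDemo {

    // Hypothetical helper (an assumption, not cTAKES code): replaces every
    // char above 127 with a single blank, as the issue describes for the
    // CDA pipelines.
    public static String blankNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(c > 127 ? ' ' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example text from the issue; \u00EB is e-diaeresis, \u00B0 is the degree sign.
        String input = "His name is G\u00EBrman. Temp is 98 \u00B0C taken on the forehead";
        String blanked = blankNonAscii(input);

        // CDA-style view: the two non-ASCII characters become blanks.
        System.out.println(blanked);

        // One char is replaced by one char, so offsets into the text are unchanged.
        System.out.println(input.length() == blanked.length());
        System.out.println(input.indexOf("Temp") == blanked.indexOf("Temp"));

        // The UTF-8 encoding is longer than the char count (each of the two
        // characters takes 2 bytes), but String offsets are char-based.
        int utf8Bytes = input.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(utf8Bytes - input.length());
    }
}
```

A plaintext pipeline that leaves the text untouched would see the original string, with the same length and the same offsets as the blanked version. Characters outside the Basic Multilingual Plane occupy two chars in a Java String, so they would become two blanks here, which still preserves the length.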