incubator-ctakes-notifications mailing list archives

From "James Joseph Masanz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CTAKES-145) inconsistent handling of upper ascii
Date Tue, 05 Feb 2013 20:05:13 GMT

    [ https://issues.apache.org/jira/browse/CTAKES-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13571677#comment-13571677 ]

James Joseph Masanz commented on CTAKES-145:
--------------------------------------------


cTAKES pipelines written to accept CDA (a specific XML format) input create a plaintext
view and replace every character outside basic ASCII with a blank. All the main processing is
then done on that plaintext view.
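
A minimal sketch of the replacement described above, for illustration only. The class and
method names (AsciiBlanker, blankNonAscii) are hypothetical and are not taken from
ClinicalNotePreProcessor.java; the sketch assumes "basic ASCII" means code points 0 to 127 and
that each replaced character becomes a single space so that string length is preserved.

```java
// Hypothetical helper, NOT the actual cTAKES preprocessor code.
final class AsciiBlanker {

    // Replace every char above 127 with a blank, one for one,
    // so the result has the same length as the input.
    static String blankNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(c <= 127 ? c : ' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example sentence from the issue description; \u00eb is e-diaeresis, \u00b0 is the degree sign.
        String input = "His name is G\u00ebrman. Temp is 98 \u00b0C taken on the forehead";
        String output = blankNonAscii(input);
        System.out.println(output);
        // Length is unchanged because the replacement is one char for one char.
        System.out.println(input.length() == output.length());
    }
}
```

Because the substitution is one character for one character, offsets into the blanked view
still line up with the original text; what is lost is the character itself.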

cTAKES pipelines written to accept plaintext do not replace upper ASCII characters (like
the degree symbol used here: °C).

I created the JIRA issue this morning to track this. 
I propose having cTAKES, by default, accept UTF-8 - not just (basic) ASCII - even when input
is CDA.  Going beyond a single-byte character set should not affect any of the offset
processing cTAKES does.
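
A small sketch of why offsets need not change, under the assumption that the offsets in
question are indexes into a Java String (UTF-16 code units) rather than byte positions in the
file. The class and method names (OffsetCheck, charOffset, utf8ByteLength) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not cTAKES code.
final class OffsetCheck {

    // Offset of the first occurrence of needle, counted in Java chars.
    static int charOffset(String text, String needle) {
        return text.indexOf(needle);
    }

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8ByteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "Temp is 98 \u00b0C"; // \u00b0 is the degree sign
        // The degree sign is one Java char but two bytes in UTF-8,
        // so char offsets and byte offsets diverge after it.
        System.out.println("chars: " + text.length());
        System.out.println("UTF-8 bytes: " + utf8ByteLength(text));
        System.out.println("char offset of C: " + charOffset(text, "C"));
    }
}
```

One caveat: characters outside the Basic Multilingual Plane occupy two Java chars each, so
code that counts "characters" rather than chars would need to account for that.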

One consideration is that none of the training data used for the sentence detector,
part-of-speech tagger, or chunker included such characters.

What other considerations can people think of?

Any objections?

-- James


                
> inconsistent handling of upper ascii 
> -------------------------------------
>
>                 Key: CTAKES-145
>                 URL: https://issues.apache.org/jira/browse/CTAKES-145
>             Project: cTAKES
>          Issue Type: Task
>          Components: ctakes-preprocessor
>    Affects Versions: future enhancement
>            Reporter: James Joseph Masanz
>            Priority: Minor
>
> Currently cTAKES handles characters above ASCII 127 differently depending on whether it is
> using a pipeline that processes CDA (Clinical Document Architecture XML) or a pipeline that
> expects plain text.
> The CDA pipelines, as an early step, create a plaintext view in which each upper ASCII
> character is replaced by a blank.
> The plaintext pipelines do not do anything special for upper ASCII characters.
> Example input text for plaintext, to show this behavior: 
> His name is Gërman. Temp is 98 °C taken on the forehead
> Need to decide whether this inconsistent behavior is OK or whether we should change one
> or the other to make them consistent.
> See ClinicalNotePreProcessor.java

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
