incubator-ctakes-dev mailing list archives

From "Masanz, James J." <>
Subject [DISCUSS] FW: [jira] [Created] (CTAKES-145) inconsistent handling of upper ascii
Date Tue, 05 Feb 2013 20:03:39 GMT

cTAKES pipelines written to accept CDA input (CDA is a specific XML format) create a plaintext
view and replace every character outside basic ASCII with a blank. All the main processing is then
done on that plaintext view.

cTAKES pipelines written to accept plaintext do not replace upper ASCII characters (like
the degree symbol used here: °C).
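
A minimal sketch of the difference between the two behaviors, for illustration only. The class and method names (AsciiBlanking, blankNonAscii) are hypothetical and are not the actual cTAKES preprocessor code; the sketch assumes the CDA path swaps each character above code 127 for a single space, which keeps the text length unchanged. The non-ASCII characters are written as \u escapes so the example does not depend on the source file encoding.

```java
final class AsciiBlanking {

    // Hypothetical helper: replace every char above 127 with one blank,
    // so the output has the same length as the input.
    static String blankNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(c <= 127 ? c : ' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00eb is e-diaeresis, \u00b0 is the degree sign.
        String input = "His name is G\u00ebrman. Temp is 98 \u00b0C taken on the forehead";

        // What a CDA-style pipeline would see under the assumption above.
        System.out.println(blankNonAscii(input));

        // What a plaintext pipeline would see: the input, untouched.
        System.out.println(input);
    }
}
```

Because one character is replaced by exactly one blank, later offsets still line up with the original text; only the content at those positions differs between the two pipelines.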

I created the JIRA issue this morning to track this. 
I propose having cTAKES, by default, accept UTF-8 - not just (basic) ASCII - even when input
is CDA. Moving beyond a single-byte character set should not affect any of the offset processing cTAKES does.
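
A rough way to check the offset claim, assuming offsets are indices into a Java String (UTF-16 code units) rather than byte positions in the encoded file. That is an assumption about how the pipeline stores offsets, not something verified here, and the names below (OffsetCheck, charOffset, utf8ByteOffset) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class OffsetCheck {

    // Offset of token measured in Java chars (UTF-16 code units).
    static int charOffset(String text, String token) {
        return text.indexOf(token);
    }

    // Offset of token measured in bytes of the UTF-8 encoding.
    // Returns -1 if the token is not present.
    static int utf8ByteOffset(String text, String token) {
        int idx = text.indexOf(token);
        if (idx < 0) {
            return -1;
        }
        return text.substring(0, idx).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // \u00b0 is the degree sign; it takes two bytes in UTF-8 but one Java char.
        String text = "Temp is 98 \u00b0C taken";

        System.out.println("char offset: " + charOffset(text, "taken"));
        System.out.println("byte offset: " + utf8ByteOffset(text, "taken"));
    }
}
```

The two numbers differ by one because the degree sign is a single Java char but two UTF-8 bytes. Code that works on String indices is unaffected by the file encoding; code that works on raw bytes would not be. Characters outside the Basic Multilingual Plane occupy two Java chars each, so they would still need separate consideration.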

One consideration is that none of the training data used for the sentence detector,
part-of-speech tagger, or chunker included such characters.

What other considerations can people think of?

Any objections?

-- James

> -----Original Message-----
> From: ctakes-notifications-return-287-
> [mailto:ctakes-notifications-
>] On Behalf Of James
> Joseph Masanz (JIRA)
> Sent: Tuesday, February 05, 2013 10:36 AM
> To:
> Subject: [jira] [Created] (CTAKES-145) inconsistent handling of upper
> ascii
> James Joseph Masanz created CTAKES-145:
> ------------------------------------------
>              Summary: inconsistent handling of upper ascii
>                  Key: CTAKES-145
>                  URL:
>              Project: cTAKES
>           Issue Type: Task
>           Components: ctakes-preprocessor
>     Affects Versions: future enhancement
>             Reporter: James Joseph Masanz
>             Priority: Minor
> Currently cTAKES handles characters above ASCII 127 differently depending on
> whether the pipeline processes CDA (Clinical Document Architecture
> XML) or expects plain text.
> The CDA pipelines, as an early step, create a plaintext view that has each
> upper ASCII character replaced by a blank.
> The plaintext pipelines do not do anything special for upper ASCII
> characters.
> Example input text for plaintext, to show this behavior:
> His name is Gërman. Temp is 98 °C taken on the forehead
> Need to decide whether this inconsistent behavior is OK, or whether we should
> change one or the other to make them consistent.
> See
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators For more information on JIRA, see: