incubator-ctakes-dev mailing list archives

From Steven Bethard <steven.beth...@Colorado.EDU>
Subject Re: [DISCUSS] FW: [jira] [Created] (CTAKES-145) inconsistent handling of upper ascii
Date Tue, 05 Feb 2013 20:11:00 GMT
On Feb 5, 2013, at 1:03 PM, "Masanz, James J." <Masanz.James@mayo.edu> wrote:
> I propose having cTAKES, by default, accept UTF8 - not just (basic) ASCII

Yes please. Anything that replaces characters instead of using the correct encoding is
just a bug waiting to happen.
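
A minimal Python sketch of the difference, assuming the input arrives as UTF-8 bytes. The sample text and the function names are made up for illustration and are not taken from cTAKES:

```python
# Illustrative only: contrast lossy ASCII handling with decoding as UTF-8.
# TEXT and both function names are hypothetical, not part of cTAKES.

TEXT = "Sjögren’s syndrome, temp 38.5 °C"
RAW = TEXT.encode("utf-8")  # bytes as they might arrive from a UTF-8 file


def decode_lossy_ascii(data: bytes) -> str:
    """Treat input as ASCII and substitute U+FFFD for every byte outside it."""
    return data.decode("ascii", errors="replace")


def decode_utf8(data: bytes) -> str:
    """Decode with the encoding the text was actually written in."""
    return data.decode("utf-8")


lossy = decode_lossy_ascii(RAW)
faithful = decode_utf8(RAW)

print(lossy)     # non-ASCII characters are gone, replaced by U+FFFD
print(faithful)  # identical to the original text
```

The lossy version silently discards the original characters, so a later component that needs them (for dictionary lookup, say) has no way to recover them.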

> One consideration is that none of the training data used for the sentence detector, part
> of speech tagger or chunker included such characters.

Might be worth running the current models over such text just to make sure things don't break
horribly. I wouldn't expect them to, but you never know…
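
A rough sketch of what such a check could look like, in Python for brevity. Everything here is an assumption: `whitespace_spans` is only a stand-in for a real model wrapper, the sample sentences are invented, and the check covers just two things (no exception, and offsets that stay inside the text):

```python
# Hypothetical smoke test: feed non-ASCII text to an annotator and report
# crashes or out-of-range offsets. Not cTAKES code; names are illustrative.
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets

SAMPLES = [
    "Pt c/o fever of 38.5 °C.",
    "History of Sjögren’s syndrome and Ménière’s disease.",
    "Dose: 50 µg daily.",
]


def whitespace_spans(text: str) -> List[Span]:
    """Stand-in annotator: offsets of whitespace-separated tokens."""
    spans: List[Span] = []
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        spans.append((start, len(text)))
    return spans


def smoke_test(annotate: Callable[[str], List[Span]],
               samples: List[str]) -> List[str]:
    """Return a list of human-readable failures; empty means nothing broke."""
    failures: List[str] = []
    for text in samples:
        try:
            spans = annotate(text)
        except Exception as exc:  # any crash counts as a failure here
            failures.append(f"{text!r}: raised {exc!r}")
            continue
        for begin, end in spans:
            if not (0 <= begin < end <= len(text)):
                failures.append(f"{text!r}: bad span ({begin}, {end})")
    return failures


if __name__ == "__main__":
    for line in smoke_test(whitespace_spans, SAMPLES) or ["no failures"]:
        print(line)
```

A check like this only shows that nothing crashes. It says nothing about whether accuracy degrades on text with characters the models never saw in training.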

Steve