ctakes-notifications mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CTAKES-254) Apostrophe in contraction breaks TokenizerPTB
Date Mon, 04 Nov 2013 16:09:17 GMT

    [ https://issues.apache.org/jira/browse/CTAKES-254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812953#comment-13812953 ]

ASF subversion and git services commented on CTAKES-254:
--------------------------------------------------------

Commit 1538660 from chenpei@apache.org in branch 'ctakes/trunk'
[ https://svn.apache.org/r1538660 ]

CTAKES-254 - Add empty string check Apostrophe in contraction breaks TokenizerPTB
CTAKES-256 - Add a test doc that contains various edge cases for regression testing

> Apostrophe in contraction breaks TokenizerPTB
> ---------------------------------------------
>
>                 Key: CTAKES-254
>                 URL: https://issues.apache.org/jira/browse/CTAKES-254
>             Project: cTAKES
>          Issue Type: Bug
>          Components: ctakes-core
>    Affects Versions: 3.1
>            Reporter: Pei Chen
>            Priority: Blocker
>             Fix For: 3.1.1
>
>
> Sample text: "on n'tion"
> The single char followed by apostrophe will break the TokenizerPTB.
> What the heck?
> Results in an OutOfBoundsException at
> org.apache.ctakes.core.nlp.tokenizer.TokenizerPTB.setNumPosition(TokenizerPTB.java:1147)
> Sean Finan already had a patch for this some time ago, but just wanted to see if we missed something else here:
> See below to add a check for empty string in the token:
> Starting at line 1145:
> // START
> private void setNumPosition(WordToken wta, String tokenText) {
>       if ( tokenText.isEmpty() ) {
>          // was getting ioobE from tokenText.charAt(..)
>          // Possibilities like this (empty, null) should always be checked
>          // - but I wonder that we get (want) empty tokens at all.
>          // I believe that working with zero-length words is a bug, and this is not a fix;
>          // it merely avoids a crash.
>          wta.setNumPosition( TokenizerAnnotator.TOKEN_NUM_POS_NONE );
>          return;
>       }
>       if (isDigit(tokenText.charAt(0)))  {
> // END
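
For readers without the cTAKES source at hand, the failure mode and the guard can be reproduced in isolation. The following is only a minimal sketch and not the TokenizerPTB code: the class name EmptyTokenGuardSketch and both method names are hypothetical, and the null check is an extra assumption that goes beyond the isEmpty() check in the patch quoted above.

```java
// Hypothetical stand-in for the guard discussed above; this is NOT the cTAKES source.
final class EmptyTokenGuardSketch {

    // Unguarded: calls charAt(0) without checking the length first,
    // so an empty token text throws StringIndexOutOfBoundsException.
    static boolean startsWithDigitUnguarded(String tokenText) {
        return Character.isDigit(tokenText.charAt(0));
    }

    // Guarded: an empty token text is reported as "no leading digit"
    // instead of crashing. The null check is an added assumption.
    static boolean startsWithDigitGuarded(String tokenText) {
        if (tokenText == null || tokenText.isEmpty()) {
            return false;
        }
        return Character.isDigit(tokenText.charAt(0));
    }

    public static void main(String[] args) {
        try {
            startsWithDigitUnguarded("");
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded call failed: " + e.getClass().getSimpleName());
        }
        System.out.println(startsWithDigitGuarded(""));
        System.out.println(startsWithDigitGuarded("3rd"));
    }
}
```

As the comment in the quoted patch points out, a guard like this only avoids the crash; it does not explain why a zero-length token reaches this point in the first place.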



--
This message was sent by Atlassian JIRA
(v6.1#6144)
