ctakes-notifications mailing list archives

From "Pei Chen (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CTAKES-254) Apostrophe in contraction breaks TokenizerPTB
Date Tue, 29 Oct 2013 22:01:25 GMT
Pei Chen created CTAKES-254:
-------------------------------

             Summary: Apostrophe in contraction breaks TokenizerPTB
                 Key: CTAKES-254
                 URL: https://issues.apache.org/jira/browse/CTAKES-254
             Project: cTAKES
          Issue Type: Bug
          Components: ctakes-core
    Affects Versions: 3.1
            Reporter: Pei Chen
            Priority: Blocker
             Fix For: 3.1.1


Sample text: "on n'tion"
A single character followed by an apostrophe breaks TokenizerPTB.
This results in an IndexOutOfBoundsException at:
org.apache.ctakes.core.nlp.tokenizer.TokenizerPTB.setNumPosition(TokenizerPTB.java:1147)

Sean Finan already had a patch for this some time ago, but I wanted to check whether we
missed something else here.
The patch below adds a check for an empty token string, starting at line 1145:

// START

private void setNumPosition(WordToken wta, String tokenText) {
      if ( tokenText.isEmpty() ) {
         // was getting ioobE from tokenText.charAt(..)
         // Possibilities like this (empty, null) should always be checked
         // - but I wonder that we get (want) empty tokens at all.
         // I believe that working with zero-length words is a bug, and this is not a fix;
         // it merely avoids a crash.
         wta.setNumPosition( TokenizerAnnotator.TOKEN_NUM_POS_NONE );
         return;
      }

    if (isDigit(tokenText.charAt(0)))  {

// END
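
For anyone who wants to reproduce the failure outside of cTAKES, here is a minimal standalone sketch of the same guard. It is not the TokenizerPTB code: the class name NumPositionSketch, the method numPosition, and the constants NUM_POS_NONE and NUM_POS_FIRST (and their values) are hypothetical stand-ins for illustration. It only shows that charAt(0) on an empty string throws, and that checking for an empty token first avoids the crash.

```java
// Standalone sketch; all names and constant values here are hypothetical,
// not the actual cTAKES TokenizerPTB / TokenizerAnnotator definitions.
final class NumPositionSketch {
    static final int NUM_POS_NONE = 0;   // assumed value, for illustration only
    static final int NUM_POS_FIRST = 1;  // assumed value, for illustration only

    // Returns true if reading the first character without a guard throws.
    static boolean unguardedThrows(String tokenText) {
        try {
            tokenText.charAt(0);
            return false;
        } catch (IndexOutOfBoundsException e) {
            // String.charAt throws StringIndexOutOfBoundsException on an empty string
            return true;
        }
    }

    // Guarded version: a null or empty token is classified as "no number position".
    static int numPosition(String tokenText) {
        if (tokenText == null || tokenText.isEmpty()) {
            return NUM_POS_NONE; // avoids the crash; does not explain the empty token
        }
        if (Character.isDigit(tokenText.charAt(0))) {
            return NUM_POS_FIRST;
        }
        return NUM_POS_NONE;
    }

    public static void main(String[] args) {
        System.out.println(unguardedThrows("")); // true
        System.out.println(numPosition(""));     // 0
        System.out.println(numPosition("3rd"));  // 1
        System.out.println(numPosition("n"));    // 0
    }
}
```

As the comment in the patch says, this kind of guard only hides the symptom; where the zero-length token comes from for input like "on n'tion" still needs to be tracked down.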




--
This message was sent by Atlassian JIRA
(v6.1#6144)
