ctakes-notifications mailing list archives

From "Pei Chen (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (CTAKES-254) Apostrophe in contraction breaks TokenizerPTB
Date Mon, 04 Nov 2013 16:09:20 GMT

     [ https://issues.apache.org/jira/browse/CTAKES-254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pei Chen resolved CTAKES-254.
-----------------------------

    Resolution: Fixed
      Assignee: Pei Chen

Fixed in trunk

> Apostrophe in contraction breaks TokenizerPTB
> ---------------------------------------------
>
>                 Key: CTAKES-254
>                 URL: https://issues.apache.org/jira/browse/CTAKES-254
>             Project: cTAKES
>          Issue Type: Bug
>          Components: ctakes-core
>    Affects Versions: 3.1
>            Reporter: Pei Chen
>            Assignee: Pei Chen
>            Priority: Blocker
>             Fix For: 3.1.1
>
>
> Sample text: "on n'tion"
> A single character followed by an apostrophe breaks TokenizerPTB.
> What the heck?
> Results in an IndexOutOfBoundsException at
> org.apache.ctakes.core.nlp.tokenizer.TokenizerPTB.setNumPosition(TokenizerPTB.java:1147)
> Sean Finan already had a patch for this some time ago, but I just wanted to see if we missed something else here:
> See below to add a check for empty string in the token:
> Starting at line 1145:
> // START
> private void setNumPosition(WordToken wta, String tokenText) {
>       if ( tokenText.isEmpty() ) {
>          // was getting ioobE from tokenText.charAt(..)
>          // Possibilities like this (empty, null) should always be checked
>          // - but I wonder that we get (want) empty tokens at all.
>          // I believe that working with zero-length words is a bug, and this is not a fix;
>          // it merely avoids a crash.
>          wta.setNumPosition( TokenizerAnnotator.TOKEN_NUM_POS_NONE );
>          return;
>       }
>       if (isDigit(tokenText.charAt(0))) {
> // END
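
For readers who want to see the failure and the guard in isolation, here is a minimal standalone sketch. It is not the cTAKES source: the class name, the two constants, and the simplified return-value logic are assumptions made for illustration only, and the real method sets a value on a WordToken rather than returning one.

```java
// Hypothetical stand-in for the guard quoted above; not the cTAKES implementation.
final class EmptyTokenGuardSketch {
    // Assumed placeholder values; the real constants live in TokenizerAnnotator.
    static final int NUM_POS_NONE = 0;
    static final int NUM_POS_FIRST = 1;

    // Unguarded check: charAt(0) throws on an empty string, which is the crash in the report.
    static boolean unguardedStartsWithDigit(String tokenText) {
        return Character.isDigit(tokenText.charAt(0));
    }

    // Guarded variant: null or empty token text short-circuits before charAt(0) is reached.
    static int numPosition(String tokenText) {
        if (tokenText == null || tokenText.isEmpty()) {
            return NUM_POS_NONE;
        }
        return Character.isDigit(tokenText.charAt(0)) ? NUM_POS_FIRST : NUM_POS_NONE;
    }

    public static void main(String[] args) {
        System.out.println(numPosition(""));    // empty token, handled by the guard
        System.out.println(numPosition("4mg")); // token with a leading digit
        try {
            unguardedStartsWithDigit("");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unguarded call threw " + e.getClass().getSimpleName());
        }
    }
}
```

As the quoted comment notes, a guard like this only avoids the crash; it does not explain why the tokenizer produced a zero-length token in the first place.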



--
This message was sent by Atlassian JIRA
(v6.1#6144)
