Mailing-List: contact notifications-help@ctakes.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@ctakes.apache.org
Date: Wed, 13 Nov 2013 15:03:25 +0000 (UTC)
From: "Tim Miller (JIRA)" <jira@apache.org>
To: notifications@ctakes.apache.org
Message-ID: <JIRA.12679033.1384354976496.66703.1384355005073@arcas>
In-Reply-To: <JIRA.12679033.1384354976496@arcas>
References: <JIRA.12679033.1384354976496@arcas>
Subject: [jira] [Created] (CTAKES-266) tokenizer creates empty tokens before
 contractions
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Tim Miller created CTAKES-266:
---------------------------------

             Summary: tokenizer creates empty tokens before contractions
                 Key: CTAKES-266
                 URL: https://issues.apache.org/jira/browse/CTAKES-266
             Project: cTAKES
          Issue Type: Bug
          Components: ctakes-core
    Affects Versions: 3.1
            Reporter: Tim Miller
            Assignee: Tim Miller
            Priority: Minor
             Fix For: 3.1.1


Normally contractions are tokenized as follows:

don't = do + n't

The code in ContractionsPTB creates a WordToken for "do" and a ContractionToken for "n't". (There is some special logic for n't.) Problems arise when n't appears with no preceding text. In my case it was some non-clinical text ("surf n'turf"), but you can imagine typos as well ("do n't"). In these cases the text preceding n't is empty, because n't sits at the start of the token, and the code creates an empty WordToken, which can break downstream components (I noticed it in the parser). This can be fixed easily by checking for a length of 0 before creating the WordToken.
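
A rough sketch of the proposed guard, for illustration only. This is not the actual ContractionsPTB code: the class ContractionSplitSketch, the Token holder, and the method splitNt are hypothetical stand-ins, and the sketch only handles tokens that end in n't (it does not model the "n'turf" case or any other contraction logic).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the n't splitting step; NOT the real ContractionsPTB.
final class ContractionSplitSketch {

    // Minimal token holder: kind is "WordToken" or "ContractionToken" (assumed labels).
    static final class Token {
        final String kind;
        final String text;
        final int begin;
        final int end;

        Token(String kind, String text, int begin, int end) {
            this.kind = kind;
            this.text = text;
            this.begin = begin;
            this.end = end;
        }
    }

    // Splits a raw token ending in n't into word + contraction.
    // offset is the position of the raw token in the document text.
    static List<Token> splitNt(String raw, int offset) {
        List<Token> out = new ArrayList<>();
        int n = raw.length();
        boolean endsWithNt = n >= 3 && raw.regionMatches(true, n - 3, "n't", 0, 3);
        if (!endsWithNt) {
            // Not an n't contraction: keep it as a single word (skip empty input).
            if (n > 0) {
                out.add(new Token("WordToken", raw, offset, offset + n));
            }
            return out;
        }
        int split = n - 3; // length of the text preceding n't
        // The guard: only create the word token if the preceding text is non-empty.
        if (split > 0) {
            out.add(new Token("WordToken", raw.substring(0, split), offset, offset + split));
        }
        out.add(new Token("ContractionToken", raw.substring(split), offset + split, offset + n));
        return out;
    }
}
```

With this guard, splitNt("don't", 0) yields a word token for "do" followed by a contraction token for "n't", while a bare "n't" (as in the "do n't" typo) yields only the contraction token, so no zero-length WordToken reaches downstream components.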

