ctakes-notifications mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CTAKES-266) tokenizer creates empty tokens before contractions
Date Wed, 13 Nov 2013 15:33:56 GMT

    [ https://issues.apache.org/jira/browse/CTAKES-266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13821424#comment-13821424 ]

ASF subversion and git services commented on CTAKES-266:
--------------------------------------------------------

Commit 1541553 from [~tmill] in branch 'ctakes/trunk'
[ https://svn.apache.org/r1541553 ]

Fixes CTAKES-266. Checks for zero-length word token before creating token before contraction.

> tokenizer creates empty tokens before contractions
> --------------------------------------------------
>
>                 Key: CTAKES-266
>                 URL: https://issues.apache.org/jira/browse/CTAKES-266
>             Project: cTAKES
>          Issue Type: Bug
>          Components: ctakes-core
>    Affects Versions: 3.1
>            Reporter: Tim Miller
>            Assignee: Tim Miller
>            Priority: Minor
>             Fix For: 3.1.1
>
>
> Normally contractions are tokenized as follows:
> don't = do + n't
> And the code in ContractionsPTB will create a WordToken for the do and a ContractionToken
> for the n't. (There is some special logic for n't.) There are some weird cases with n't with
> no preceding text. In my case it was some non-clinical text ("surf n'turf") but you can imagine
> typos as well (do n't). In these cases the preceding text is actually empty since it is the
> start of the token, and the code will create an empty WordToken, which can screw up downstream
> components (I noticed it in the parser). This can be fixed easily by checking for token length
> of 0 before creating the word token.
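
The check described above can be sketched roughly as follows. This is only an illustration, not the code from ContractionsPTB or from commit 1541553: the class name ContractionSplitSketch and the split helper are hypothetical, plain strings stand in for WordToken and ContractionToken, and only tokens that end in n't are handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the contraction logic; not the cTAKES implementation.
final class ContractionSplitSketch {

    private static final String SUFFIX = "n't";

    // Splits a token ending in n't into [prefix, "n't"].
    // The prefix is skipped when it is empty, so no zero-length piece is produced.
    static List<String> split(String token) {
        List<String> pieces = new ArrayList<>();
        int start = token.length() - SUFFIX.length();

        // regionMatches returns false when start is negative, so short tokens are safe.
        boolean endsWithContraction =
                token.regionMatches(true, start, SUFFIX, 0, SUFFIX.length());
        if (!endsWithContraction) {
            pieces.add(token);
            return pieces;
        }

        String prefix = token.substring(0, start);
        if (prefix.length() > 0) {      // the zero-length check from the issue
            pieces.add(prefix);         // would become the word token
        }
        pieces.add(token.substring(start)); // would become the contraction token
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(split("don't")); // [do, n't]
        System.out.println(split("n't"));   // [n't]
    }
}
```

Without the length check, the second call would yield an empty string ahead of the n't piece, which corresponds to the empty WordToken reported in the issue.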



--
This message was sent by Atlassian JIRA
(v6.1#6144)
