Date: Wed, 13 Nov 2013 15:03:25 +0000 (UTC)
From: "Tim Miller (JIRA)"
To: notifications@ctakes.apache.org
Reply-To: dev@ctakes.apache.org
Subject: [jira] [Created] (CTAKES-266) tokenizer creates empty tokens before contractions

Tim Miller created CTAKES-266:
---------------------------------

             Summary: tokenizer creates empty tokens before contractions
                 Key: CTAKES-266
                 URL: https://issues.apache.org/jira/browse/CTAKES-266
             Project: cTAKES
          Issue Type: Bug
          Components: ctakes-core
    Affects Versions: 3.1
            Reporter: Tim Miller
            Assignee: Tim Miller
            Priority: Minor
             Fix For: 3.1.1

Normally contractions are tokenized as follows:

don't = do + n't

And the code in ContractionsPTB will create a WordToken for the "do" and a ContractionToken for the "n't". (There is some special logic for n't.)

There are some weird cases with n't with no preceding text.
In my case it was some non-clinical text ("surf n'turf"), but you can imagine typos as well ("do n't"). In these cases the preceding text is actually empty, since the n't is at the start of the token, and the code will create an empty WordToken, which can screw up downstream components (I noticed it in the parser). This can be fixed easily by checking for a token length of 0 before creating the word token.
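
A rough sketch of the proposed check, for illustration only. This is not the actual ContractionsPTB code: the class name, the split method, and the string labels standing in for WordToken and ContractionToken are all made up here, and it only handles a token that ends in n't.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the contraction logic; not the cTAKES implementation.
public class ContractionSplitSketch {

    private static final String NT = "n't";

    // Splits a raw token into labelled pieces. The labels are plain strings
    // standing in for the real WordToken / ContractionToken annotations.
    public static List<String> split(String token) {
        List<String> pieces = new ArrayList<>();
        if (token.isEmpty()) {
            return pieces; // nothing to annotate
        }
        int cut = token.length() - NT.length();
        boolean endsWithNt = cut >= 0
                && token.regionMatches(true, cut, NT, 0, NT.length());
        if (!endsWithNt) {
            pieces.add("WordToken:" + token);
            return pieces;
        }
        String prefix = token.substring(0, cut);
        // The guard from the report: only create the word piece when some
        // text actually precedes the n't.
        if (prefix.length() > 0) {
            pieces.add("WordToken:" + prefix);
        }
        pieces.add("ContractionToken:" + token.substring(cut));
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(split("don't")); // [WordToken:do, ContractionToken:n't]
        System.out.println(split("n't"));   // [ContractionToken:n't]
    }
}
```

Without the length check, the second call would also emit a word piece with empty text, which is the empty WordToken described above.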