Date: Thu, 30 Apr 2015 20:15:08 +0000 (UTC)
From: "Pei Chen (JIRA)"
To: notifications@ctakes.apache.org
Reply-To: dev@ctakes.apache.org
Subject: [jira] [Updated] (CTAKES-155) SimpleSegmentWithTagsAnnotator assumes all section names are 5 characters

     [ https://issues.apache.org/jira/browse/CTAKES-155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pei Chen updated CTAKES-155:
----------------------------
    Fix Version/s:     (was: 3.2.2)
                   3.2.3

> SimpleSegmentWithTagsAnnotator assumes all section names are 5 characters
> -------------------------------------------------------------------------
>
>                 Key: CTAKES-155
>                 URL: https://issues.apache.org/jira/browse/CTAKES-155
>             Project: cTAKES
>          Issue Type: Bug
>          Components: ctakes-core
>    Affects Versions: 3.0-incubating
>            Reporter: Steven Bethard
>             Fix For: 3.2.3
>
>
> The code in SimpleSegmentWithTagsAnnotator is a bit hard to follow, but I believe it assumes all section names are 5 characters long here:
> {code:java}
> fileReader.read(sectIdArr, 0, 5);
> {code}
> As a result, when the section name is longer than that, some part of the section heading (e.g. for a 6-letter section name, the final "]") is left in the text of the next section. This results, for example, in the dependency parser choking:
> {code:java}
> Caused by: java.lang.NullPointerException
> 	at clear.pos.PosEnLib.isNoun(PosEnLib.java:56)
> 	at clear.morph.MorphEnAnalyzer.getException(MorphEnAnalyzer.java:273)
> 	at clear.morph.MorphEnAnalyzer.getLemma(MorphEnAnalyzer.java:247)
> {code}
> I would fix this but:
> (1) There are no tests for SimpleSegmentWithTagsAnnotator, and its documentation actually says "Creates a single segment annotation that spans the entire document", which is just untrue, so I'm not really sure what this annotator is intended to do.
> (2) Even if I make some assumptions about what it's intended to do, the code is written in an extremely brittle fashion, and I'm afraid to make changes to it.
> For what it's worth, here's what I think the annotator should really look like:
> {code:java}
>   public static class SegmentsFromBracketedSectionTagsAnnotator extends JCasAnnotator_ImplBase {
>     private static Pattern SECTION_PATTERN =
>         Pattern.compile("(\\[start section id=\"?(.*?)\"?\\]).*?(\\[end section id=\"?(.*?)\"?\\])", Pattern.DOTALL);
>     @Override
>     public void process(JCas jCas) throws AnalysisEngineProcessException {
>       Matcher matcher = SECTION_PATTERN.matcher(jCas.getDocumentText());
>       while (matcher.find()) {
>         Segment segment = new Segment(jCas);
>         segment.setBegin(matcher.start() + matcher.group(1).length());
>         segment.setEnd(matcher.end() - matcher.group(3).length());
>         segment.setId(matcher.group(2));
>         segment.addToIndexes();
>       }
>     }
>   }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
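
The regex and offset arithmetic proposed in the issue can be tried without UIMA or cTAKES on the classpath. The following is a minimal sketch under stated assumptions, not the project's implementation: it reuses the pattern from the issue verbatim, while the class name SectionTagDemo, the Span holder (a stand-in for the cTAKES Segment type), the method findSections, and the sample text with its section ids are all hypothetical and chosen only for illustration. It is meant to show that the id is read from the tag itself, so ids of any length should be handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Standalone sketch: no UIMA or cTAKES types, only the regex and the offset arithmetic. */
final class SectionTagDemo {

    /** Hypothetical stand-in for a cTAKES Segment: an id plus character offsets. */
    static final class Span {
        final String id;
        final int begin;
        final int end;

        Span(String id, int begin, int end) {
            this.id = id;
            this.begin = begin;
            this.end = end;
        }
    }

    // Pattern copied from the issue: group 1 is the whole start tag, group 2 its id,
    // group 3 is the whole end tag. DOTALL lets the section body span several lines.
    private static final Pattern SECTION_PATTERN = Pattern.compile(
        "(\\[start section id=\"?(.*?)\"?\\]).*?(\\[end section id=\"?(.*?)\"?\\])",
        Pattern.DOTALL);

    /** Returns one Span per bracketed section, covering only the text between the tags. */
    static List<Span> findSections(String text) {
        List<Span> spans = new ArrayList<>();
        Matcher matcher = SECTION_PATTERN.matcher(text);
        while (matcher.find()) {
            // Skip past the start tag and stop before the end tag, whatever their lengths.
            int begin = matcher.start() + matcher.group(1).length();
            int end = matcher.end() - matcher.group(3).length();
            spans.add(new Span(matcher.group(2), begin, end));
        }
        return spans;
    }

    public static void main(String[] args) {
        // Made-up sample: one 5-character id and one 6-character id.
        String text =
            "[start section id=\"ABCDE\"]Chest pain.[end section id=\"ABCDE\"]\n"
          + "[start section id=\"ABCDEF\"]No fever.[end section id=\"ABCDEF\"]";
        for (Span span : findSections(text)) {
            System.out.println(span.id + ": " + text.substring(span.begin, span.end));
        }
    }
}
```

Because the tag lengths come from the matched groups rather than a fixed read of 5 characters, no part of a longer heading should leak into the section text. The sketch does not check that the start and end ids agree, and it does not handle nested sections; whether the real annotator needs either is not stated in the issue.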