Date: Thu, 6 Apr 2017 15:55:41 +0000 (UTC)
From: "James Joseph Masanz (JIRA)"
To: notifications@ctakes.apache.org
Reply-To: dev@ctakes.apache.org
Subject: [jira] [Updated] (CTAKES-155) SimpleSegmentWithTagsAnnotator assumes all section names are 5 characters

     [ https://issues.apache.org/jira/browse/CTAKES-155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Joseph Masanz updated CTAKES-155:
---------------------------------------
    Fix Version/s:     (was: 3.2.3)
                   future enhancement

> SimpleSegmentWithTagsAnnotator assumes all section names are 5 characters
> -------------------------------------------------------------------------
>
>                 Key: CTAKES-155
>                 URL: https://issues.apache.org/jira/browse/CTAKES-155
>             Project: cTAKES
>          Issue Type: Bug
>          Components: ctakes-core
>    Affects Versions: 3.0-incubating
>            Reporter: Steven Bethard
>             Fix For: future enhancement
>
>
> The code in SimpleSegmentWithTagsAnnotator is a bit hard to follow, but I believe it assumes all sections are 5 characters long here:
> {code:java}
> fileReader.read(sectIdArr, 0, 5);
> {code}
> As a result, when the section name is longer than that, some part of the section heading (e.g. for a 6 letter section name, the final "]") is left in the text of the next section.
> This results, for example, in the dependency parser choking:
> {code:java}
> Caused by: java.lang.NullPointerException
> 	at clear.pos.PosEnLib.isNoun(PosEnLib.java:56)
> 	at clear.morph.MorphEnAnalyzer.getException(MorphEnAnalyzer.java:273)
> 	at clear.morph.MorphEnAnalyzer.getLemma(MorphEnAnalyzer.java:247)
> {code}
> I would fix this but:
> (1) There are no tests for SimpleSegmentWithTagsAnnotator and its documentation actually says "Creates a single segment annotation that spans the entire document", which is just untrue, so I'm not really sure what this annotator is intended to do.
> (2) Even if I make some assumptions about what it's intended to do, the code is written in an extremely brittle fashion, and I'm afraid to make changes to that.
> For what it's worth, here's what I think the annotator should really look like:
> {code:java}
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> import org.apache.ctakes.typesystem.type.textspan.Segment;
> import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
> import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
> import org.apache.uima.jcas.JCas;
>
> public static class SegmentsFromBracketedSectionTagsAnnotator extends JCasAnnotator_ImplBase {
>   // group 1: start tag, group 2: start id, group 3: end tag, group 4: end id
>   private static final Pattern SECTION_PATTERN =
>     Pattern.compile("(\\[start section id=\"?(.*?)\"?\\]).*?(\\[end section id=\"?(.*?)\"?\\])", Pattern.DOTALL);
>   @Override
>   public void process(JCas jCas) throws AnalysisEngineProcessException {
>     Matcher matcher = SECTION_PATTERN.matcher(jCas.getDocumentText());
>     while (matcher.find()) {
>       // The segment covers only the text between the start and end tags.
>       Segment segment = new Segment(jCas);
>       segment.setBegin(matcher.start() + matcher.group(1).length());
>       segment.setEnd(matcher.end() - matcher.group(3).length());
>       segment.setId(matcher.group(2));
>       segment.addToIndexes();
>     }
>   }
> }
> {code}
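
For anyone who wants to try the idea without a UIMA setup, here is a minimal standalone sketch that contrasts a fixed-width id read with the regex from the proposal above. It is not the annotator's actual code. The class name `SectionTagDemo`, the helper `textAfterFixedWidthId`, the sample document, and the assumed tag layout (`[start section id="..."]`, inferred from the regex) are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionTagDemo {

    // Same expression as in the proposal quoted above.
    private static final Pattern SECTION_PATTERN = Pattern.compile(
        "(\\[start section id=\"?(.*?)\"?\\]).*?(\\[end section id=\"?(.*?)\"?\\])",
        Pattern.DOTALL);

    // Assumed layout of a start tag up to the id (hypothetical, inferred from the regex).
    private static final String START_PREFIX = "[start section id=\"";

    /**
     * Hypothetical stand-in for a fixed-width read: treats the id as exactly
     * idWidth characters, skips the closing quote and bracket, and returns
     * whatever is then taken to be the section text.
     */
    static String textAfterFixedWidthId(String doc, int idWidth) {
        int afterTag = START_PREFIX.length() + idWidth + 2; // +2 for the closing "]
        return doc.substring(afterTag);
    }

    /** Returns one "id=text" entry per bracketed section found by the regex. */
    static List<String> sections(String doc) {
        List<String> result = new ArrayList<>();
        Matcher m = SECTION_PATTERN.matcher(doc);
        while (m.find()) {
            int begin = m.start() + m.group(1).length(); // first char after the start tag
            int end = m.end() - m.group(3).length();     // first char of the end tag
            result.add(m.group(2) + "=" + doc.substring(begin, end));
        }
        return result;
    }

    public static void main(String[] args) {
        // A section whose id has 6 characters instead of 5.
        String doc = "[start section id=\"123456\"]Cough.[end section id=\"123456\"]";

        // Fixed-width read: the stray "]" leaks into the section text.
        System.out.println(textAfterFixedWidthId(doc, 5)); // ]Cough.[end section id="123456"]

        // Regex-based read: the id length does not matter.
        System.out.println(sections(doc)); // [123456=Cough.]
    }
}
```

With a 6-character id, the fixed-width variant leaves the final "]" at the start of the text, which matches the symptom described in the issue, while the regex variant recovers the id and the text regardless of id length. Note that the pattern as proposed does not check that the end id equals the start id.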