incubator-ctakes-notifications mailing list archives

From "Steven Bethard (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CTAKES-155) SimpleSegmentWithTagsAnnotator assumes all section names are 5 characters
Date Fri, 08 Feb 2013 05:19:15 GMT
Steven Bethard created CTAKES-155:
-------------------------------------

             Summary: SimpleSegmentWithTagsAnnotator assumes all section names are 5 characters
                 Key: CTAKES-155
                 URL: https://issues.apache.org/jira/browse/CTAKES-155
             Project: cTAKES
          Issue Type: Bug
          Components: ctakes-core
    Affects Versions: 3.0-incubating
            Reporter: Steven Bethard
             Fix For: 3.1-incubating


The code in SimpleSegmentWithTagsAnnotator is a bit hard to follow, but I believe it assumes
all section names are exactly 5 characters long here:

{code:java}
	fileReader.read(sectIdArr, 0, 5);
{code}

As a result, when the section name is longer than that, part of the section heading (e.g.
for a 6-letter section name, the final "]") is left in the text of the next section. This
results, for example, in the dependency parser choking:

{code:java}
Caused by: java.lang.NullPointerException
	at clear.pos.PosEnLib.isNoun(PosEnLib.java:56)
	at clear.morph.MorphEnAnalyzer.getException(MorphEnAnalyzer.java:273)
	at clear.morph.MorphEnAnalyzer.getLemma(MorphEnAnalyzer.java:247)
{code}
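
To illustrate the failure mode, here is a minimal standalone sketch of a fixed-width read. It is not the actual SimpleSegmentWithTagsAnnotator code: the class name, the helper method, and the surrounding skip logic are hypothetical, and the heading format is assumed from the "[start section id=...]" tags used in the regex proposed in this report. Only the read(sectIdArr, 0, 5) call is taken from the annotator.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of cTAKES.
class FixedWidthReadDemo {
    // Assumed heading prefix, based on the tag format in the proposed regex.
    static final String PREFIX = "[start section id=";

    /**
     * Skips the prefix, reads exactly 5 characters as the section id,
     * skips one more character (assumed to be the closing "]"),
     * and returns whatever text remains.
     */
    static String textAfterFixedWidthId(String doc) {
        try (Reader fileReader = new StringReader(doc)) {
            fileReader.skip(PREFIX.length());
            char[] sectIdArr = new char[5];
            fileReader.read(sectIdArr, 0, 5); // the call quoted in this report
            fileReader.skip(1);               // assumption: step over "]"
            StringBuilder rest = new StringBuilder();
            int c;
            while ((c = fileReader.read()) != -1) {
                rest.append((char) c);
            }
            return rest.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 5-character id: the heading is consumed cleanly.
        System.out.println(textAfterFixedWidthId("[start section id=12345]Text"));
        // 6-character id: the last id character is skipped instead of "]",
        // so the bracket ends up at the start of the section text.
        System.out.println(textAfterFixedWidthId("[start section id=123456]Text"));
    }
}
```

Under these assumptions, the first call yields "Text" and the second yields "]Text", which is the kind of stray bracket that then reaches downstream components.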

I would fix this but:

(1) There are no tests for SimpleSegmentWithTagsAnnotator and its documentation actually
says "Creates a single segment annotation that spans the entire document", which is just untrue,
so I'm not really sure what this annotator is intended to do.

(2) Even if I make some assumptions about what it's intended to do, the code is written in
an extremely brittle fashion, and I'm afraid to make changes to it. For what it's worth,
here's what I think the annotator should really look like:

{code:java}
  public static class SegmentsFromBracketedSectionTagsAnnotator extends JCasAnnotator_ImplBase {
    // Group 1: whole start tag, group 2: section id, group 3: whole end tag.
    private static final Pattern SECTION_PATTERN = Pattern.compile(
        "(\\[start section id=\"?(.*?)\"?\\]).*?(\\[end section id=\"?(.*?)\"?\\])",
        Pattern.DOTALL);

    @Override
    public void process(JCas jCas) throws AnalysisEngineProcessException {
      Matcher matcher = SECTION_PATTERN.matcher(jCas.getDocumentText());
      while (matcher.find()) {
        Segment segment = new Segment(jCas);
        segment.setBegin(matcher.start() + matcher.group(1).length());
        segment.setEnd(matcher.end() - matcher.group(3).length());
        segment.setId(matcher.group(2));
        segment.addToIndexes();
      }
    }
  }
{code}
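
As a rough check of the offset arithmetic, the same pattern can be exercised outside UIMA on a plain string. This is only a sketch: the class name, the sections helper, and the sample input are made up for illustration, and the result is returned as "id=body" strings rather than Segment annotations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; mirrors the proposed annotator without UIMA types.
class SectionPatternDemo {
    private static final Pattern SECTION_PATTERN = Pattern.compile(
        "(\\[start section id=\"?(.*?)\"?\\]).*?(\\[end section id=\"?(.*?)\"?\\])",
        Pattern.DOTALL);

    /** Returns one "id=body" entry per bracketed section found in the text. */
    static List<String> sections(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SECTION_PATTERN.matcher(text);
        while (matcher.find()) {
            // Same offsets the annotator would pass to setBegin/setEnd:
            // the body starts after the start tag and stops before the end tag.
            int begin = matcher.start() + matcher.group(1).length();
            int end = matcher.end() - matcher.group(3).length();
            result.add(matcher.group(2) + "=" + text.substring(begin, end));
        }
        return result;
    }

    public static void main(String[] args) {
        String doc = "[start section id=\"20112\"]Patient is well.[end section id=\"20112\"]\n"
                   + "[start section id=2011234]Second\nsection.[end section id=2011234]";
        for (String s : sections(doc)) {
            System.out.println(s);
        }
    }
}
```

Section ids of any length, quoted or unquoted, are handled the same way here. Note that the pattern as written does not require the id in the end tag to equal the id in the start tag; a backreference to group 2 would be one way to enforce that if it matters.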

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
