ctakes-notifications mailing list archives

From "James Joseph Masanz (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CTAKES-227) Broca's -> PunctuationToken instead of ContractionToken - caused by apostrophe seen as sentence ending
Date Mon, 26 Aug 2013 18:21:52 GMT
James Joseph Masanz created CTAKES-227:
------------------------------------------

             Summary: Broca's -> PunctuationToken instead of ContractionToken - caused
by apostrophe seen as sentence ending
                 Key: CTAKES-227
                 URL: https://issues.apache.org/jira/browse/CTAKES-227
             Project: cTAKES
          Issue Type: Bug
          Components: ctakes-core
    Affects Versions: 3.1
            Reporter: James Joseph Masanz
            Assignee: James Joseph Masanz



The recently rebuilt sentence detector (currently in trunk and the 3.1.0 branch) sometimes
treats an apostrophe as a sentence break where the ctakes-3.0.0-incubating model did not.

The training data used for the recently rebuilt model contains only 7 lines that end
with an apostrophe (single quote) followed immediately by a newline.

It has >100K occurrences of 's

It has >175K occurrences of the ' character in all.
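
Counts like the ones above could be reproduced with a short script. The following is only
a minimal sketch, not the script actually used for this issue: the function name
apostrophe_stats and the sample text are hypothetical, and it assumes the training data
is plain text using the ASCII apostrophe (').

```python
def apostrophe_stats(text):
    """Count apostrophe patterns in sentence-detector training text.

    Hypothetical helper for illustration only; it is not part of cTAKES.
    Only the ASCII apostrophe (') is counted, not the typographic one.
    """
    lines = text.split("\n")
    # Every piece except the last was followed by a newline in the input,
    # so these are the lines where an apostrophe sits right before "\n".
    line_final = sum(1 for line in lines[:-1] if line.endswith("'"))
    return {
        "lines_ending_in_apostrophe": line_final,
        "apostrophe_s": text.count("'s"),
        "apostrophes_total": text.count("'"),
    }


if __name__ == "__main__":
    # Made-up sample text, not taken from the real training data.
    sample = "Broca's area\nthe patient'\nit's fine'"
    print(apostrophe_stats(sample))
```

On real data, the text would be read from the training file first. A large gap between
line-final apostrophes and total apostrophes shows how few boundary examples involve
this character.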

The place I noticed this is in testfakenote.txt.xml in ctakes-regression-test.

The word "Broca's" used to have a ContractionToken, but since a sentence is now ending on the
apostrophe, the apostrophe is getting annotated as a PunctuationToken.
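
The effect can be illustrated with a toy model. This is not cTAKES code: the functions
toy_tokenize and tokenize_sentences are hypothetical, the token type names are borrowed
only as labels, and the sketch assumes that tokens are produced one sentence at a time,
so a token cannot span a sentence boundary.

```python
import re


def toy_tokenize(sentence):
    """Toy tokenizer: "'s" is a contraction, any other non-letter is punctuation."""
    tokens = []
    # Alternatives are tried left to right: "'s", then a run of letters,
    # then any single character that is neither a letter nor whitespace.
    for match in re.finditer(r"'s\b|[A-Za-z]+|[^\sA-Za-z]", sentence):
        text = match.group()
        if text == "'s":
            kind = "ContractionToken"
        elif text.isalpha():
            kind = "WordToken"
        else:
            kind = "PunctuationToken"
        tokens.append((text, kind))
    return tokens


def tokenize_sentences(text, sentence_ends):
    """Tokenize each sentence separately; sentence_ends are end offsets in text."""
    tokens = []
    start = 0
    for end in list(sentence_ends) + [len(text)]:
        tokens.extend(toy_tokenize(text[start:end]))
        start = end
    return tokens


if __name__ == "__main__":
    text = "Broca's area"
    print(tokenize_sentences(text, []))   # no break inside the word
    print(tokenize_sentences(text, [6]))  # sentence ends right after the apostrophe
```

With no boundary inside the word, "'s" stays together and is labeled as a contraction.
With a boundary at offset 6, the apostrophe is left alone at the end of the first
sentence and falls through to the punctuation case.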


See more in the thread started at
http://markmail.org/message/wavipejszlspzo5u
including examples that split correctly and incorrectly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
