ctakes-dev mailing list archives

From "Masanz, James J." <Masanz.Ja...@mayo.edu>
Subject apostrophe and sentence detector
Date Mon, 26 Aug 2013 16:05:21 GMT

The recently rebuilt sentence detector (currently in trunk and the 3.1.0 branch) is sometimes
taking the apostrophe as a sentence break where the ctakes-3.0.0-incubating model didn't.

The training data used for the recently rebuilt model contains only 7 lines that end
with an apostrophe (single quote).

It has >100K occurrences of 's

It has >175K occurrences of the ' character in all.

The place I noticed this is in testfakenote.txt.xml in ctakes-regression-test.

The word "Broca's" used to get a ContractionToken, but since a sentence now ends on the
apostrophe, the apostrophe is getting annotated as a PunctuationToken instead.

Since I don't see anything obviously wrong with the training data, I'm pondering the idea
of a rule that would run after the sentence detector model and rejoin any sentence split
that occurs at an ' when the ' is immediately followed by a letter (any letter, not just
an s) and immediately preceded by a non-whitespace character.

Some examples that currently split incorrectly, using a vertical bar to show where the
sentence detector splits them:
The patient also was concerned about a small lesion in his Broca'|s area|
Broca'|s|
Isn'|t|
The pain isn'|t preventing Don'|s daily walks.|

An example that currently splits correctly:
The aspirin isn't stopping Don's pain.|
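
A rough sketch of the rejoin rule described above, written in Python for brevity (the
real annotator would be Java inside cTAKES). The function names and the (begin, end)
offset representation of sentences are made up for illustration and are not cTAKES APIs.
It assumes the spans are sorted, non-empty, and non-overlapping.

```python
def spans_from_bars(marked):
    """Parse the vertical-bar notation used in this message.

    Returns the text with the bars removed and a list of (begin, end)
    character offsets, one per sentence. Hypothetical helper, only here
    so the examples above can be fed to the rule.
    """
    pieces = marked.split("|")
    text = "".join(pieces)
    spans, pos = [], 0
    for piece in pieces:
        if piece:
            spans.append((pos, pos + len(piece)))
            pos += len(piece)
    return text, spans


def rejoin_apostrophe_splits(text, spans):
    """Merge adjacent sentences that were split right after an apostrophe.

    A split is undone only when the earlier sentence ends with ', the next
    sentence starts at that same offset with a letter, and the character
    before the ' is not whitespace.
    """
    merged = []
    for begin, end in spans:
        if merged:
            prev_begin, prev_end = merged[-1]
            if (
                prev_end == begin                      # nothing between the two sentences
                and prev_end >= 2                      # there is a character before the '
                and text[prev_end - 1] == "'"          # earlier sentence ends on the '
                and not text[prev_end - 2].isspace()   # ' is preceded by non-whitespace
                and text[begin].isalpha()              # ' is followed by a letter
            ):
                merged[-1] = (prev_begin, end)
                continue
        merged.append((begin, end))
    return merged


text, spans = spans_from_bars("Broca'|s|")
print(rejoin_apostrophe_splits(text, spans))  # [(0, 7)]
```

Because the merge compares against the last span already in the output, a run of bad
splits such as "isn'|t preventing Don'|s daily walks.|" collapses into one sentence. A
closing quote followed by a space and a new sentence is left alone, since the character
after the ' is not a letter.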

Anyone have any other suggestions?

-- James

