ctakes-dev mailing list archives

From Steven Bethard <steven.beth...@gmail.com>
Subject Re: question about sentence segmentation
Date Sat, 02 Aug 2014 12:58:22 GMT
On Sat, Aug 2, 2014 at 7:43 AM, Miller, Timothy
<Timothy.Miller@childrens.harvard.edu> wrote:
> PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear to auscultation
> CV: regular rate and rhythm without murmur or gallop , S1, S2 normal, no murmur, click, rub
> or gal*, chest is clear without rales or wheezing, no pedal edema, no JVD, no hepatosplenomegaly
> Breast: negative findings right/left breast with mild swelling, warmth, mild erythema, slightly
> tender, no seroma or hematoma Abdomen: Abdomen soft, non-tender.
>
> It would be preferable to me to put sentence breaks in between the sections, so the first
> two sentences would be:
>
> 1) PE: Lymphonodes...
> 2) Lungs: normal...
[snip]
> Another example that breaks our model in a different way (truncated):
> 1. Baseline labwork including tumor markers  2. Start DD AC on Friday 8/1 with RN chemo
> teach  3. S U parent study
[snip]
> Here it would be preferable to get:
> 1.
> Baseline labwork...
> 2.
> Start DD...
> 3.
> S U parent study

Seems like rather than specifying a set of "candidate characters", we
want to specify a candidate boundary regular expression. Something
like \p{P}|\b\p{Lu}|\b\p{N} should cover all of the above cases:
sentence boundaries may appear at punctuation marks, at uppercase
letters after a word boundary, and at numbers after a word boundary.
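
To make the idea concrete, here is a minimal sketch of how such an
expression could be used to enumerate candidate positions with
java.util.regex. The class and method names (CandidateBoundaries,
candidateOffsets) are hypothetical and not part of cTAKES; the sketch
only lists candidates and assumes some downstream classifier would
still decide which of them are real sentence boundaries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a cTAKES class: it only finds candidate positions.
class CandidateBoundaries {
    // The expression from the message: any punctuation mark, or an uppercase
    // letter or a number character that starts at a word boundary.
    static final Pattern CANDIDATE =
            Pattern.compile("\\p{P}|\\b\\p{Lu}|\\b\\p{N}");

    // Returns the character offsets where a candidate match begins.
    static List<Integer> candidateOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // A fragment of the second example quoted above.
        String text = "1. Baseline labwork including tumor markers  2. Start DD AC";
        for (int offset : candidateOffsets(text)) {
            // Print each candidate offset with the text that follows it.
            System.out.println(offset + "\t" + text.substring(offset));
        }
    }
}
```

Note that this over-generates on purpose: every capitalized word and
every punctuation mark becomes a candidate (for example, each of "DD"
and "AC" above), so the cost is more positions for the model to
classify rather than missed boundaries.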

Steve
