ctakes-dev mailing list archives

From "John Green" <john.travis.gr...@gmail.com>
Subject Re: question about sentence segmentation
Date Sat, 02 Aug 2014 17:19:03 GMT
I was thinking the same thing as Steve. That's a pretty regular onc physical exam, so why not just
split sentences with regexes keyed off a small list of defined onc physical exam terms? The interesting
case would be "breast", as this term may appear in the body of a sentence (rather than just
as a section term), but you could use a regex sub-match where you conditionally match "breast" first, then one
or more key physical findings, to correctly identify THAT "breast" token as the term, i.e. the
beginning of the sentence. I would recommend red-flag physical findings, as they are more likely
to always be in the body of the sentence, for example, "Breast: no lumps or masses palpable."
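
A minimal sketch of the term-list idea, in Python for brevity (cTAKES itself is Java). The term list and the function name split_on_headers are hypothetical, and the "followed by key physical findings" check is simplified here to "followed by a colon", which is only an assumption about how these notes are formatted:

```python
import re

# Hypothetical term list; a real one would come from the exam template in use.
SECTION_TERMS = ["PE", "Lymphnodes", "Lungs", "CV", "Breast", "Abdomen"]

# Assumption: a term counts as a section header only when a colon follows it
# directly, so "breast" inside a finding is not treated as a boundary.
HEADER = re.compile(r"\b(?:" + "|".join(map(re.escape, SECTION_TERMS)) + r"):")


def split_on_headers(text):
    """Split text into segments that each start at a section header."""
    starts = [m.start() for m in HEADER.finditer(text)]
    # Keep any leading text that comes before the first header.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    segments = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in segments if s]
```

On the PE example quoted below, this would leave "right/left breast with mild swelling" inside the "Breast:" segment, since that lowercase "breast" has no colon after it. It also emits "PE:" as its own segment, so a caller wanting "PE: Lymphnodes..." kept together would have to merge header-only segments with the next one.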


I have a few other ideas if that's barking up the right tree.




JG

On Sat, Aug 2, 2014 at 8:58 AM, Steven Bethard <steven.bethard@gmail.com>
wrote:

> On Sat, Aug 2, 2014 at 7:43 AM, Miller, Timothy
> <Timothy.Miller@childrens.harvard.edu> wrote:
>> PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear to auscultation
>> CV: regular rate and rhythm without murmur or gallop , S1, S2 normal, no murmur, click, rub
>> or gal*, chest is clear without rales or wheezing, no pedal edema, no JVD, no hepatosplenomegaly
>> Breast: negative findings right/left breast with mild swelling, warmth, mild erythema, slightly
>> tender, no seroma or hematoma Abdomen: Abdomen soft, non-tender.
>>
>> It would be preferable to me to put sentence breaks in between the sections, so the
>> first two sentences would be:
>>
>> 1) PE: Lymphonodes...
>> 2) Lungs: normal...
> [snip]
>> Another example that breaks our model in a different way (truncated):
>> 1. Baseline labwork including tumor markers  2. Start DD AC on Friday 8/1 with RN
>> chemo teach  3. S U parent study
> [snip]
>> Here it would be preferable to get:
>> 1.
>> Baseline labwork...
>> 2.
>> Start DD...
>> 3.
>> S U parent study
> Seems like rather than specifying a set of "candidate characters", we
> want to specify a candidate boundary regular expression. Something
> like, \p{P}|\b\p{Lu}|\b\p{N}, should cover all of the above cases:
> sentence boundaries may appear at punctuation marks, at uppercase
> letters after word boundaries, and at numbers after word boundaries.
> Steve
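
To make Steve's candidate-boundary expression concrete, here is a rough sketch in Python. Python's standard `re` module does not support the \p{...} Unicode property classes used in his pattern, so the sketch substitutes approximate stand-ins (an assumption, not the expression from the thread): `[^\w\s]` for \p{P}, `[A-Z]` for \p{Lu}, and `\d` for \p{N}. The name candidate_offsets is hypothetical. It only lists positions where a boundary may occur; deciding which candidates are real boundaries is still left to the model.

```python
import re

# Approximation of \p{P}|\b\p{Lu}|\b\p{N} using classes that Python's
# stdlib `re` supports (assumption: ASCII-style clinical text).
CANDIDATE = re.compile(r"[^\w\s]|\b[A-Z]|\b\d")


def candidate_offsets(text):
    """Return character offsets where a sentence boundary may be proposed."""
    return [m.start() for m in CANDIDATE.finditer(text)]
```

On the numbered-list example above, the offsets of "1", ".", "Baseline", "2", and "Start" would all be candidates, while lowercase words such as "labwork" would not.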