ctakes-dev mailing list archives

From "Miller, Timothy" <Timothy.Mil...@childrens.harvard.edu>
Subject Re: question about sentence segmentation
Date Mon, 04 Aug 2014 13:02:59 GMT
Very pleased to see so many people offer suggestions! Comparing some of
these different methods might make an interesting student project.

Sean:
> Just an fyi.  Does that make sense?  Haven't had my coffee ...
Makes perfect sense. The downside is that it requires some kind of
higher-level understanding during sentence segmentation to work out what
the hierarchy is. You could imagine something that looks similar but has
a different logical structure. Long term, a big joint model that does
all of these things simultaneously is definitely something I'm
interested in.

Steve:
> Seems like rather than specifying a set of "candidate characters", we
> want to specify a candidate boundary regular expression.
This might be possible with minimal changes to the model.
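
As a rough illustration of the idea (not the cTAKES implementation), the
sketch below lists the offsets that Steve's expression,
\p{P}|\b\p{Lu}|\b\p{N}, would propose as candidate boundaries. Python's
built-in re module does not support \p{...} classes, so the function
name candidate_boundaries and the unicodedata-based category checks are
assumptions made for illustration. A classifier would still have to
decide which candidates are real sentence boundaries.

```python
import unicodedata
from typing import List


def is_word_char(ch: str) -> bool:
    # Approximates the regex "word character" class: letters, digits, underscore.
    return ch.isalnum() or ch == "_"


def candidate_boundaries(text: str) -> List[int]:
    # Offsets where a boundary may be proposed: any punctuation character,
    # or an uppercase letter or number that starts a word.
    offsets = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        starts_word = i == 0 or not is_word_char(text[i - 1])
        if category.startswith("P"):
            offsets.append(i)
        elif starts_word and (category == "Lu" or category.startswith("N")):
            offsets.append(i)
    return offsets


text = "PE: Lymphnodes: neck Lungs: normal 2. Start"
print([(i, text[i]) for i in candidate_boundaries(text)])
```

Note that this only widens the candidate set. Section headings such as
"Lungs:" and list markers such as "2." become eligible positions, and
the existing model would still score each one.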


John:
> why not just split sentences with regex's off a small list of defined
> onc physical exam terms?
My preference for vanilla ctakes is always to do basic linguistic things
like tokenization and sentence segmentation without reference to
context-specific rules, because such rules make those components less
portable. Obviously for specific use cases or applications (like what
Britt is probably doing) you will use whatever information makes sense
for your domain. But I think we could get maybe 75% of the remaining
cases (which are probably only 5% of the total number of cases) by using
smarter boundary conditions like Steve suggested.
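
For a domain-specific pipeline of the kind John describes, a minimal
sketch might look like the following. The heading list SECTION_TERMS and
the function split_on_headings are hypothetical, and the sketch uses the
trailing colon to tell a heading from a body word, which is a simpler
check than the follow-on physical findings match John proposes.

```python
import re

# Hypothetical heading list for illustration only; a real list would be
# drawn from notes in the target domain.
SECTION_TERMS = ["PE", "Lymphnodes", "Lungs", "CV", "Breast", "Abdomen"]

# A term counts as a heading only when it starts a word and is followed
# directly by a colon, so "breast" inside a sentence body is ignored.
# The lookbehind keeps a heading that directly follows another heading's
# colon (as in "PE: Lymphnodes:") attached to that earlier heading.
HEADING = re.compile(
    r"(?<!: )\b(?:" + "|".join(map(re.escape, SECTION_TERMS)) + r"):"
)


def split_on_headings(text):
    # Split the text so that each piece starts at a recognised heading.
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(text)]
    pieces = [text[a:b].strip() for a, b in zip(starts, ends)]
    return [p for p in pieces if p]


note = "PE: Lymphnodes: neck Lungs: normal Breast: left breast tender"
for piece in split_on_headings(note):
    print(piece)
```

This works only as far as the heading list matches the notes at hand,
which is the portability cost mentioned above.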

Thanks again,
Tim


On 08/02/2014 01:26 PM, John Green wrote:
> I was thinking the same thing as Steve. That's a pretty regular onc
> physical exam, why not just split sentences with regex's off a small
> list of defined onc physical exam terms? The interesting case would be
> breast, as this term may appear in the body of a sentence (rather than
> just as a term), but you could use a regex sub match where you
> conditionally match breast first, then one or more key physical
> findings, to correctly identify THAT breast word token as the term, eg
> beginning of the sentence. I would recommend red flag physical
> findings as they are more likely to always be in the body of the
> sentence, for example, Breast: no lumps or masses palpable.
>
> I have a few other ideas if that's barking up the right tree.
>
> JG
>
> On Sat, Aug 2, 2014 at 8:58 AM, Steven Bethard <steven.bethard@gmail.com>
> wrote:
>
>> On Sat, Aug 2, 2014 at 7:43 AM, Miller, Timothy
>> <Timothy.Miller@childrens.harvard.edu> wrote:
>>> PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and
>>> clear to auscultation CV: regular rate and rhythm without murmur or
>>> gallop , S1, S2 normal, no murmur, click, rub or gal*, chest is clear
>>> without rales or wheezing, no pedal edema, no JVD, no
>>> hepatosplenomegaly Breast: negative findings right/left breast with
>>> mild swelling, warmth, mild erythema, slightly tender, no seroma or
>>> hematoma Abdomen: Abdomen soft, non-tender.
>>>
>>> It would be preferable to me to put sentence breaks in between the
>>> sections, so the first two sentences would be:
>>>
>>> 1) PE: Lymphnodes...
>>> 2) Lungs: normal...
>> [snip]
>>> Another example that breaks our model in a different way (truncated):
>>> 1. Baseline labwork including tumor markers  2. Start DD AC on Friday
>>> 8/1 with RN chemo teach  3. S U parent study
>> [snip]
>>> Here it would be preferable to get:
>>> 1.
>>> Baseline labwork...
>>> 2.
>>> Start DD...
>>> 3.
>>> S U parent study
>> Seems like rather than specifying a set of "candidate characters", we
>> want to specify a candidate boundary regular expression. Something
>> like, \p{P}|\b\p{Lu}|\b\p{N}, should cover all of the above cases:
>> sentence boundaries may appear at punctuation marks, at uppercase
>> letters after word boundaries, and at numbers after word boundaries.
>> Steve

