ctakes-dev mailing list archives

From "Finan, Sean" <Sean.Fi...@childrens.harvard.edu>
Subject RE: question about sentence segmentation
Date Sat, 02 Aug 2014 12:46:05 GMT
Hi Tim,

> It would be preferable to me to put sentence breaks in between the sections, so
> the first two sentences would be:
> 
> 1) PE: Lymphnodes...
> 2) Lungs: normal...

The punctuation is (always) after the logical break: it is the colon in "Term: " for a
Term:Definition list.  I think that the first three sentences should be
1) PE:
2) Lymphnodes: neck and ...
3) Lungs: normal and ...
Where the first line is an overarching Term: sentence (tree root), because each Term:Definition
line that follows is within the physical exam.
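
The Term:Definition split described above can be sketched in a few lines of Python.
This is only an illustration, not how the OpenNLP-based cTAKES sentence detector
works. The LABEL pattern and the function name are hypothetical, and the pattern
assumes that a label is a single capitalized word immediately followed by a colon.

```python
import re

# Assumption: a section label is a single capitalized word immediately
# followed by a colon, e.g. "PE:" or "Lungs:". Real notes will need more.
LABEL = re.compile(r"\b[A-Z][A-Za-z]*:")


def split_on_labels(text):
    """Start a new segment at every label; a bare label is its own segment."""
    starts = [m.start() for m in LABEL.finditer(text)]
    if not starts:
        stripped = text.strip()
        return [stripped] if stripped else []
    segments = []
    lead = text[:starts[0]].strip()
    if lead:
        segments.append(lead)  # text before the first label, if any
    bounds = starts + [len(text)]
    for begin, end in zip(bounds, bounds[1:]):
        segments.append(text[begin:end].strip())
    return segments


note = ("PE: Lymphnodes: neck and axilla without adenopathy "
        "Lungs: normal and clear to auscultation")
for segment in split_on_labels(note):
    print(segment)
```

Because "PE:" is immediately followed by another label, it comes out as a segment
of its own, which gives the overarching root sentence. Multi-word labels and
lowercase labels would be missed by this pattern.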

Just an fyi.  Does that make sense?  Haven't had my coffee ...
Sean

> -----Original Message-----
> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
> Sent: Saturday, August 02, 2014 7:44 AM
> To: dev@ctakes.apache.org
> Subject: RE: question about sentence segmentation
> 
> I'm annotating some oncology notes from SHARP right now, and they are
> basically a nightmare for our current sentence segmentation model. Mainly
> because they eschew explicit markers between sentences. I thought I'd ping the
> list with some interesting examples just in case it stimulates ideas. But it seems
> to me that at some point we'll have to augment the opennlp module (preferable)
> or roll our own to handle cases like these.
> 
> In this example a bunch of background is on one line with no punctuation
> between logical breaks:
> PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear to
> auscultation CV: regular rate and rhythm without murmur or gallop , S1, S2
> normal, no murmur, click, rub or gal*, chest is clear without rales or wheezing,
> no pedal edema, no JVD, no hepatosplenomegaly Breast: negative findings
> right/left breast with mild swelling, warmth, mild erythema, slightly tender, no
> seroma or hematoma Abdomen: Abdomen soft, non-tender.
> 
> It would be preferable to me to put sentence breaks in between the sections, so
> the first two sentences would be:
> 
> 1) PE: Lymphnodes...
> 2) Lungs: normal...
> 
> but without any candidate characters to split the sentence I don't think it is
> possible.
> 
> Another example that breaks our model in a different way (truncated):
> 1. Baseline labwork including tumor markers  2. Start DD AC on Friday 8/1 with
> RN chemo teach  3. S U parent study
> 
> Our model will break on the period after the number, so we'd probably get:
> 1.
> Baseline labwork including tumor markers 2.
> Start DD.... 3.
> S U parent study
> 
> So the number is going in exactly the wrong place. Here it would be preferable
> to get:
> 1.
> Baseline labwork...
> 2.
> Start DD...
> 3.
> S U parent study
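
The preferred output above, with each list number as a segment of its own, can be
sketched with a regular expression in Python. This is a rough illustration only,
not the OpenNLP or cTAKES sentence detector. The MARKER pattern and the function
name are hypothetical, and the pattern assumes a marker is one or two digits plus
a period, preceded by whitespace (or the start of the text) and followed by
whitespace.

```python
import re

# Assumption: a list marker is 1-2 digits plus a period, standing alone
# between whitespace (or at the very start of the text).
MARKER = re.compile(r"(?<!\S)(\d{1,2}\.)(?=\s)")


def split_numbered_items(text):
    """Return markers and item texts as separate segments, in order."""
    # re.split keeps the captured markers in the result list.
    parts = MARKER.split(text)
    return [part.strip() for part in parts if part.strip()]


plan = ("1. Baseline labwork including tumor markers  2. Start DD AC on "
        "Friday 8/1 with RN chemo teach  3. S U parent study")
for segment in split_numbered_items(plan):
    print(segment)
```

The date "8/1" is left alone because it has no trailing period. A sentence that
happens to end in a short number ("... at age 45. Next ...") would be split
wrongly, so a rule like this would at best supply candidate break points.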
> 
> Anyways, just something to think about! The problem is much more complex in
> clinical data than in edited text, but I'm sure we all knew that already :)
> 
> Tim
> 
> 
> ________________________________________
> From: Miller, Timothy [Timothy.Miller@childrens.harvard.edu]
> Sent: Monday, July 28, 2014 2:38 PM
> To: dev@ctakes.apache.org
> Subject: Re: question about sentence segmentation
> 
> Yes, you're right about that, Britt. I've been doing some annotations side by side
> with a treebank viewer and think I have a pretty good handle on the actual rules.
> 
> Basically, if a header or list identifier is followed by a period or a newline, that
> point is a sentence break; otherwise the identifier is part of the sentence.
> 
> e.g.
> 
> 1. 20 mg flomax
> 
> is two sentences, while:
> 
> 1 - 20 mg flomax
> 
> is one sentence.
> 
> For headings:
> 
> Allergies: Pt is allergic to aspirin.
> 
> is one sentence, while:
> 
> Allergies:
> Pt is allergic to aspirin.
> 
> is two sentences.
> 
> I'm planning to follow these guidelines.
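
The rule above (break after a header or list identifier only when a period or a
newline follows it) can be written as a small Python sketch. This is an assumed
illustration, not the treebank guidelines or the cTAKES implementation. The
LEADING_MARKER pattern and the function name are hypothetical: a list identifier
is taken to be one or two digits, and a header is taken to be capitalized words
ending in a colon.

```python
import re

# Assumption: a leading marker is a 1-2 digit list number, or a heading made
# of capitalized words that ends in a colon.
LEADING_MARKER = re.compile(r"\s*(\d{1,2}|[A-Z][A-Za-z ]*:)")


def split_leading_marker(text):
    """Break after the marker only if a period or a newline follows it."""
    match = LEADING_MARKER.match(text)
    if not match:
        return [text.strip()]
    rest = text[match.end():]
    if re.match(r"\.(?:\s|$)", rest):
        # Period right after the marker: the marker plus period stands alone.
        head, tail = text[:match.end() + 1], rest[1:]
    elif re.match(r"[ \t]*\n", rest):
        # Newline right after the marker: the marker stands alone.
        head, tail = text[:match.end()], rest
    else:
        # Anything else: the marker stays inside the sentence.
        return [text.strip()]
    return [part for part in (head.strip(), tail.strip()) if part]


for example in ("1. 20 mg flomax",
                "1 - 20 mg flomax",
                "Allergies: Pt is allergic to aspirin.",
                "Allergies:\nPt is allergic to aspirin."):
    print(split_leading_marker(example))
```

The four examples are the ones from this message; the first and last come back
as two segments and the middle two as one. Requiring whitespace after the period
keeps a dose like "3.5 mg" from being treated as a list number.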
> 
> Tim
> 
> On 07/28/2014 01:53 PM, britt fitch wrote:
> 
> Thanks for the document, Tim. It seems to not be explicit about how to handle
> sentences occurring in lists.
> 
> Are you still considering having the list number as outside of the sentence?
> 
> Thanks
> 
> Britt
> 
> On Jul 25, 2014, at 7:09 AM, Miller, Timothy
> <Timothy.Miller@childrens.harvard.edu> wrote:
> 
> 
> 
> After checking with Guergana and other colleagues here, the advice is to have the
> sentence segmenter follow the treebank guidelines for sentence segmentation:
> http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf
> 
> They are a bit light on detail but fortunately we have some treebanked data so I
> will use that for the training data and hopefully that will illuminate the tricky
> cases.
> 
> Tim
> 
> ________________________________________
> From: Masanz, James J. [Masanz.James@mayo.edu]
> Sent: Tuesday, July 15, 2014 4:39 PM
> To: dev@ctakes.apache.org
> Subject: RE: question about sentence segmentation
> 
> Sorry, I don't know if there was a reason.
> 
> If you haven't checked with Guergana, you might want to ask her if she had a
> reason or if it was just the way it had been since that corpus was created.
> 
> -----Original Message-----
> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
> Sent: Tuesday, July 15, 2014 3:34 PM
> To: dev@ctakes.apache.org
> Subject: Re: question about sentence segmentation
> 
> Thanks James, I was hoping to hear from you. I'll probably go ahead and change
> the data to split sentences between the list header and list element.
> 
> You don't happen to know if there is any principled reason for the original style
> or whether it was just an arbitrary convention? The only thing I can think of is it
> might be hard to learn when to separate when there is no period after the list
> header (as in your examples). I think it's worth empirically checking on that
> point, but there might be other reasons that I'm not thinking of.
> 
> Thanks
> Tim
> 
> On 07/15/2014 03:27 PM, Masanz, James J. wrote:
> 
> 
> I don't have an opinion about how it should work.
> 
> But I can verify that the clinical notes from Mayo Clinic that were used in the
> initial cTAKES sentence detector model had the list markers included in the first
> sentence, so, for example, the following would be two sentences, with each line
> a separate sentence.
> 
> #1 Dilated esophagus.
> #2 Adenocarcinoma
> 
> -- James
> 
> -----Original Message-----
> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
> Sent: Tuesday, July 15, 2014 6:04 AM
> To: dev@ctakes.apache.org
> Subject: RE: question about sentence segmentation
> 
> > My preference is to treat the list row number as outside of the sentence of
> > interest. Or if it is necessary to be included in a sentence, have it be a sentence
> > on its own.
> 
> I can get behind this; I think it makes the issue a bit cleaner to have the list
> header either be non-sentential or be its own sentence. As far as I can tell, this is not
> the current default behavior. At least in my runs the list header seems to get
> attached to the first following sentence, even in cases where it starts with a digit
> and a period ("3. Magnesium oxide 400 mg p.o. daily." is all one sentence).
> This behavior is probably strongly dependent on the annotations we give the
> sentence detector so as I'm prepping new training data I should have a default in
> mind.
> 
> Does anyone have any objections to changing the sentence detector behavior to
> break list headers (things like "3." or "A " or "#5") as their own sentence?
> 
> Tim
> 
> 
> ________________________________________
> From: Britt Fitch [britt.fitch@gmail.com]
> Sent: Monday, July 14, 2014 8:29 AM
> To: dev@ctakes.apache.org
> Subject: Re: question about sentence segmentation
> 
> My preference is to treat the list row number as outside of the sentence of
> interest.
> Or if it is necessary to be included in a sentence, have it be a sentence on its
> own.
> That won't be as straightforward as splitting on a period in cases like "2.
> Magnesium oxide 400 mg p.o. daily."
> In cases where there is more than one written sentence, like your example in the
> original email, I'd prefer those were each a sentence rather than making the
> entire list line a single sentence.
> My feeling is that each line without terminating punctuation would be a single
> sentence and would exclude the list number.
> 
> As an aside, I have encountered several issues with numbered lists being
> interpreted differently depending on:
>
> 1. What number is included at the start, for example "2. Magnesium oxide 400 mg
>    p.o. daily." vs "12. Magnesium oxide 400 mg p.o. daily." (This appears to be a
>    chunking issue: the line starting with "12. Magnesium" is identified as
>    starting with chunks [O, O, B-NP, B-NP, I-NP, B-NP, B-ADVP, O] even though the
>    parts of speech appear to be correct.)
> 2. Whether there is a period at the end of a list item, for example "4. CHF" vs
>    "4. CHF." (This also appears to be an issue with the chunker, which produces
>    [O, O] in the first case and [B-VP, B-NP, O] in the second.)
> 
> Cheers,
> 
> Britt
> 
> 
> 
> On Mon, Jul 14, 2014 at 7:50 AM, Miller, Timothy
> <Timothy.Miller@childrens.harvard.edu> wrote:
> 
> 
> 
> Just curious about an edge case regarding headers/lists and wondering what
> people think the correct behavior and annotation are.
> 
> In cases like this:
> 
> #1 Dilated esophagus.
> #2 Adenocarcinoma
> 
> my intuition is that each whole line is one sentence. But then there are cases
> where the number may be followed by multiple sentences on one line.
> 1. EGD as a complex procedure. If there is an abnormality, obtain biopsies.
> 
> For this example my intuition is not as clear. Should there be a break after the
> "1." or should the first sentence be "1. EGD as a complex procedure."? Again, my
> intuition leans towards the latter but it seems a bit odd since the "1." kind of
> distributes over all the following sentences (i.e. it's like a paragraph descriptor.)
> 
> Does the period after the 1 matter? The number of sentences after the list
> header? The fact that it's all on one line? Anything else?
> 
> Tim
> 