ctakes-dev mailing list archives

From britt fitch <britt.fi...@gmail.com>
Subject Re: question about sentence segmentation
Date Mon, 28 Jul 2014 17:52:21 GMT
Thanks for the document, Tim. It does not seem to be explicit about how to handle sentences
occurring in lists.

Are you still considering treating the list number as outside the sentence?

Thanks

Britt

On Jul 25, 2014, at 7:09 AM, Miller, Timothy <Timothy.Miller@childrens.harvard.edu>
wrote:

> Checking with Guergana and other colleagues here, the advice is to have the sentence segmenter
> follow the treebank guidelines for sentence segmentation:
> http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf
> 
> They are a bit light on detail but fortunately we have some treebanked data so I will
> use that for the training data and hopefully that will illuminate the tricky cases.
> 
> Tim
> 
> ________________________________________
> From: Masanz, James J. [Masanz.James@mayo.edu]
> Sent: Tuesday, July 15, 2014 4:39 PM
> To: 'dev@ctakes.apache.org'
> Subject: RE: question about sentence segmentation
> 
> Sorry, I don't know if there was a reason.
> 
> If you haven't checked with Guergana, you might want to ask her if she had a reason or
> if it was just the way it had been since that corpus was created.
> 
> -----Original Message-----
> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
> Sent: Tuesday, July 15, 2014 3:34 PM
> To: dev@ctakes.apache.org
> Subject: Re: question about sentence segmentation
> 
> Thanks James, I was hoping to hear from you. I'll probably go ahead and
> change the data to split sentences between the list header and list element.
> 
> You don't happen to know if there is any principled reason for the
> original style or whether it was just an arbitrary convention? The only
> thing I can think of is it might be hard to learn when to separate when
> there is no period after the list header (as in your examples). I think
> it's worth empirically checking on that point, but there might be other
> reasons that I'm not thinking of.
> 
> Thanks
> Tim
> 
> On 07/15/2014 03:27 PM, Masanz, James J. wrote:
>> I don't have an opinion about how it should work.
>> 
>> But I can verify that the clinical notes from Mayo Clinic that were used in the initial
>> cTAKES sentence detector model had the list markers included in the first sentence, so, for
>> example, the following would be two sentences, with each line a separate sentence.
>> 
>> #1 Dilated esophagus.
>> #2 Adenocarcinoma
>> 
>> -- James
>> 
>> -----Original Message-----
>> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
>> Sent: Tuesday, July 15, 2014 6:04 AM
>> To: dev@ctakes.apache.org
>> Subject: RE: question about sentence segmentation
>> 
>>> My preference is to treat the list row number as outside of the sentence of
>>> interest. Or if it is necessary to be included in a sentence, have it be a sentence
>>> on its own.
>> 
>> I can get behind this; I think it makes the issue a bit cleaner to have the list header be
>> either non-sentential or its own sentence. As far as I can tell, this is not the
>> current default behavior. At least in my runs, the list header seems to get attached to the
>> first following sentence, even in cases where it starts with a digit and a period
>> ("3. Magnesium oxide 400 mg p.o. daily." is all one sentence).
>> This behavior is probably strongly dependent on the annotations we give the sentence
>> detector, so as I'm prepping new training data I should have a default in mind.
>> 
>> Does anyone have any objections to changing the sentence detector behavior to break
>> list headers (things like "3." or "A " or "#5") as their own sentence?
>> 
>> Tim
>> 
>> 
>> ________________________________________
>> From: Britt Fitch [britt.fitch@gmail.com]
>> Sent: Monday, July 14, 2014 8:29 AM
>> To: dev@ctakes.apache.org
>> Subject: Re: question about sentence segmentation
>> 
>> My preference is to treat the list row number as outside of the sentence of
>> interest.
>> Or if it is necessary to be included in a sentence, have it be a sentence
>> on its own.
>> That won't be as straightforward as splitting on a period in cases
>> like "2. Magnesium oxide 400 mg p.o. daily."
>> In cases where there is more than one written sentence, like your example in
>> the original email, I'd prefer that each be its own sentence rather than
>> making the entire list line a single sentence.
>> My feeling is that each line without terminating punctuation would be a
>> single sentence and would exclude the list number.
>> 
>> As an aside, I have encountered several issues with numbered lists being
>> interpreted differently depending on:
>> 1. what number is included at the start,
>> for example: "2. Magnesium oxide 400 mg p.o. daily." vs "12. Magnesium
>> oxide 400 mg p.o. daily." (This appears to be a chunking issue where the
>> line starting with "12. Magnesium" is identified as starting with chunks [O,
>> O, B-NP, B-NP, I-NP, B-NP, B-ADVP, O] even though the parts of speech
>> appear to be correct.)
>> 2. whether there is a period at the end of a list item,
>> for example: "4. CHF" vs "4. CHF." (This also appears to be an issue with the
>> chunker, which produces [O, O] in the first case and [B-VP, B-NP, O]
>> in the second.)
>> 
>> Cheers,
>> 
>> Britt
>> 
>> 
>> 
>> On Mon, Jul 14, 2014 at 7:50 AM, Miller, Timothy <
>> Timothy.Miller@childrens.harvard.edu> wrote:
>> 
>>> Just curious about an edge case regarding headers/lists and wondering what
>>> people think the correct behavior and annotation are.
>>> 
>>> In cases like this:
>>> 
>>> #1 Dilated esophagus.
>>> #2 Adenocarcinoma
>>> 
>>> my intuition is that each whole line is one sentence. But then there are
>>> cases where the number may be followed by multiple sentences on one line:
>>>
>>> 1. EGD as a complex procedure. If there is an abnormality, obtain biopsies.
>>> 
>>> For this example my intuition is not as clear. Should there be a break
>>> after the "1." or should the first sentence be "1. EGD as a complex
>>> procedure."? Again, my intuition leans towards the latter but it seems a
>>> bit odd since the "1." kind of distributes over all the following sentences
>>> (i.e. it's like a paragraph descriptor.)
>>> 
>>> Does the period after the 1 matter? The number of sentences after the list
>>> header? The fact that it's all on one line? Anything else?
>>> 
>>> Tim
>>> 
> 
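
The convention proposed in this thread (break list headers such as "3.", "A ", or "#5" off as their own segment, then segment the rest of the line) could be sketched roughly as follows. This is a minimal Python illustration only, not the cTAKES sentence detector, which is a trained model. The names LIST_HEADER, split_list_header, and segment_line, and both regular expressions, are assumptions made up for this example.

```python
import re

# Hypothetical pattern for list headers at the start of a line: "#5", "3.", or a
# bare capital letter followed by whitespace ("A "). Illustrative only; this is
# not how cTAKES detects list headers.
LIST_HEADER = re.compile(r"^\s*(#\d+|\d+\.|[A-Z])\s+(?=\S)")


def split_list_header(line):
    """Return (header, remainder); header is None when no list marker is found."""
    match = LIST_HEADER.match(line)
    if match is None:
        return None, line.strip()
    return match.group(1), line[match.end():].strip()


def segment_line(line):
    """Emit the list header as its own segment, then naively split the remainder."""
    header, rest = split_list_header(line)
    segments = [header] if header else []
    # Naive rule (an assumption): break after a period that is followed by
    # whitespace and a capital letter, so "p.o. daily." stays in one piece.
    segments.extend(s for s in re.split(r"(?<=\.)\s+(?=[A-Z])", rest) if s)
    return segments


if __name__ == "__main__":
    for example in (
        "#2 Adenocarcinoma",
        "3. Magnesium oxide 400 mg p.o. daily.",
        "1. EGD as a complex procedure. If there is an abnormality, obtain biopsies.",
    ):
        print(segment_line(example))
```

Note that the bare-capital case ("A ") is ambiguous for a rule like this: a line that begins with the article "A" would also be treated as a list header, which is one reason a learned model trained on annotated data may be preferable to hand-written patterns.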

