ctakes-dev mailing list archives

From Tim Miller <timothy.mil...@childrens.harvard.edu>
Subject Re: sentence detector newline behavior
Date Thu, 23 Jan 2014 21:06:21 GMT
Just an FYI, a while back I did some of these annotations myself on 
MIMIC to get around this issue. I replaced the newline character with a 
special (non-English) character, then pre-processed ctakes input to 
replace newlines with that character, then did sentence detection, then 
added the newlines back in. I would be happy to share these annotations 
and my code modifications.
Tim
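
The round trip described above might look roughly like the sketch below. It is not the code offered in this message: the marker character, the function names, and the toy punctuation-only splitter standing in for the real sentence detector are all assumptions for illustration. Because the masking swaps one character for one character, the spans found on the masked text can be applied directly to the original text, which is how the newlines come back.

```python
import re

# Assumption: a character that never occurs in the notes. The pilcrow is
# only an illustrative choice, not the character used in the thread.
NEWLINE_MARKER = "\u00b6"


def mask_newlines(text, marker=NEWLINE_MARKER):
    """Swap every CR and LF for the marker, one character for one character."""
    if marker in text:
        raise ValueError("marker already occurs in the text; pick another")
    return text.replace("\r", marker).replace("\n", marker)


def toy_sentence_spans(masked):
    """Hypothetical stand-in for the real detector: break after . ! or ? only."""
    spans, start = [], 0
    for match in re.finditer(r"[.!?]+", masked):
        spans.append((start, match.end()))
        start = match.end()
    if start < len(masked):
        spans.append((start, len(masked)))
    return spans


def detect_sentences(text):
    """Detect on the masked text, then slice the original so newlines return."""
    masked = mask_newlines(text)
    sentences = (text[s:e].strip() for s, e in toy_sentence_spans(masked))
    return [sentence for sentence in sentences if sentence]


note = "The patient has\nbeen seen.  Pain has been\npersistent."
for sentence in detect_sentences(note):
    print(repr(sentence))
```

In a real pipeline the toy splitter would be replaced by a model trained on text that carries the same marker, so that the model can learn when a marker ends a sentence and when it does not.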


On 01/23/2014 04:01 PM, Karthik Sarma wrote:
> We could possibly add some additional datasets for training. MIMIC data
> does come to mind -- I can't remember off the top of my head if the MIMIC
> dataset has sentences spanning lines or not.
>
>
>
>
>
> --
> Karthik Sarma
> UCLA Medical Scientist Training Program Class of 20??
> Member, UCLA Medical Imaging & Informatics Lab
> Member, CA Delegation to the House of Delegates of the American Medical
> Association
> ksarma@ksarma.com
> gchat: ksarma@gmail.com
> linkedin: www.linkedin.com/in/ksarma
>
>
> On Thu, Jan 23, 2014 at 4:22 AM, vijay garla <vngarla@gmail.com> wrote:
>
>> Just to clarify - with the YTEX branch there are 2 sentence splitters - the
>> original ctakes sentence splitter that splits on newlines, and the ytex sentence
>> splitter that doesn't.  The changes to other components in the ytex branch
>> (dependency parser, assertion) work with both sentence splitters.
>>
>> I think it would be great if the intelligence regarding how to split was in
>> the opennlp model, but this requires training data.  I don't know what the
>> training data is, or if the training data has sentences that cross newline
>> boundaries (if not, retraining won't buy us anything).
>>
>> vijay
>>
>>
>>
>>
>> On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean <
>> Sean.Finan@childrens.harvard.edu> wrote:
>>
>>> On my end it looks like my email was reformatted and some of my -newline-
>>> characters were removed in those last examples ...
>>>
>>> -----Original Message-----
>>> From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
>>> Sent: Wednesday, January 22, 2014 3:42 PM
>>> To: dev@ctakes.apache.org
>>> Subject: RE: sentence detector newline behavior
>>>
>>> Thanks James
>>>
>>>> but then no typical sentence ending punctuation at the end of the line
>>> Gotcha.
>>>
>>>> So simply using Lines would not suffice in those cases because it
>>>> would run together sentences where there are more than one on a line
>>> I was actually thinking about something like a Line using -sentence
>>> breaks- in addition to -newline-.  In other words, a Sentence being what
>>> cTakes detects by ignoring CR/LF, and Lines being those Sentences
>>> subdivided by -newline-.  Perhaps "Line" is a horrible moniker.
>>> Regardless, it doesn't solve the problem of inappropriately missing
>>> punctuation.  I was focused a little more on the difference between
>>> persistent auto-line-wrapping and structured information like lists, where
>>> the first benefits from Sentence and the second from Line.
>>>
>>> "The Patient has
>>>   been prescribed two
>>>   medications."
>>>
>>> "Prescriptions:
>>>    Advil
>>>    Tylenol
>>>    No Aspirin"
>>>
>>>
>>> However, when it comes to the problem that you mention, there is no
>>> benefit to a Line.
>>>
>>> "The patient has been seen six times in the past week.  Pain has been
>>> persistent for ten days Advil and Tylenol have been prescribed"
>>> -- 2 sentences, 3 lines
>>>
>>>
>>> "The patient has been seen six times in the past week.
>>> Pain has been persistent for ten days
>>> Advil and Tylenol have been prescribed"
>>> -- 2 sentences, 3 lines
>>>
>>> "The patient has been seen six times in
>>>   the past week.  Pain has been persistent  for ten days  Advil and Tylenol
>>> have been prescribed"
>>> -- 2 sentences, 5 lines
>>>
>>> Nothing can really be done for the last bit where punctuation is missing.
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
>>> Sent: Wednesday, January 22, 2014 3:07 PM
>>> To: 'dev@ctakes.apache.org'
>>> Subject: RE: sentence detector newline behavior
>>>
>>>
>>> I know there are notes where there are multiple sentences on a line, but
>>> then no typical sentence ending punctuation at the end of the line (or no
>>> punctuation at all at the end of the line). And in those sections, negation
>>> can be important.  So simply using Lines would not suffice in those cases
>>> because it would run together sentences where there are more than one on a
>>> line. And using sentences alone (as found by OpenNLP 1.5) would not suffice
>>> because it would run together sentences from different lines.
>>>
>>> -----Original Message-----
>>> From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
>>> Sent: Wednesday, January 22, 2014 1:33 PM
>>> To: dev@ctakes.apache.org
>>> Subject: RE: sentence detector newline behavior
>>>
>>> Just whistling in the wind here ...
>>>
>>> Perhaps before any changes are made to universally toggle cTakes in one
>>> direction or the other, we can take a poll of when & where
>>> cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed to a
>>> Line (CR/LF delimited PLUS -sentence-).
>>>
>>> If some capabilities like negation detection require -lines- then would it
>>> make more sense to have Sentence ignore -newline- and negation detection
>>> itself split the Sentence into line items?  If an annotator is interested
>>> in list items, each of which may be on a distinct -line-, then it can split
>>> up the Sentence as needed.  I think that James hints that cTakes code
>>> already does this in some places.
>>>
>>> If a good deal of functionality requires -newline- delimited types, would
>>> it make sense to introduce a type Line?  If something uses a structured
>>> list it could iterate through Line types, while something using pure text
>>> could iterate through Sentence types.  This facilitates section-by-section
>>> different behavior, does not require any decision on global defaults, and
>>> makes data selection for training Sentence a non-issue wrt line breaks.
>>> However, it adds to the system and would require a per-use decision
>>> by developers OR a toggle by users (back to the default decision).
>>> Perhaps this has already been tried?
>>>
>>> Sean
>>>
>>>
>>> -----Original Message-----
>>> From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
>>> Sent: Wednesday, January 22, 2014 1:06 PM
>>> To: 'dev@ctakes.apache.org'
>>> Subject: RE: sentence detector newline behavior
>>>
>>> The only rule I know of is that cTAKES (prior to ytex integration) always
>>> forces a sentence break at a newline.
>>> This was because the clinical notes cTAKES originally processed never had
>>> newlines in the middle of a sentence, but did need sentence breaks to occur
>>> at end of sentence for good negation detection on those notes.
>>> I think Guergana earlier mentioned other EMRs also have this need, but it
>>> seems to not be ubiquitous.
>>>
>>> From others' posts, it seems that we could use an option in cTAKES to turn
>>> off this forcing of sentence breaks at newlines (or depending on how you
>>> look at it, an option to turn on the forcing of sentence breaks if we
>>> change the default behavior).
>>>
>>> I think we (cTAKES) need to decide the following:
>>>   - do we want to do this for entire notes, or would it be  worth it to
>>> have it be on a section-by-section basis.
>>>   - what do we make the default behavior - to force or not to force
>>> newlines to be sentence breaks
>>>   - what data (that contains newlines) will we use for training the
>>> sentence detector
>>>
>>> Regardless of those answers, I think OpenNLP support for including
>>> newlines in training data would be valuable for those others who have
>>> sentences that span lines.  And having an option on OpenNLP to always break
>>> at newline would be useful for at least some cTAKES users (and we could
>>> remove the cTAKES code that does that).
>>>
>>> -- James
>>>
>>> -----Original Message-----
>>> From: dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org [mailto:
>>> dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of
>>> Jörn Kottmann
>>> Sent: Tuesday, January 21, 2014 4:29 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: sentence detector newline behavior
>>>
>>> Yes, exactly, OPENNLP-602 is about training a sentence detector model
>>> which can use a new line as an end-of-sentence character.
>>>
>>> In case you have certain rules to split sentences we should have a look at
>>> them. The Sentence Detector could be extended to support a user provided
>>> rule based splitter. If there is an interest in that we could probably get
>>> it into 1.6.0 as well.
>>>
>>> Jörn
>>>
>>> On 01/20/2014 10:02 PM, Chen, Pei wrote:
>>>> I presume Joern was suggesting that if he supports new lines in the
>>>> opennlp SentenceDetector (either part of the trained models or post
>>>> processing with some rules?) cTAKES will be able to use it out of the box
>>>> and we should be able to remove any additional custom logic that we currently
>>>> have- which seems like a good idea.
>>>> [but when to use within cTAKES individual components such as negation
>>>> might be another discussion?] --Pei
>>>>
>>>>> On Jan 20, 2014, at 12:46 PM, "vijay garla" <vngarla@gmail.com> wrote:
>>>>> The sentence detection opennlp model used by ctakes does not split
>>>>> sentences at newlines - there is additional logic in the ctakes
>>>>> sentence splitter that does this (and an alternative impl that
>>>>> doesn't is in the ytex branch). Afaik no retraining / change to the
>>>>> feature representation is necessary.
>>>>>
>>>>> Vj
>>>>>
>>>>>> On Monday, January 20, 2014, Jörn Kottmann <kottmann@gmail.com> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> currently I have quite a bit of time to work on OpenNLP, and would
>>>>>> like to help you out with this issue.
>>>>>>
>>>>>> Here is the follow up issue for this change:
>>>>>> https://issues.apache.org/jira/browse/OPENNLP-602
>>>>>>
>>>>>> I am still trying to figure out what would be the best option to
>>>>>> implement this.
>>>>>> In the training data a user could just use a special tag to identify
>>>>>> the chars.
>>>>>>
>>>>>> Instead of <NEWLINE> it might be better to use <CR> and <LF> to
>>>>>> encode these two chars in the training data. Any thoughts?
>>>>>>
>>>>>> I am planning to release this as part of OpenNLP 1.6.0.
>>>>>>
>>>>>> Thanks,
>>>>>> Jörn
>>>>>>
>>>>>>> On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
>>>>>>>
>>>>>>>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>>>>>>>>
>>>>>>>> That's awesome! It might be worth trying at least. How does the
>>>>>>>> training process change? Previously the training data would be one
>>>>>>>> sentence per line, but with newlines as possible mid-sentence
>>>>>>>> characters that could be trouble, is there a new representation
>>>>>>>> for training data? Or would we have to use the training api?
>>>>>>> Good point, yes that will be a problem with the default training
>>>>>>> format, but it shouldn't be hard to solve. In the format itself we
>>>>>>> could define a new line tag e.g.
>>>>>>> <NEWLINE> to mark new lines.
>>>>>>> As a hack to make it work with 1.5.3 you could instead use a
>>>>>>> special char as a replacement for the new line char.
>>>>>>> When you pass the text down to the sentence detector a simple
>>>>>>> string replace could be used to convert all new line chars to the
>>>>>>> special new line marker char.
>>>>>>>
>>>>>>> If things work out for you performance wise as well we will just
>>>>>>> integrate it properly into OpenNLP for the next release.
>>>>>>>
>>>>>>> Could you produce a sentence detector training file with a new line
>>>>>>> marker char?
>>>>>>>
>>>>>>> You should try to pick a char you can also pass in on a terminal
>>>>>>> otherwise you have to use the API to train the model. The built-in
>>>>>>> cross validation could be used to evaluate the performance.
>>>>>>>
>>>>>>> Jörn
>>>
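
For the 1.5.3 workaround quoted above, a training file in the default one-sentence-per-line format could be produced along the lines of the sketch below. It assumes the gold sentences are already available as strings that may contain mid-sentence newlines. The marker character, the function names, and the file handling are illustrative assumptions, not part of OpenNLP or cTAKES.

```python
# Assumption: a marker that never occurs in the corpus. As noted in the
# thread, a character that can be typed on a terminal is easier to work with.
NEWLINE_MARKER = "\u00b6"


def to_training_lines(sentences, marker=NEWLINE_MARKER):
    """Return one line per sentence, with inner CR and LF replaced by the marker."""
    lines = []
    for sentence in sentences:
        if marker in sentence:
            raise ValueError("marker already occurs in the data; pick another")
        # Strip first so only newlines inside the sentence become markers.
        one_line = sentence.strip().replace("\r", marker).replace("\n", marker)
        if one_line:
            lines.append(one_line)
    return lines


def write_training_file(sentences, path, marker=NEWLINE_MARKER):
    """Write the converted sentences, one per physical line, as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(to_training_lines(sentences, marker)) + "\n")


gold = ["The patient has\nbeen seen six times.", "Pain has been persistent.\n"]
print(to_training_lines(gold))
```

The same replacement has to be applied to the input text at detection time, otherwise the model never sees the marker it was trained on.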

