ctakes-dev mailing list archives

From "Miller, Timothy" <Timothy.Mil...@childrens.harvard.edu>
Subject Re: sentence detector newline behavior
Date Sat, 25 Jan 2014 14:03:16 GMT
I'm running into one issue: it gets tripped up on sentences with
line-ending spaces.  I could easily remove them with a script, but by
default they are in there. It happens when a sentence example ends:


(There is a period, then 2 spaces, then the line feed character.) I am
pretty sure this is the root cause, because when I fix this example to
be .<LF> it gets tripped up in another place instead (with the same
error). The specific error I get is this:

> Exception in thread "main" java.lang.IllegalArgumentException: start index must not be larger than end index: start=8842, end=8839
>     at opennlp.tools.util.Span.<init>(Span.java:47)
>     at opennlp.tools.util.Span.<init>(Span.java:63)
>     at opennlp.tools.sentdetect.SentenceDetectorME.sentPosDetect(SentenceDetectorME.java:244)
>     at opennlp.tools.sentdetect.SentenceDetectorEvaluator.processSample(SentenceDetectorEvaluator.java:56)
>     at opennlp.tools.sentdetect.SentenceDetectorEvaluator.processSample(SentenceDetectorEvaluator.java:1)
>     at opennlp.tools.util.eval.Evaluator.evaluateSample(Evaluator.java:82)
>     at opennlp.tools.util.eval.Evaluator.evaluate(Evaluator.java:109)
>     at opennlp.tools.sentdetect.SDCrossValidator.evaluate(SDCrossValidator.java:130)
>     at opennlp.tools.cmdline.sentdetect.SentenceDetectorCrossValidatorTool.run(SentenceDetectorCrossValidatorTool.java:78)
>     at opennlp.tools.cmdline.CLI.main(CLI.java:214)

I thought I'd let you know, since you might be able to fix it in 2
minutes. If I don't hear from you today, I'll probably take a look
later and try to fix it myself.
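
The cleanup script mentioned above could look roughly like the following sketch. It assumes the problem is limited to spaces or tabs that sit directly before a line break, a literal <CR>/<LF> tag, or the end of the input. The class and method names are made up for illustration and are not part of OpenNLP.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not an OpenNLP class.
final class TrailingSpaceCleaner {

    // One or more spaces/tabs that are immediately followed by a line break,
    // a literal <CR> or <LF> tag, or the end of the input (assumption: these
    // are the only places where line-ending spaces cause trouble).
    private static final Pattern TRAILING =
            Pattern.compile("[ \\t]+(?=\\r?\\n|<CR>|<LF>|\\z)");

    // Returns the text with line-ending spaces and tabs removed;
    // spaces between words are left untouched.
    static String stripTrailingSpaces(String text) {
        return TRAILING.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "Pain has been persistent.  \nAdvil was prescribed.  <LF>";
        System.out.println(stripTrailingSpaces(sample));
    }
}
```

Running this over the training file before cross validation would sidestep the trailing-space case, though it would not fix whatever produces the invalid span inside the detector itself.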

On 01/24/2014 04:14 PM, Jörn Kottmann wrote:
> The changes are now committed.
> To train a model which can recognize new lines the new lines must be encoded
> with the <CR> or <LF> tags (or both).
> The same tags are used to pass in the eos chars to the command line trainer.
> For example:
> SentenceDetectorCrossValidator -lang en -data /home/xyz/eos-cr.all -encoding ISO-8859-15 -eosChars .!?:<LF>
> Tim, it would be nice if you could test this with your annotations.
> Jörn
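
As a rough illustration of the tag encoding described above, the sketch below turns raw carriage returns and line feeds into the <CR> and <LF> tags named in this message, and back again. Only the tag names come from the thread; the class and method names are hypothetical, and the exact layout OpenNLP expects for the rest of the training file is not shown here.

```java
// Hypothetical helper, not an OpenNLP class.
final class NewlineTagEncoder {

    // Replaces each raw carriage return / line feed with its tag so the
    // character can appear inside a training sample.
    static String encode(String text) {
        return text.replace("\r", "<CR>").replace("\n", "<LF>");
    }

    // Reverses encode(). Assumption: the original text did not already
    // contain the literal strings "<CR>" or "<LF>"; if it did, decoding
    // would turn those into line breaks as well.
    static String decode(String text) {
        return text.replace("<CR>", "\r").replace("<LF>", "\n");
    }

    public static void main(String[] args) {
        String note = "Prescriptions:\r\nAdvil\r\nTylenol";
        System.out.println(encode(note));
    }
}
```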
> On 01/23/2014 10:06 PM, Tim Miller wrote:
>> Just an FYI, a while back I did some of these annotations myself on 
>> MIMIC to get around this issue. I replaced the newline character with 
>> a special (non-English) character, then pre-processed ctakes input to 
>> replace newlines with that character, then did sentence detection, 
>> then added the newlines back in. I would be happy to share these 
>> annotations and my code modifications.
>> Tim
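
The placeholder workaround described in this message can be sketched as below. The marker character (a pilcrow) and all names are assumptions chosen for illustration; the thread does not say which character was actually used. Because one character is swapped for exactly one other character, offsets reported on the masked text also apply to the original text.

```java
// Hypothetical helper, not part of cTAKES or OpenNLP.
final class NewlinePlaceholder {

    // Assumed marker: any single character that never occurs in the notes.
    static final char MARKER = '\u00B6';

    // Swaps every line feed for the marker before sentence detection.
    static String mask(String text) {
        if (text.indexOf(MARKER) >= 0) {
            throw new IllegalArgumentException("marker already occurs in text");
        }
        return text.replace('\n', MARKER);
    }

    // Puts the line feeds back after sentence detection.
    static String unmask(String text) {
        return text.replace(MARKER, '\n');
    }

    public static void main(String[] args) {
        String note = "The Patient has\nbeen prescribed two\nmedications.";
        String masked = mask(note);
        // Same length, so sentence offsets need no adjustment.
        System.out.println(masked.length() == note.length());
        System.out.println(unmask(masked).equals(note));
    }
}
```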
>> On 01/23/2014 04:01 PM, Karthik Sarma wrote:
>>> We could possibly add some additional datasets for training. MIMIC data
>>> does come to mind -- I can't remember off the top of my head if the 
>>> dataset has sentences spanning lines or not.
>>> -- 
>>> Karthik Sarma
>>> UCLA Medical Scientist Training Program Class of 20??
>>> Member, UCLA Medical Imaging & Informatics Lab
>>> Member, CA Delegation to the House of Delegates of the American Medical
>>> Association
>>> ksarma@ksarma.com
>>> gchat: ksarma@gmail.com
>>> linkedin: www.linkedin.com/in/ksarma
>>> On Thu, Jan 23, 2014 at 4:22 AM, vijay garla <vngarla@gmail.com> wrote:
>>>> Just to clarify - with the YTEX branch there are 2 sentence
>>>> splitters - the original ctakes sentence splitter that splits on
>>>> newlines, and the ytex sentence splitter that doesn't.  The changes
>>>> to other components in the ytex branch (dependency parser,
>>>> assertion) work with both sentence splitters.
>>>> I think it would be great if the intelligence regarding how to split
>>>> was in the opennlp model, but this requires training data.  I don't
>>>> know what the training data is, or if the training data has sentences
>>>> that cross newline boundaries (if not, it won't buy us anything).
>>>> vijay
>>>> On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:
>>>>> On my end it looks like my email was reformatted and some of my
>>>>> -newline- removed in those last examples ...
>>>>> -----Original Message-----
>>>>> From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
>>>>> Sent: Wednesday, January 22, 2014 3:42 PM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: RE: sentence detector newline behavior
>>>>> Thanks James
>>>>>> but then no typical sentence ending punctuation at the end of the
>>>>>> line
>>>>> Gotcha.
>>>>>> So simply using Lines would not suffice in those cases because it
>>>>>> would run together sentences where there are more than one on a line
>>>>> I was actually thinking about something like a Line using -sentence
>>>>> breaks- in addition to -newline-.  In other words, a Sentence being
>>>>> what cTakes detects by ignoring CR/LF, and Lines being those
>>>>> Sentences subdivided by -newline-.  Perhaps "Line" is a horrible
>>>>> moniker.  Regardless, it doesn't solve the problem of inappropriately
>>>>> missing punctuation.  I was focused a little more on the difference
>>>>> between persistent auto- line wrapping and structured information
>>>>> like lists, where the first benefits from Sentence and the second
>>>>> from Line.
>>>>> "The Patient has
>>>>>   been prescribed two
>>>>>   medications."
>>>>> "Prescriptions:
>>>>>    Advil
>>>>>    Tylenol
>>>>>    No Aspirin"
>>>>> However, when it comes to the problem that you mention, there is no
>>>>> benefit to a Line.
>>>>> "The patient has been seen six times in the past week.  Pain has been
>>>>> persistent for ten days Advil and Tylenol have been prescribed"
>>>>> -- 2 sentences, 3 lines
>>>>> "The patient has been seen six times in the past week.
>>>>> Pain has been persistent for ten days
>>>>> Advil and Tylenol have been prescribed"
>>>>> -- 2 sentences, 3 lines
>>>>> "The patient has been seen six times in
>>>>>   the past week.  Pain has been persistent  for ten days Advil and
>>>>> Tylenol
>>>>> have been prescribed"
>>>>> -- 2 sentences, 5 lines
>>>>> Nothing can really be done for the last bit where punctuation is 
>>>>> missing.
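
The Sentence/Line idea from this message, where a Line is a Sentence subdivided at newlines, could be sketched along these lines. The class and method names are hypothetical and do not correspond to any existing cTAKES type; the sketch only computes character offsets and trims surrounding whitespace so that each span has start < end.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an existing cTAKES annotator.
final class LineSplitter {

    // Splits the sentence span [start, end) of text at CR/LF characters and
    // returns the {start, end} offsets of each non-blank line, with leading
    // and trailing whitespace excluded.
    static List<int[]> splitIntoLines(String text, int start, int end) {
        List<int[]> lines = new ArrayList<>();
        int lineStart = start;
        for (int i = start; i <= end; i++) {
            boolean atBreak = i == end
                    || text.charAt(i) == '\n'
                    || text.charAt(i) == '\r';
            if (!atBreak) {
                continue;
            }
            int s = lineStart;
            int e = i;
            while (s < e && Character.isWhitespace(text.charAt(s))) {
                s++;
            }
            while (e > s && Character.isWhitespace(text.charAt(e - 1))) {
                e--;
            }
            if (s < e) {
                lines.add(new int[] {s, e});
            }
            lineStart = i + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "Prescriptions:\n   Advil\n   Tylenol\n   No Aspirin";
        for (int[] line : splitIntoLines(text, 0, text.length())) {
            System.out.println(text.substring(line[0], line[1]));
        }
    }
}
```

A component that cares about list items could iterate over these line spans, while one that works on running text could keep using the enclosing Sentence span unchanged.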
>>>>> -----Original Message-----
>>>>> From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
>>>>> Sent: Wednesday, January 22, 2014 3:07 PM
>>>>> To: 'dev@ctakes.apache.org'
>>>>> Subject: RE: sentence detector newline behavior
>>>>> I know there are notes where there are multiple sentences on a line,
>>>>> but then no typical sentence ending punctuation at the end of the
>>>>> line (or no punctuation at all at the end of the line). And in those
>>>>> sections, negation can be important.  So simply using Lines would not
>>>>> suffice in those cases because it would run together sentences where
>>>>> there are more than one on a line. And using sentences alone (as
>>>>> found by OpenNLP 1.5) would not suffice because it would run together
>>>>> sentences from different lines.
>>>>> -----Original Message-----
>>>>> From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
>>>>> Sent: Wednesday, January 22, 2014 1:33 PM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: RE: sentence detector newline behavior
>>>>> Just whistling in the wind here ...
>>>>> Perhaps before any changes are made to universally toggle cTakes in
>>>>> one direction or the other, we can take a poll of when & where
>>>>> cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as
>>>>> opposed to a Line (CR/LF delimited PLUS -sentence-).
>>>>> If some capabilities like negation detection require -lines-, then
>>>>> would it make more sense to have Sentence ignore -newline- and
>>>>> negation detection itself split the Sentence into line items?  If an
>>>>> annotator is interested in list items, each of which may be on a
>>>>> distinct -line-, then it can split up the Sentence as needed.  I
>>>>> think that James hints that cTakes code already does this in some
>>>>> places.
>>>>> If a good deal of functionality requires -newline- delimited types,
>>>>> would it make sense to introduce a type Line?  If something uses a
>>>>> structured list it could iterate through Line types, while something
>>>>> using pure text could iterate through Sentence types.  This
>>>>> facilitates section-by-section different behavior, does not require
>>>>> any decision on global defaults, and makes data selection for
>>>>> training Sentence a non-issue wrt line breaks.
>>>>> However, it adds to the system and would require a per-use decision
>>>>> by developers OR a toggle by users (back to the default decision).
>>>>> Perhaps this has already been tried?
>>>>> Sean
>>>>> -----Original Message-----
>>>>> From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
>>>>> Sent: Wednesday, January 22, 2014 1:06 PM
>>>>> To: 'dev@ctakes.apache.org'
>>>>> Subject: RE: sentence detector newline behavior
>>>>> The only rule I know of is that cTAKES (prior to ytex integration)
>>>>> always forces a sentence break at a newline.
>>>>> This was because the clinical notes cTAKES originally processed never
>>>>> had newlines in the middle of a sentence, but did need sentence
>>>>> breaks to occur at end of sentence for good negation detection on
>>>>> those notes.
>>>>> I think Guergana earlier mentioned other EMRs also have this need,
>>>>> but it seems to not be ubiquitous.
>>>>> From others' posts, it seems that we could use an option in cTAKES
>>>>> to turn off this forcing of sentence breaks at newlines (or,
>>>>> depending on how you look at it, an option to turn on the forcing of
>>>>> sentence breaks if we change the default behavior).
>>>>> I think we (cTAKES) need to decide the following:
>>>>>   - do we want to do this for entire notes, or would it be worth it
>>>>>     to have it be on a section-by-section basis
>>>>>   - what do we make the default behavior - to force or not to force
>>>>>     newlines to be sentence breaks
>>>>>   - what data (that contains newlines) will we use for training the
>>>>>     sentence detector
>>>>> Regardless of those answers, I think OpenNLP support for including
>>>>> newlines in training data would be valuable for those others who have
>>>>> sentences that span lines.  And having an option on OpenNLP to always
>>>>> break at newline would be useful for at least some cTAKES users (and
>>>>> we could remove the cTAKES code that does that).
>>>>> -- James
>>>>> -----Original Message-----
>>>>> From: dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org [mailto:
>>>>> dev-return-2390-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of
>>>>> Jörn Kottmann
>>>>> Sent: Tuesday, January 21, 2014 4:29 AM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: Re: sentence detector newline behavior
>>>>> Yes, exactly, OPENNLP-602 is about training a sentence detector model
>>>>> which can use a new line as an end-of-sentence character.
>>>>> In case you have certain rules to split sentences we should have a
>>>>> look at them. The Sentence Detector could be extended to support a
>>>>> user-provided rule-based splitter. If there is an interest in that we
>>>>> could probably get it into 1.6.0 as well.
>>>>> Jörn
>>>>> On 01/20/2014 10:02 PM, Chen, Pei wrote:
>>>>>> I presume Joern was suggesting that if he supports new lines in the
>>>>>> opennlp SentenceDetector (either part of the trained models or post
>>>>>> processing with some rules?) cTAKES will be able to use it out of
>>>>>> the box and we should be able to remove any additional custom logic
>>>>>> that we currently have - which seems like a good idea.
>>>>>> [but when to use within cTAKES individual components such as negation
>>>>>> might be another discussion?] --Pei
>>>>>>> On Jan 20, 2014, at 12:46 PM, "vijay garla" <vngarla@gmail.com> wrote:
>>>>>>> The sentence detection opennlp model used by ctakes does not split
>>>>>>> sentences at newlines - there is additional logic in the ctakes
>>>>>>> sentence splitter that does this (and an alternative impl that
>>>>>>> doesn't is in the ytex branch). Afaik no retraining / change to the
>>>>>>> feature representation is necessary.
>>>>>>> Vj
>>>>>>>> On Monday, January 20, 2014, Jörn Kottmann <kottmann@gmail.com> wrote:
>>>>>>>> Hi all,
>>>>>>>> currently I have quite a bit of time to work on OpenNLP, and would
>>>>>>>> like to help you out with this issue.
>>>>>>>> Here is the follow up issue for this change:
>>>>>>>> https://issues.apache.org/jira/browse/OPENNLP-602
>>>>>>>> I am still trying to figure out what would be the best option to
>>>>>>>> implement this.
>>>>>>>> In the training data a user could just use a special tag to
>>>>>>>> identify the chars.
>>>>>>>> Instead of <NEWLINE> it might be better to use <CR> and <LF> to
>>>>>>>> encode these two chars in the training data. Any thoughts?
>>>>>>>> I am planning to release this as part of OpenNLP 1.6.0.
>>>>>>>> Thanks,
>>>>>>>> Jörn
>>>>>>>>> On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
>>>>>>>>>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>>>>>>>>>> That's awesome! It might be worth trying at least. How does the
>>>>>>>>>> training process change? Previously the training data would be
>>>>>>>>>> one sentence per line, but with newlines as possible
>>>>>>>>>> characters that could be trouble, is there a new format
>>>>>>>>>> for training data? Or would we have to use the training
>>>>>>>>> Good point, yes that will be a problem with the default
>>>>>>>>> format, but it shouldn't be hard to solve. In the format itself we
>>>>>>>>> could define a new line tag, e.g. <NEWLINE>, to mark new lines.
>>>>>>>>> As a hack to make it work with 1.5.3 you could instead use a
>>>>>>>>> special char as a replacement for the new line char.
>>>>>>>>> When you pass the text down to the sentence detector a simple
>>>>>>>>> string replace could be used to convert all new line chars to the
>>>>>>>>> special new line marker char.
>>>>>>>>> If things work out for you performance wise as well we will just
>>>>>>>>> integrate it properly into OpenNLP for the next release.
>>>>>>>>> Could you produce a sentence detector training file with a new
>>>>>>>>> line marker char?
>>>>>>>>> You should try to pick a char you can also pass in on a terminal,
>>>>>>>>> otherwise you have to use the API to train the model. The built-in
>>>>>>>>> cross validation could be used to evaluate the performance.
>>>>>>>>> Jörn
