ctakes-dev mailing list archives

From Tim Miller <timothy.mil...@childrens.harvard.edu>
Subject Re: sentence detector newline behavior
Date Mon, 27 Jan 2014 19:44:57 GMT

On 01/27/2014 02:35 PM, Masanz, James J. wrote:
> Tim, is the training data something you can share publicly? Or privately? I can't publicly
> share the data that has been used to train the sentence detector, I can only share the models
> that get built. And you can't build a model from an existing model + more data, you need all
> the training data together.

It is from the MIMIC corpus which I definitely can't share publicly, but 
it's worth looking into whether I could share it privately with another 
person who has a signed data use agreement.

> Regarding how quickly we can get this out there, I can train a new sentence detector
> in a day or two. But that's just the first step - to really incorporate this, I would suggest
> this be a point release. We would need a release manager for that. Right now I don't have
> time for that. I haven't heard a consensus saying whether this should be the new behavior.

Yeah, I suppose this is subject to the scale of the changes we make.

> From what I remember we are going to need code changes to make optional the code that
> splits at line breaks, or was your test replacing the existing cTAKES sentence detector and
> just using OpenNLP directly?

That is a good point, and something I was wondering about. Having now 
looked at both the ctakes and opennlp code for the sentence splitter it 
seems like there is a lot of overlap. I would've thought it was just a 
matter of converting annotations into our type system. So I'm curious if 
there is some justification for why there seems to be duplication (or if 
I'm hallucinating it).


> -- James
> -----Original Message-----
> From: Tim Miller [mailto:timothy.miller@childrens.harvard.edu]
> Sent: Monday, January 27, 2014 8:52 AM
> To: dev@ctakes.apache.org
> Subject: Re: sentence detector newline behavior
> OK, with the most recent version I am able to replicate the performance
> I was getting before. Thanks a lot Jörn!
> Assuming this is in the next incremental release of opennlp, how quickly
> can we get a re-trained model into cTAKES? I heard from a researcher at
> AMIA who tried cTAKES and because of this bug in the way we handle
> sentences was trying to find an outside sentence detector as a
> preprocess to cTAKES, and frankly that is insane. We should be able to
> get something this simple right. And I think this is the kind of thing
> that can leave new users scratching their heads and doubting our overall
> competence.
> James, I believe you are usually the one who rebuilds the models? What
> would be the best way to incorporate the data I have that has some
> instances of non-sentence terminating newlines?
> Tim
> On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
>> On 01/26/2014 11:29 PM, Miller, Timothy wrote:
>>> Yes, this fixes the whitespace sentence issue but the evaluation issue
>>> remains. I believe the problem is in SentenceSampleStream, where in the
>>> following block the whitespace trim happens before the <LF> character is
>>> replaced with the \n character. So test sentences that ended with <LF>
>>> will be one character longer than they should be.
>>>>>        sentence = sentence.trim();
>>>>>        sentence = replaceNewLineEscapeTags(sentence);
>>>>>        sentencesString.append(sentence);
>>>>>        int end = sentencesString.length();
>>>>>        sentenceSpans.add(new Span(begin, end));
>>>>>        sentencesString.append(' ');
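
To make the off-by-one described above concrete, here is a minimal standalone sketch. It does not use OpenNLP; `replaceNewLineEscapeTags` below is a hypothetical stand-in that only handles `<LF>`, and the class and method names are made up for illustration. It only shows why trimming before the tag is expanded leaves the sentence one character longer.

```java
public class TrimOrderDemo {

    // Hypothetical stand-in for the helper named in the thread;
    // assumption: it turns the "<LF>" escape tag into a real newline.
    static String replaceNewLineEscapeTags(String s) {
        return s.replace("<LF>", "\n");
    }

    // Order quoted in the thread: trim first, then expand the tag.
    // trim() cannot strip the newline because it is still spelled "<LF>".
    static String trimThenReplace(String sentence) {
        return replaceNewLineEscapeTags(sentence.trim());
    }

    // Reversed order, for comparison only: expand first so trim()
    // sees the real newline and removes it.
    static String replaceThenTrim(String sentence) {
        return replaceNewLineEscapeTags(sentence).trim();
    }

    public static void main(String[] args) {
        String raw = "Patient is stable.<LF>";
        System.out.println(trimThenReplace(raw).length()); // 19, newline kept
        System.out.println(replaceThenTrim(raw).length()); // 18, newline trimmed
    }
}
```

Whether the newline should be part of the span is a separate question; the sketch only demonstrates the length difference, not which behavior is correct.
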
>> Yes, that must be the issue. During training the new line is included
>> in the span, and during
>> detection the white space remover creates a span without the new line
>> char.
>> I suggest that the evaluator just ignores white space differences
>> between sentences. My test case then
>> has the expected performance numbers.
>> What do you think?
>> Anyway, I committed the change. Please give it a try.
>> Jörn
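
One way to read the suggestion that the evaluator ignore whitespace differences is sketched below. This is not the committed OpenNLP change; spans are plain `int[] {begin, end}` pairs and all names are hypothetical. Each span is shrunk past leading and trailing whitespace before the offsets are compared, so a reference span that includes the trailing newline still matches a predicted span that does not.

```java
public class WhitespaceTolerantSpans {

    // Shrinks the half-open range [begin, end) so that it excludes
    // leading and trailing whitespace in the covered text.
    static int[] trimSpan(CharSequence text, int begin, int end) {
        while (begin < end && Character.isWhitespace(text.charAt(begin))) {
            begin++;
        }
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {begin, end};
    }

    // Two spans count as equal if they cover the same non-whitespace extent.
    static boolean sameIgnoringWhitespace(CharSequence text, int[] a, int[] b) {
        int[] ta = trimSpan(text, a[0], a[1]);
        int[] tb = trimSpan(text, b[0], b[1]);
        return ta[0] == tb[0] && ta[1] == tb[1];
    }

    public static void main(String[] args) {
        String doc = "First sentence.\n Second one.";
        int[] reference = {0, 16}; // includes the trailing '\n'
        int[] predicted = {0, 15}; // newline already stripped
        System.out.println(sameIgnoringWhitespace(doc, reference, predicted)); // true
    }
}
```
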
