ctakes-dev mailing list archives

From "Masanz, James J." <Masanz.Ja...@mayo.edu>
Subject RE: sentence detector newline behavior
Date Mon, 27 Jan 2014 19:35:15 GMT

Tim, is the training data something you can share publicly? Or privately? I can't publicly
share the data that has been used to train the sentence detector; I can only share the models
that get built. And you can't build a model from an existing model + more data; you need all
the training data together.

Regarding how quickly we can get this out there, I can train a new sentence detector in a
day or two. But that's just the first step. To really incorporate this, I would suggest this
be a point release. We would need a release manager for that, and right now I don't have time
for that. I also haven't heard a consensus on whether this should be the new behavior.

From what I remember, we are going to need code changes to make optional the code that
splits at line breaks. Or was your test replacing the existing cTAKES sentence detector and
just using OpenNLP directly?

-- James

-----Original Message-----
From: Tim Miller [mailto:timothy.miller@childrens.harvard.edu] 
Sent: Monday, January 27, 2014 8:52 AM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

OK, with the most recent version I am able to replicate the performance 
I was getting before. Thanks a lot Jörn!

Assuming this is in the next incremental release of opennlp, how quickly 
can we get a re-trained model into cTAKES? I heard from a researcher at 
AMIA who tried cTAKES and, because of this bug in the way we handle 
sentences, was trying to find an outside sentence detector as a 
preprocessing step for cTAKES, and frankly that is insane. We should be able to 
get something this simple right. And I think this is the kind of thing 
that can leave new users scratching their heads and doubting our overall 
competence.

James, I believe you are usually the one who rebuilds the models? What 
would be the best way to incorporate the data I have that has some 
instances of non-sentence terminating newlines?

Tim


On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
> On 01/26/2014 11:29 PM, Miller, Timothy wrote:
>> Yes, this fixes the whitespace sentence issue but the evaluation issue
>> remains. I believe the problem is in SentenceSampleStream, where in the
>> following block the whitespace trim happens before the <LF> character is
>> replaced with the \n character. So test sentences that ended with <LF>
>> will be one character longer than they should be.
>>
>>> >       sentence = sentence.trim();
>>> >       sentence = replaceNewLineEscapeTags(sentence);
>>> >       sentencesString.append(sentence);
>>> >       int end = sentencesString.length();
>>> >       sentenceSpans.add(new Span(begin, end));
>>> >       sentencesString.append(' ');
>
> Yes, that must be the issue. During training the newline is included
> in the span, and during detection the whitespace remover creates a
> span without the newline char.
>
> I suggest that the evaluator just ignore whitespace differences
> between sentences. With that change, my test case has the expected
> performance numbers.
>
> What do you think?
>
> Anyway, I committed the change. Please give it a try.
>
> Jörn
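
To make the exchange above concrete, here is a minimal, self-contained sketch of the two points discussed: the order of trim versus <LF> replacement, and an evaluator-style comparison that ignores whitespace at span edges. It does not use OpenNLP. The class name NewlineSpanSketch and all method names are hypothetical, spans are plain {start, end} int pairs rather than OpenNLP Span objects, and the unescape step is assumed to do nothing more than replace "<LF>" with "\n". It is not the change that was committed.

```java
import java.util.Arrays;

public class NewlineSpanSketch {

    // Assumed escape tag for a newline embedded in the training data.
    static final String LF_TAG = "<LF>";

    // Order quoted in the thread: trim first, then unescape.
    // The trailing <LF> survives the trim and becomes a trailing '\n'.
    static String trimThenUnescape(String sentence) {
        return sentence.trim().replace(LF_TAG, "\n");
    }

    // Alternative order: unescape first, so trim() also strips the trailing '\n'.
    static String unescapeThenTrim(String sentence) {
        return sentence.replace(LF_TAG, "\n").trim();
    }

    // Shrinks the half-open range [start, end) until neither edge is whitespace.
    static int[] trimSpan(CharSequence text, int start, int end) {
        while (start < end && Character.isWhitespace(text.charAt(start))) {
            start++;
        }
        while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {start, end};
    }

    // Evaluator-style check: two spans match if they agree after edge whitespace is dropped.
    static boolean sameIgnoringWhitespace(CharSequence text, int[] a, int[] b) {
        return Arrays.equals(trimSpan(text, a[0], a[1]), trimSpan(text, b[0], b[1]));
    }

    public static void main(String[] args) {
        String sample = "Patient seen today.<LF>";
        System.out.println(trimThenUnescape(sample).length()); // 20: newline kept
        System.out.println(unescapeThenTrim(sample).length()); // 19: newline stripped

        String doc = "Patient seen today.\nNext line.";
        int[] reference = {0, 20}; // span that includes the newline
        int[] predicted = {0, 19}; // span without the newline
        System.out.println(sameIgnoringWhitespace(doc, reference, predicted)); // true
    }
}
```

The one-character difference between the two orderings is the off-by-one Tim describes. The whitespace-insensitive comparison corresponds to Jörn's suggestion and still rejects spans that differ in non-whitespace content.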

