Mailing-List: contact dev-help@ctakes.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@ctakes.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Message-Id: <d2fb82$87p735@ironport10.mayo.edu>
Date: Mon, 27 Jan 2014 19:53:15 +0000
From: "Masanz, James J." <Masanz.James@mayo.edu>
Subject: RE: sentence detector newline behavior
In-reply-to: <52E6B739.3030308@childrens.harvard.edu>
To: "'dev@ctakes.apache.org'" <dev@ctakes.apache.org>
MIME-version: 1.0
Content-type: text/plain; charset=iso-8859-1
Content-language: en-US
Content-transfer-encoding: quoted-printable
Accept-Language: en-US
Thread-topic: sentence detector newline behavior
Thread-index: Ac5WG7UABEU57a2lTXuxbf1ulBTIZDFZukmAAAfAYwAAAv9O8AAHO9+AAAx8M5A=
References: <E084D8EFE2B03A408B324458C5212E9421157B32@CHEXMBX3A.CHBOSTON.ORG>
 <52DE4BFD.803@gmail.com> <d2fb82$86rb39@ironport10.mayo.edu>
 <393252F14C42F946952F1ED75D316CAD3865AE43@CHEXMBX2A.CHBOSTON.ORG>
 <d2fb82$86ssqn@ironport10.mayo.edu>
 <393252F14C42F946952F1ED75D316CAD3865AE84@CHEXMBX2A.CHBOSTON.ORG>
 <393252F14C42F946952F1ED75D316CAD3865BE97@CHEXMBX2A.CHBOSTON.ORG>
 <CADGOtTiSN0XQoBmYwQ7KSbGBSTvbghSrNJm7eXMggraB6fSd_g@mail.gmail.com>
 <CAOf_dRkSOBWnnHGXXr4h_=r+Jsbsb9J4PsOQUuoDq7eAAGpn_A@mail.gmail.com>
 <52E1844D.3010507@childrens.harvard.edu> <52E2D79C.60101@gmail.com>
 <E084D8EFE2B03A408B324458C5212E94212EF4D9@CHEXMBX3A.CHBOSTON.ORG>
 <52E3F32B.8090604@gmail.com>
 <E084D8EFE2B03A408B324458C5212E94212F033E@CHEXMBX3A.CHBOSTON.ORG>
 <52E5227F.1000506@gmail.com>
 <E084D8EFE2B03A408B324458C5212E9424253E77@CHEXMBX3B.CHBOSTON.ORG>
 <52E63E8B.9080906@gmail.com> <52E67290.4080105@childrens.harvard.edu>
 <d2fb82$87ov9v@ironport10.mayo.edu> <52E6B739.3030308@childrens.harvard.edu>

I didn't write the cTAKES sentence detector, so I can't answer definitively,
but I do know it was originally written using what is now a pretty old
version of OpenNLP and needed some things you couldn't get from the
out-of-the-box OpenNLP at the time. From what I remember, the things specific
to it were
- the list of end-of-sentence candidate characters
- and the handling of newlines

-- James

-----Original Message-----
From: Tim Miller [mailto:timothy.miller@childrens.harvard.edu]
Sent: Monday, January 27, 2014 1:45 PM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior


On 01/27/2014 02:35 PM, Masanz, James J. wrote:
> Tim, is the training data something you can share publicly? Or privately?
> I can't publicly share the data that has been used to train the sentence
> detector, I can only share the models that get built. And you can't build a
> model from an existing model + more data, you need all the training data
> together.

It is from the MIMIC corpus which I definitely can't share publicly, but
it's worth looking into whether I could share it privately with another
person who has a signed data use agreement.

> Regarding how quickly we can get this out there, I can train a new sentence
> detector in a day or two. But that's just the first step - to really
> incorporate this, I would suggest this be a point release. We would need a
> release manager for that. Right now I don't have time for that. I haven't
> heard a consensus saying whether this should be the new behavior.
Yeah I suppose this is subject to the scale of the changes we make.
> From what I remember we are going to need code changes to make optional
> the code that splits at line breaks, or was your test replacing the existing
> cTAKES sentence detector and just using OpenNLP directly?

That is a good point, and something I was wondering about. Having now
looked at both the ctakes and opennlp code for the sentence splitter it
seems like there is a lot of overlap. I would've thought it was just a
matter of converting annotations into our type system. So I'm curious if
there is some justification for why there seems to be duplication (or if
I'm hallucinating it).

Tim


>
> -- James
>
> -----Original Message-----
> From: Tim Miller [mailto:timothy.miller@childrens.harvard.edu]
> Sent: Monday, January 27, 2014 8:52 AM
> To: dev@ctakes.apache.org
> Subject: Re: sentence detector newline behavior
>
> OK, with the most recent version I am able to replicate the performance
> I was getting before. Thanks a lot Jörn!
>
> Assuming this is in the next incremental release of opennlp, how quickly
> can we get a re-trained model into cTAKES? I heard from a researcher at
> AMIA who tried cTAKES and because of this bug in the way we handle
> sentences was trying to find an outside sentence detector as a
> preprocess to cTAKES, and frankly that is insane. We should be able to
> get something this simple right. And I think this is the kind of thing
> that can leave new users scratching their heads and doubting our overall
> competence.
>
> James, I believe you are usually the one who rebuilds the models? What
> would be the best way to incorporate the data I have that has some
> instances of non-sentence terminating newlines?
>
> Tim
>
>
> On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
>> On 01/26/2014 11:29 PM, Miller, Timothy wrote:
>>> Yes, this fixes the whitespace sentence issue but the evaluation issue
>>> remains. I believe the problem is in SentenceSampleStream, where in the
>>> following block the whitespace trim happens before the <LF> character is
>>> replaced with the \n character. So test sentences that ended with <LF>
>>> will be one character longer than they should be.
>>>
>>>>>        sentence = sentence.trim();
>>>>>        sentence = replaceNewLineEscapeTags(sentence);
>>>>>        sentencesString.append(sentence);
>>>>>        int end = sentencesString.length();
>>>>>        sentenceSpans.add(new Span(begin, end));
>>>>>        sentencesString.append(' ');
>> Yes, that must be the issue. During training the new line is included
>> in the span, and during
>> detection the white space remover creates a span without the new line
>> char.
>>
>> I suggest that the evaluator just ignores white space differences
>> between sentences. My test case then
>> has the expected performance numbers.
>>
>> What do you think?
>>
>> Anyway, I committed the change. Please give it a try.
>>
>> Jörn
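
For reference, the ordering problem and the evaluator suggestion discussed in
the quoted thread can be sketched roughly as follows. This is a standalone
illustration, not the OpenNLP source: the class NewlineTrimSketch and all of
its methods are hypothetical, the local replaceNewLineEscapeTags is a stand-in
that assumes the escape tags are <LF> and <CR>, and spans are passed as plain
int offsets rather than OpenNLP Span objects.

```java
// Hypothetical sketch; none of these names come from OpenNLP or cTAKES.
final class NewlineTrimSketch {

    // Stand-in for the replaceNewLineEscapeTags call quoted in the thread.
    // Assumption: <LF> maps to '\n' and <CR> maps to '\r'.
    static String replaceNewLineEscapeTags(String s) {
        return s.replace("<LF>", "\n").replace("<CR>", "\r");
    }

    // Order described in the thread: trim first, then unescape.
    // A trailing "<LF>" is not whitespace yet, so it survives the trim
    // and ends up as a trailing newline inside the sentence.
    static String trimThenUnescape(String sentence) {
        return replaceNewLineEscapeTags(sentence.trim());
    }

    // Alternative order: unescape first, then trim, so the newline
    // produced from a trailing "<LF>" is removed with other whitespace.
    static String unescapeThenTrim(String sentence) {
        return replaceNewLineEscapeTags(sentence).trim();
    }

    // Shrinks the half-open span [begin, end) so that it excludes
    // leading and trailing whitespace in text.
    static int[] trimSpan(String text, int begin, int end) {
        while (begin < end && Character.isWhitespace(text.charAt(begin))) {
            begin++;
        }
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {begin, end};
    }

    // One way an evaluator could ignore whitespace differences:
    // compare the two spans only after trimming both.
    static boolean sameIgnoringWhitespace(String text,
                                          int begin1, int end1,
                                          int begin2, int end2) {
        int[] a = trimSpan(text, begin1, end1);
        int[] b = trimSpan(text, begin2, end2);
        return a[0] == b[0] && a[1] == b[1];
    }
}
```

With the input "Sentence one.<LF>", trimThenUnescape keeps the trailing
newline while unescapeThenTrim drops it, which is the one-character length
difference described above. sameIgnoringWhitespace treats a reference span
that includes the newline and a predicted span that stops before it as the
same sentence.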