Mailing-List: contact dev-help@ctakes.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@ctakes.apache.org
MIME-Version: 1.0
Date: Mon, 27 Jan 2014 18:03:21 -0500
Message-ID: 
 <CADGOtThESt_h7+HokTa+YoFDLUrodyR6F73gRQW2cac=U8dLBA@mail.gmail.com>
Subject: Re: sentence detector newline behavior
From: vijay garla <vngarla@gmail.com>
To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
Content-Type: multipart/alternative; boundary=f46d043bdec6d8103f04f0fbb625

--f46d043bdec6d8103f04f0fbb625
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

For clarity, I'd like to stress that the opennlp sentence model distributed
with ctakes today does 'work' with sentences that span newlines - as I
understand it, this model ignores newline tokens (or newlines are not
provided as features to that model).

I believe the improvements Tim and others are suggesting are for a new
sentence model + feature representation that takes advantage of newlines as
features.

Whatever we do, I believe we need backwards compatibility - those who are
using the current sentence model may need to continue using it.  To that
end:
* If we upgrade to the newest version of opennlp, will the old model work
(and produce the same results)?
* If a contributor trains a new model that uses a different feature
representation, I believe that should go into a new Sentence Detector
AnalysisEngine (or the same AE but with different configuration
parameters), so users have a choice between the old and the new.
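
A rough sketch of the "same AE, different configuration parameters" option,
assuming a single string parameter selects the model. The parameter name
`SentenceModelMode` and the `Mode` values are hypothetical, not existing
cTAKES settings. An absent parameter falls back to the legacy behavior, so
existing pipelines would be unaffected:

```java
import java.util.Locale;
import java.util.Map;

final class SentenceDetectorChoice {
    // Hypothetical modes: the current model vs. a future newline-aware one.
    enum Mode { LEGACY, NEWLINE_AWARE }

    // Hypothetical parameter name; not an existing cTAKES configuration parameter.
    static final String PARAM_MODE = "SentenceModelMode";

    // Returns LEGACY when the parameter is absent, preserving old behavior.
    static Mode resolve(Map<String, String> params) {
        String value = params.get(PARAM_MODE);
        if (value == null) {
            return Mode.LEGACY;
        }
        return Mode.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}
```

Defaulting to the old model means a user has to opt in to the new feature
representation explicitly.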

-vj


On Mon, Jan 27, 2014 at 1:09 PM, digital paula <cybersation@hotmail.com> wrote:

>
>
>
> Tim,
>
> I just had to chime in on a comment you made.    My deadline has been
> extended a bit on my pressing issue but I do intend to get back to testing
> per VJ's fix or maybe another fix is in the works based on latest
> emails...I need to read them again since a lot has been stated on the issue.
>
> Okay, as a new user (working w/cTAKES since October) I have never thought
> what you had stated:
>
>  "And I think this is the kind of thing that can leave new users
> scratching their heads and doubting our overall competence."
>
> Yeah, the sentence-spanning-newline issue was a problem so I just brought
> attention to it by my post of inquiry earlier this month on VJ's fix from
> last month and worked around it with treating narrative as one string.
>
> Anyone who's looked at the code would appreciate and acknowledge that
> cTAKES is a powerful and complex application.  I'm overall impressed with
> it and I intend to continue to use it, improve it, and grow with it.  I've
> been delving deeper into cTAKES on the machine learning aspect...I'm
> struggling a bit with it and if anything I scratch my head and doubt my
> competence. ;-)
>
> Regards,
> Paula
>
> > Date: Mon, 27 Jan 2014 09:52:00 -0500
> > From: timothy.miller@childrens.harvard.edu
> > To: dev@ctakes.apache.org
> > Subject: Re: sentence detector newline behavior
> >
> > OK, with the most recent version I am able to replicate the performance
> > I was getting before. Thanks a lot Jörn!
> >
> > Assuming this is in the next incremental release of opennlp, how quickly
> > can we get a re-trained model into cTAKES? I heard from a researcher at
> > AMIA who tried cTAKES and because of this bug in the way we handle
> > sentences was trying to find an outside sentence detector as a
> > preprocess to cTAKES, and frankly that is insane. We should be able to
> > get something this simple right. And I think this is the kind of thing
> > that can leave new users scratching their heads and doubting our overall
> > competence.
> >
> > James, I believe you are usually the one who rebuilds the models? What
> > would be the best way to incorporate the data I have that has some
> > instances of non-sentence terminating newlines?
> >
> > Tim
> >
> >
> > On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
> > > On 01/26/2014 11:29 PM, Miller, Timothy wrote:
> > >> Yes, this fixes the whitespace sentence issue but the evaluation issue
> > >> remains. I believe the problem is in SentenceSampleStream, where in the
> > >> following block the whitespace trim happens before the <LF> character is
> > >> replaced with the \n character. So test sentences that ended with <LF>
> > >> will be one character longer than they should be.
> > >>
> > >>> >       sentence = sentence.trim();
> > >>> >       sentence = replaceNewLineEscapeTags(sentence);
> > >>> >       sentencesString.append(sentence);
> > >>> >       int end = sentencesString.length();
> > >>> >       sentenceSpans.add(new Span(begin, end));
> > >>> >       sentencesString.append(' ');
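
To make the off-by-one in the quoted block concrete, here is a minimal
sketch of the two orderings. `unescape` is a hypothetical stand-in for
replaceNewLineEscapeTags and is assumed only to map the literal tag `<LF>`
to a newline; it is not the OpenNLP implementation:

```java
final class TrimOrderDemo {
    // Hypothetical stand-in: replaces the literal "<LF>" tag with '\n'.
    static String unescape(String s) {
        return s.replace("<LF>", "\n");
    }

    // Same order as the quoted block: trim first, then unescape.
    // The tag is not whitespace yet, so the resulting '\n' survives.
    static String trimThenUnescape(String s) {
        return unescape(s.trim());
    }

    // Reversed order: the '\n' exists before trim() runs and is stripped.
    static String unescapeThenTrim(String s) {
        return unescape(s).trim();
    }

    public static void main(String[] args) {
        String raw = "Patient denies pain.<LF>";
        System.out.println(trimThenUnescape(raw).length()); // prints 21
        System.out.println(unescapeThenTrim(raw).length()); // prints 20
    }
}
```

With trim-then-unescape, the span end computed from the string length is
one character past the sentence text, which matches the discrepancy
described above.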
> > >
> > > Yes, that must be the issue. During training the new line is included
> > > in the span, and during detection the white space remover creates a
> > > span without the new line char.
> > >
> > > I suggest that the evaluator just ignores white space differences
> > > between sentences. My test case then
> > > has the expected performance numbers.
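
One way an evaluator could ignore whitespace differences, sketched under
the assumption that spans are half-open [begin, end) character offsets into
the same text. The class and method names are hypothetical and this is not
the change that was committed to OpenNLP:

```java
import java.util.Arrays;

final class WhitespaceInsensitiveSpans {
    // Shrinks [begin, end) so it neither starts nor ends on whitespace.
    static int[] trimSpan(CharSequence text, int begin, int end) {
        while (begin < end && Character.isWhitespace(text.charAt(begin))) {
            begin++;
        }
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {begin, end};
    }

    // Two spans match if they cover the same non-whitespace-bounded region.
    static boolean sameIgnoringWhitespace(CharSequence text,
                                          int begin1, int end1,
                                          int begin2, int end2) {
        return Arrays.equals(trimSpan(text, begin1, end1),
                             trimSpan(text, begin2, end2));
    }
}
```

For the text "First sentence.\nSecond one.", a reference span [0, 16) that
includes the newline and a predicted span [0, 15) that excludes it would be
counted as the same sentence.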
> > >
> > > What do you think?
> > >
> > > Anyway, I committed the change. Please give it a try.
> > >
> > > Jörn
> >
>
>
>

--f46d043bdec6d8103f04f0fbb625--