Mailing-List: contact dev-help@ctakes.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@ctakes.apache.org
Received-SPF: pass (athena.apache.org: domain of cybersation@hotmail.com
 designates 65.54.51.98 as permitted sender)
Message-ID: <SNT148-W2632751EA88CE8BBA37F4FAEBB0@phx.gbl>
Content-Type: multipart/alternative;
	boundary="_df16c7f5-3337-46a5-b864-1dc57299d19d_"
From: digital paula <cybersation@hotmail.com>
To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
Subject: RE: sentence splitter & forks/branches
Date: Fri, 17 Jan 2014 22:45:44 -0500
Importance: Normal
In-Reply-To: 
 <CADGOtThfNUh7cJvEARHf-nCoom_6TnNu1jybL552_xBD-PzUUg@mail.gmail.com>
References: 
 <5652E5352040D7429DEF7AAB8560EF041884F757@MCEXMB1.chmccorp.cchmc.org>,<CADGOtThfNUh7cJvEARHf-nCoom_6TnNu1jybL552_xBD-PzUUg@mail.gmail.com>
MIME-Version: 1.0

--_df16c7f5-3337-46a5-b864-1dc57299d19d_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Hello again cTAKES Community,

I thought that adding the sentence splitter (with newline
sentence-continuation recognition) would be as simple as adding the
sectionizer annotator to the Eclipse environment was. Per VJ's note,
it is not that simple: my understanding is that the standard clinical
pipeline requires the assertion and dependency parser modules. I've
explored some of the changes needed. At least for assertion, it looks
like SentenceDetector, SentenceSpan, and likely the
SingleDocumentProcessor from the MITRE jar will need to be modified to
recognize multi-line sentences, so that the assertion and dependency
parsers can be kept in the pipeline.

I would love to devote the time needed to fix the sentence splitter so
it recognizes multi-line sentences, but I need to focus on hacking my
way through the cue word issue because I've been left in the lurch
with no response to my posts  :-(((((
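
For anyone following the thread, here is a minimal Java sketch of why
the newline split matters for negation. It is not cTAKES code: the
class NewlineSplitSketch, both method names, and the hard-coded
"did not have" cue are made up for illustration, and real negation
detection is far more involved. It only shows that a cue and a concept
end up in different sentences once a newline is treated as a sentence
boundary.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class NewlineSplitSketch {

    // Mimics the behaviour described in this thread for 3.1.x: every
    // newline also ends a sentence. Input is one detected sentence.
    static List<String> splitAtNewlines(String sentence) {
        return Arrays.stream(sentence.split("\\R"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // Toy negation check (hypothetical): the cue and the concept must
    // sit in the same sentence, since negation is sentence-scoped.
    static boolean negatedInSomeSentence(List<String> sentences,
                                         String concept) {
        return sentences.stream().anyMatch(
                s -> s.contains("did not have") && s.contains(concept));
    }

    // Prints whether "diabetes" is negated with and without the split.
    static void report(String note) {
        // Newline kept inside the sentence: cue and concept together.
        System.out.println(
                negatedInSomeSentence(Arrays.asList(note), "diabetes"));
        // Newline treated as a boundary: cue and concept separated.
        System.out.println(
                negatedInSomeSentence(splitAtNewlines(note), "diabetes"));
    }

    public static void main(String[] args) {
        // Prints true, then false.
        report("patient did not have\ndiabetes");
    }
}
```

Running main prints true for the note kept as one sentence and false
once it is split at the newline, which matches the behaviour reported
for the two narrative comments quoted below.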
Regards=2C
Paula
=20
> Date: Wed=2C 15 Jan 2014 14:53:17 -0500
> Subject: Re: sentence splitter & forks/branches
> From: vngarla@gmail.com
> To: dev@ctakes.apache.org
>=20
> It is unfortunately not that trivial, as allowing newlines within
> sentences requires changes to the assertion and dependency parser
> modules.
>=20
> If you're not using those AEs you could theoretically build the ytex
> branch=2C and just add  ctakes-ytex-uima.jar and
> ctakes-ytex-uima\desc\analysis_engine\SentenceDetectorAnnotator.xml to
> your existing ctakes install (haven't tried it, but it should work).
>=20
> -vj
>=20
>=20
> On Wed=2C Jan 15=2C 2014 at 1:57 PM=2C Lingren=2C Todd <Todd.Lingren@cchm=
c.org>wrote:
>=20
> > I have a general question about forks, specifically the YTEX branch
> > that Vijay mentions.
> > If I wanted to implement just the sentence splitter from YTEX into a
> > currently existing 3.1 install, how would I do that? Is it possible?
> > Or do I have to switch over completely to run from YTEX branch?
> >
> > Todd Lingren
> > Biomedical Informatics
> > Cincinnati Children's Hospital
> > Todd.Lingren@cchmc.org
> > 513-803-9032
> >
> >
> > -----Original Message-----
> > From: vijay garla [mailto:vngarla@gmail.com]
> > Sent: Wednesday=2C January 15=2C 2014 11:34 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: svn commit: r1551805 -
> > /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/=
assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.j=
ava
> >
> > The issue is indeed the sentence splitter - negation is limited to
> > words within the sentence, and if newlines are considered sentence
> > boundaries, it doesn't work properly (splitting on newlines breaks
> > many other things as well).  The YTEX branch includes a sentence
> > splitter that does not automatically split sentences on newlines.
> >
> > best=2C
> >
> > vj
> >
> >
> > On Wed=2C Jan 15=2C 2014 at 10:03 AM=2C Masanz=2C James J. <Masanz.Jame=
s@mayo.edu
> > >wrote:
> >
> > > Hi Paula=2C
> > >
> > > The sentence detector in 3.1.0 and 3.1.1 (and previous releases)
> > > assumes sentences don't cross line boundaries.
> > > OpenNLP is used to find sentence breaks, but then if newlines are
> > > found, those are also set (within cTAKES, not OpenNLP) to be
> > > sentence breaks.
> > >
> > > (just FYI I haven't had a chance to look at the ytex branch, which
> > > the subject commit is about)
> > >
> > > -- James
> > >
> > > -----Original Message-----
> > > From: dev-return-2375-Masanz.James=3Dmayo.edu@ctakes.apache.org [mail=
to:
> > > dev-return-2375-Masanz.James=3Dmayo.edu@ctakes.apache.org] On Behalf =
Of
> > > digital paula
> > > Sent: Tuesday=2C January 14=2C 2014 10:25 PM
> > > To: dev@ctakes.apache.org
> > > Subject: RE: svn commit: r1551805 -
> > > /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctake=
s
> > > /assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtake=
s
> > > Impl.java
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > Hello cTAKES Developer Community=2C
> > >  I'm a little behind on reading posts....this one is from last month.
> > > I think this issue is already addressed in current release? I'm still
> > > running the previous release...3.1.0.
> > > I just noticed something interesting: the negation didn't take
> > > when it is on a different line.  I just removed all carriage
> > > returns from the narratives and negation picked it up as long as
> > > it's treated as one long string.  To better explain what I mean,
> > > here are two narrative comments.
> > >
> > > 1.  patient did not have diabetes
> > > 2. patient did not have
> > > diabetes
> > >
> > > Number 1 above got negated but number 2 did not. This might be
> > > related to the issue w/the sectionizer.  I noticed that when I
> > > treated the narrative as one string the sectionizer never crashes
> > > with the NPE.  Well, the sectionizer is of no use if the narrative
> > > is one string, but it's helping me pinpoint the problem.
> > >
> > > Regards=2C
> > > Paula
> > >
> > >
> > > > Date: Thu=2C 19 Dec 2013 11:04:57 -0500
> > > > Subject: Re: FW: svn commit: r1551805 -
> > > /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctake=
s
> > > /assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtake=
s
> > > Impl.java
> > > > From: vngarla@gmail.com
> > > > To: dev@ctakes.apache.org
> > > >
> > > > Hi Pei=2C
> > > >
> > > > I'm not sure if that would solve the problem: the change in the
> > > > ytex branch causes newlines to be ignored (i.e. not treated as a
> > > > token). Trunk's sentence splitter splits sentences on newlines,
> > > > so newlines would never be found in a sentence.  However, if we
> > > > had a reproducer we could check it fairly easily in the ytex
> > > > branch.
> > > >
> > > > Best=2C
> > > >
> > > > VJ
> > > >
> > > >
> > > > On Thu=2C Dec 19=2C 2013 at 10:15 AM=2C Chen=2C Pei
> > > > <Pei.Chen@childrens.harvard.edu>wrote:
> > > >
> > > > > Vj=2C
> > > > > Do you think this is what was causing the NPE's [1]?
> > > > > If so=2C shall we make the same fix in trunk?
> > > > > --Pei
> > > > >
> > > > > [1]
> > > > >
> > > http://mail-archives.apache.org/mod_mbox/ctakes-dev/201309.mbox/%3C92=
4
> > > DE05C19409B438EB81DE683A942D9105A93CB%40CHEXMBX1A.CHBOSTON.ORG%3E
> > > > >
> > > > > -----Original Message-----
> > > > > From: vjapache@apache.org [mailto:vjapache@apache.org]
> > > > > Sent: Tuesday=2C December 17=2C 2013 9:15 PM
> > > > > To: commits@ctakes.apache.org
> > > > > Subject: svn commit: r1551805 -
> > > > >
> > > /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctake=
s
> > > /assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtake=
s
> > > Impl.java
> > > > >
> > > > > Author: vjapache
> > > > > Date: Wed Dec 18 02:14:13 2013
> > > > > New Revision: 1551805
> > > > >
> > > > > URL: http://svn.apache.org/r1551805
> > > > > Log:
> > > > > add support for sentences that contain newline tokens.
> > > > >
> > > > > Modified:
> > > > >
> > > > >
> > > ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes=
/
> > > assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakes=
I
> > > mpl.java
> > > > >
> > > > > Modified:
> > > > >
> > > ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes=
/
> > > assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakes=
I
> > > mpl.java
> > > > > URL:
> > > > >
> > > http://svn.apache.org/viewvc/ctakes/branches/ytex/ctakes-assertion/sr=
c
> > > /main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOff=
s
> > > etToLineTokenConverterCtakesImpl.java?rev=3D1551805&r1=3D1551804&r2=
=3D155180
> > > 5&view=3Ddiff
> > > > >
> > > > >
> > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> > > =3D=3D=3D=3D=3D=3D=3D=3D
> > > > > ---
> > > > >
> > > ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes=
/
> > > assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakes=
I
> > > mpl.java
> > > > > (original)
> > > > > +++
> > > ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctake
> > > > > +++
> > > s/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCta
> > > > > +++ kesImpl.java Wed Dec 18 02:14:13 2013
> > > > > @@ -32=2C8 +32=2C8 @@ import org.apache.uima.jcas.tcas.Annotat  i=
mport
> > > > > org.mitre.medfacts.i2b2.api.ApiConcept=3B
> > > > >  import
> > > > > org.mitre.medfacts.zoner.CharacterOffsetToLineTokenConverter=3B
> > > > >  import org.mitre.medfacts.zoner.LineAndTokenPosition=3B
> > > > > -
> > > > >  import org.apache.ctakes.typesystem.type.syntax.BaseToken=3B
> > > > > +import org.apache.ctakes.typesystem.type.syntax.NewlineToken=3B
> > > > >  import org.apache.ctakes.typesystem.type.textspan.Sentence=3B
> > > > >
> > > > >  public class CharacterOffsetToLineTokenConverterCtakesImpl
> > > > > implements CharacterOffsetToLineTokenConverter
> > > > > @@ -78=2C11 +78=2C13 @@ public class CharacterOffsetToLineTokenC
> > > > >           for (Annotation current : annotationIndex)
> > > > >           {
> > > > >                   BaseToken bt =3D (BaseToken)current=3B
> > > > > -                 int begin =3D bt.getBegin()=3B
> > > > > -                 int end =3D bt.getEnd()=3B
> > > > > -
> > > > > -                 tokenBeginEndTreeSet.add(begin)=3B
> > > > > -                 tokenBeginEndTreeSet.add(end)=3B
> > > > > +                 // filter out NewlineToken
> > > > > +                 if (!(bt instanceof NewlineToken)) {
> > > > > +                         int begin =3D bt.getBegin()=3B
> > > > > +                         int end =3D bt.getEnd()=3B
> > > > > +                         tokenBeginEndTreeSet.add(begin)=3B
> > > > > +                         tokenBeginEndTreeSet.add(end)=3B
> > > > > +                 }
> > > > >           }
> > > > >    }
> > > > >
> > > > >
> > > > >
> > > > >
> > >
> > >
> > >
> > >
> >
> >

--_df16c7f5-3337-46a5-b864-1dc57299d19d_--