From: britt fitch
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability
Date: Tue, 7 Oct 2014 11:35:26 -0400
Message-Id: <14DC79DC-CD8B-48E3-8949-28B623F9A91A@wiredinformatics.com>
In-Reply-To: <543406C0.2040506@perfectsearchcorp.com>

The option Sean mentioned of writing your own custom consumer (without
the UIMA id that is causing your issues) should meet these needs, I
believe.

    
Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
Britt.Fitch@wiredinformatics.com

On Oct 7, 2014, at 11:29 AM, Kim Ebert <kim.ebert@perfectsearchcorp.com> wrote:

> Hi Sean,
>
> Well, of course that makes plenty of sense. When testing different cTakes
> configurations, you would expect different output. In our testing we've
> found several cases where running with the same configuration outputs
> different data under different moons. Having consistent results helps us
> know whether we've made improvements to our quality. Having output
> that is in a predictable order makes checking for differences much
> cheaper when you are dealing with larger data sets.
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/

> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>> Hi Kim,
>>
>> One might want to compare the Sentence detector that uses end-of-line characters as sentence splitters with one that does not. Such a change in sentence splitting would affect not only the sentence type discoveries but also practically every type that follows.
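
To make the sentence-splitting point concrete, here is a toy sketch. It is not the cTAKES sentence detector: sentence_spans and its break_on_newline flag are hypothetical, and the note text is made up. It only shows that the same text yields a different set of sentence spans once end-of-line characters count as boundaries.

```python
def sentence_spans(text, break_on_newline):
    """Toy splitter: return (begin, end) spans, breaking after '.' and optionally '\\n'."""
    spans, start = [], 0
    for i, ch in enumerate(text):
        if ch == "." or (break_on_newline and ch == "\n"):
            # Skip spans that contain only whitespace.
            if text[start:i + 1].strip():
                spans.append((start, i + 1))
            start = i + 1
    # Keep any trailing text that did not end with a boundary character.
    if text[start:].strip():
        spans.append((start, len(text)))
    return spans


note = "Diagnosis:\nskin cancer. Follow up in\ntwo weeks."

with_eol = sentence_spans(note, break_on_newline=True)
without_eol = sentence_spans(note, break_on_newline=False)

print(len(with_eol), "sentences when end-of-line splits")
print(len(without_eol), "sentences when it does not")
```

Any annotation that is computed per sentence would then be built from different inputs in the two configurations, which is why downstream types can change as well.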

>> Another might want to compare a note with "skin cancer" vs. one in which you replace "skin cancer" with "melanoma" just to see what the CUI differences might be. There are changes in two words vs. one, 11 characters vs. 8, a removed adjective(?), and of course changes in CUIs.
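
The word and character counts in that example can be checked directly. The sketch below uses a made-up sentence and a hypothetical span_of helper; it also shows that every character offset after the replaced term shifts by three, so annotations further along in the note would differ in their begin and end values even if nothing else changed.

```python
old_term, new_term = "skin cancer", "melanoma"

# Made-up note text, used only to show the effect on offsets.
original = "Biopsy confirmed skin cancer on the left arm."
modified = original.replace(old_term, new_term)


def span_of(text, phrase):
    """Return the (begin, end) character offsets of the first match of phrase."""
    begin = text.index(phrase)
    return begin, begin + len(phrase)


word_counts = (len(old_term.split()), len(new_term.split()))
char_counts = (len(old_term), len(new_term))
shift = span_of(modified, "left arm")[0] - span_of(original, "left arm")[0]

print("words:", word_counts)
print("characters:", char_counts)
print("offset shift for later text:", shift)
```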

>> Of course, if you are just running notes on a new moon and then again on a full moon ...
>>
>> Sean
>>
>> -----Original Message-----
>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>> Sent: Tuesday, October 07, 2014 10:41 AM
>> To: dev@ctakes.apache.org
>> Subject: Re: cTakes output predictability
>>
>> Sean,
>>
>> "...being different because of a possibly intentional difference."
>>
>> I would like you to elaborate a bit on what would be intentionally different between the processing of the same document multiple times. It would help my understanding of cTakes.
>>
>> Thanks,
>>
>> Kim Ebert
>> 1.801.669.7342
>> Perfect Search Corp
>> http://www.perfectsearchcorp.com/

>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>> Steve Bethard wrote:
>>>> I spent some time writing a script for diff-ing CASes
>>> I urge anyone interested in comparing cTakes CASes / output to use this type of approach. Comparison of program output is a post-process task, and unless absolutely necessary, code to juggle data and metadata belongs there. Attempting to force every module past, present and future to abide by fixed orderings, enumerations etc. is not as simple a task as one might initially think - especially if third-party libraries are involved. I won't get into problems associated with why one is comparing output (swapped module?) and IDs, orders etc. being different because of a possibly intentional difference.
>>>
>>> In addition to or instead of creating a post-processing script, one could write a new "cas-consumer" that writes output in a desired format - but this should not require changes to engines.
>>>
>>> "If it ain't broke, don't fix it"
>>>
>>> Sean
>>>
>>>
>>> -----Original Message-----
>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>> Sent: Monday, October 06, 2014 11:23 PM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: cTakes output predictability
>>>
>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
>>> <bruce.tietjen@perfectsearchcorp.com> wrote:
>>>> Since I started working with cTakes some time ago, I have found it
>>>> difficult to compare the output between subsequent runs on the same
>>>> files because annotations are often assigned different IDs, are
>>>> listed in different order, etc.
>>> At one point, I spent some time writing a script for diff-ing CASes
>>> that was intended to address some of these kinds of issues. It's still
>>> here in cTAKES:
>>>
>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>>
>>> You might see if you could use or adapt that to your needs.
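
For readers who only need the general shape of a post-processing comparison, here is a minimal sketch in Python. It does not reproduce what CompareFeatureStructures.java does; the record layout (id, type name, begin, end), the function names, and the sample runs are assumptions made for illustration. Ids are dropped and records are sorted before diffing, so only real differences in spans or types show up.

```python
import difflib


def normalize(records):
    """Drop ids, sort by (begin, end, type name), and format one line per record."""
    keyed = sorted((begin, end, type_name) for _id, type_name, begin, end in records)
    return [f"{begin}\t{end}\t{type_name}" for begin, end, type_name in keyed]


def diff_runs(run_a, run_b):
    """Return unified-diff lines between two runs; an empty list means no difference."""
    return list(difflib.unified_diff(
        normalize(run_a), normalize(run_b),
        fromfile="run_a", tofile="run_b", lineterm=""))


# Made-up records: (id, type name, begin, end).
run_a = [(1, "Sentence", 0, 24), (2, "DiseaseDisorderMention", 12, 23)]
run_b = [(9, "DiseaseDisorderMention", 12, 23), (4, "Sentence", 0, 24)]  # same content
run_c = [(1, "Sentence", 0, 21), (2, "DiseaseDisorderMention", 12, 20)]  # changed spans

print(diff_runs(run_a, run_b))
print("\n".join(diff_runs(run_a, run_c)))
```

Sorting numeric tuples before formatting keeps the order stable regardless of how the records were listed, which is the property that makes the diff cheap on larger data sets.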

>>> Steve
