Message-ID: <54340FB7.7000900@perfectsearchcorp.com>
Date: Tue, 07 Oct 2014 10:07:19 -0600
From: Kim Ebert
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

It concerns me a bit that making the code return consistent results would
be so concerning. This should be the default mode of operation.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 09:59 AM, britt fitch wrote:
> I think changing the code raises at least some concerns of affecting
> others, while adding a custom consumer raises zero. Given how easy it
> is to write a custom consumer, that is my vote.
>
> Britt Fitch
> Wired Informatics
> 265 Franklin St Ste 1702
> Boston, MA 02110
> http://wiredinformatics.com
> Britt.Fitch@wiredinformatics.com
>
> On Oct 7, 2014, at 11:56 AM, Kim Ebert wrote:
>
>> I think we may really prefer the first method. Since it doesn't appear
>> that there are any consequences with moving forward with changing the
>> code, we would really like to move forward with this approach.
>>
>> Kim Ebert
>> 1.801.669.7342
>> Perfect Search Corp
>> http://www.perfectsearchcorp.com/
>>
>> On 10/07/2014 09:35 AM, britt fitch wrote:
>>> The option Sean mentioned of writing your own custom consumer (without
>>> the UIMA id that is causing your issues) should meet these needs I
>>> believe.
>>>
>>> Britt Fitch
>>> Wired Informatics
>>> 265 Franklin St Ste 1702
>>> Boston, MA 02110
>>> http://wiredinformatics.com
>>> Britt.Fitch@wiredinformatics.com
>>>
>>> On Oct 7, 2014, at 11:29 AM, Kim Ebert wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> Well of course that makes plenty of sense. Testing different cTakes
>>>> configurations you would expect different output. In our testing we've
>>>> found several cases where running with the same configuration outputs
>>>> different data under different moons. Having consistent results
>>>> helps us know if we've made improvements to our quality or not.
>>>> Having output that is in a predictable order makes checking to see if
>>>> there are differences much cheaper when you are dealing with larger
>>>> data sets.
>>>>
>>>> Kim Ebert
>>>> 1.801.669.7342
>>>> Perfect Search Corp
>>>> http://www.perfectsearchcorp.com/
>>>>
>>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>>> Hi Kim,
>>>>>
>>>>> One might want to compare the Sentence detector that uses end of line
>>>>> characters as sentence splitters with one that does not. Such a
>>>>> change in sentence splitting would not only affect the sentence type
>>>>> discoveries but also practically every type that follows.
>>>>>
>>>>> Another might want to compare a note with "skin cancer" vs. one in
>>>>> which you replace "skin cancer" with "melanoma" just to see what the
>>>>> CUI differences might be. There are changes in two words vs. one,
>>>>> 11 characters vs. 8, a removed adjective(?), and of course changes
>>>>> in CUIs.
>>>>>
>>>>> Of course, if you are just running notes on a new moon and then
>>>>> again on a full moon ...
>>>>>
>>>>> Sean
>>>>>
>>>>> -----Original Message-----
>>>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: Re: cTakes output predictability
>>>>>
>>>>> Sean,
>>>>>
>>>>> "...being different because of a possibly intentional difference."
>>>>>
>>>>> I would like you to elaborate a bit on what would be
>>>>> intentionally different between the processing of the same document
>>>>> multiple times. It would help my understanding of cTakes.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Kim Ebert
>>>>> 1.801.669.7342
>>>>> Perfect Search Corp
>>>>> http://www.perfectsearchcorp.com/
>>>>>
>>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>>> Steve Bethard wrote:
>>>>>>> I spent some time writing a script for diff-ing CASes
>>>>>> I urge anyone interested in comparing cTakes CASes / output to use
>>>>>> this type of approach. Comparison of program output is a
>>>>>> post-process task, and unless absolutely necessary, code to juggle
>>>>>> data and metadata belongs there. Attempting to force every module
>>>>>> past, present and future to abide by fixed orderings, enumerations
>>>>>> etc. is not as simple a task as one might initially think -
>>>>>> especially if third-party libraries are involved. I won't get into
>>>>>> problems associated with why one is comparing output (swapped
>>>>>> module?) and IDs, orders etc. being different because of a possibly
>>>>>> intentional difference.
>>>>>>
>>>>>> In addition to or instead of creating a post-processing script, one
>>>>>> could write a new "cas-consumer" that writes output in a desired
>>>>>> format - but this should not require changes to engines.
>>>>>>
>>>>>> "If it ain't broke, don't fix it"
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>>> To: dev@ctakes.apache.org
>>>>>> Subject: Re: cTakes output predictability
>>>>>>
>>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen wrote:
>>>>>>> Since I started working with cTakes some time ago, I have found it
>>>>>>> difficult to compare the output between subsequent runs on the same
>>>>>>> files because annotations are often assigned different IDs, are
>>>>>>> listed in different order, etc.
>>>>>> At one point, I spent some time writing a script for diff-ing CASes
>>>>>> that intended to address some of these kinds of issues. It's still
>>>>>> here in cTAKES:
>>>>>>
>>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>>>>>
>>>>>> You might see if you could use or adapt that to your needs.
>>>>>>
>>>>>> Steve
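
The post-processing approach discussed in this thread (ignore run-specific IDs, put annotations in a predictable order, then diff) might be sketched roughly as follows. This is only an illustration under stated assumptions: the Ann class, the CanonicalDump name, and the "begin-end type" line format are hypothetical stand-ins, not cTAKES or UIMA types, and this is not what CompareFeatureStructures.java does. A real custom consumer would read annotations from the CAS instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CanonicalDump {

    /** Hypothetical stand-in for an annotation; NOT a cTAKES or UIMA class. */
    static final class Ann {
        final int id;       // run-specific identifier, deliberately left out of the output
        final String type;  // illustrative type name
        final int begin;    // start offset in the document text
        final int end;      // end offset in the document text

        Ann(int id, String type, int begin, int end) {
            this.id = id;
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Sorts by begin offset, end offset, then type, and renders each annotation without its ID. */
    static List<String> canonicalLines(List<Ann> anns) {
        List<Ann> sorted = new ArrayList<>(anns);
        sorted.sort(Comparator.comparingInt((Ann a) -> a.begin)
                .thenComparingInt(a -> a.end)
                .thenComparing(a -> a.type));
        List<String> lines = new ArrayList<>();
        for (Ann a : sorted) {
            lines.add(a.begin + "-" + a.end + " " + a.type);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Two made-up "runs" over the same note: same spans, different IDs and order.
        List<Ann> runA = List.of(
                new Ann(7, "Sentence", 0, 24),
                new Ann(3, "DiseaseDisorderMention", 5, 16));
        List<Ann> runB = List.of(
                new Ann(12, "DiseaseDisorderMention", 5, 16),
                new Ann(40, "Sentence", 0, 24));
        // After canonicalization the two dumps are identical, so a plain diff is empty.
        System.out.println(canonicalLines(runA).equals(canonicalLines(runB))); // prints true
    }
}
```

Because the normalization lives outside the annotators, it leaves the engines untouched, which is the trade-off argued for above; the cost is that every field to be compared has to be listed explicitly in the comparator and the output line.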