From: "Finan, Sean"
To: "dev@ctakes.apache.org"
Subject: RE: cTakes output predictability
Date: Tue, 7 Oct 2014 16:46:31 +0000
Message-ID: <393252F14C42F946952F1ED75D316CAD391746A8@CHEXMBX4A.CHBOSTON.ORG>
In-Reply-To: <54340D2D.3030904@perfectsearchcorp.com>

Hi Kim,

> It concerns me a bit by making the code return consistent results would be so concerning.

Could you please clarify what you mean by "consistent results"?
Do you mean ordering and IDs, or are you talking about actual type values not matching?

> This should be the default mode of operation.

Depending upon what you meant above, I may agree or disagree.

> Since it doesn't appear that there are any consequences with moving forward with changing the code

Why do you say this?

I think that there may be more required changes than you realize. Every insertion into the CAS must be of ordered data. This means that, for instance, named entities discovered by dictionary will need to be inserted in some predictable order, such as by alphabetized CUI per every alphabetized TUI (and other code) per ordered text span. You will need to check and recheck every point at which the CAS is modified by every module. Right now there are at least three or four places in two cTakes dictionary modules where a change would be required - and that doesn't include YTEX lookup.

If you really feel strongly about this and are going to change cTakes code, then I suggest (at the risk of sounding like a complete jerk) that you also consider the following:

1. Don't check anything into trunk until all is well with your changes and tests
    Just in case you abandon the effort
2. Write unit tests for every change
    True, Map to LinkedMap shouldn't break anything, but they are good to have, and may prevent others in the future from switching back to a non-linked map or any unordered collection (set not list, etc.). Unit tests also make a better place for explanation, in Javadoc, than inline comments above the code.
3. Run memory requirement tests before all of your changes and then again after your changes
    I'm actually curious about how much memory might be eaten with linkages everywhere
4. Run performance (speed) tests before and after
    On a large corpus to ensure that garbage collection is involved
5. Do the above with every combination possible in current workflows: every combination of available sentence detector, pos tagger, smoking status detector, dictionary lookup, cas consumer, etc.
    As soon as somebody says "all output is consistently ordered between runs" it had better be so for every possible workflow
6. Write system tests to ensure ordered/predicted outputs with each combination
    Otherwise somebody may break it
7. Document the what, how, and why for future development
    Otherwise somebody won't know to stick to the new rules
8. Assist anybody as needed that in the future breaks one of these unit or system tests with a fix or new feature
    By mandating such a rule you are assuming responsibility for it
9. Assist anybody as needed that in the future adds a new module or workflow to cTakes to abide by the ordering requirement
    By mandating such a rule you are assuming responsibility for it
10. Assist anybody as needed that in the future adds a new module or workflow to add system tests to ensure maintenance of the ordering requirement
    By mandating such a rule you are assuming responsibility for it

-----Original Message-----
From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
Sent: Tuesday, October 07, 2014 11:57 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

I think we may really prefer the first method. Since it doesn't appear that there are any consequences with moving forward with changing the code, we would really like to move forward with this approach.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 09:35 AM, britt fitch wrote:
> The option Sean mentioned of writing your own custom consumer (without
> the UIMA id that is causing your issues) should meet these needs I
> believe.
>
> Britt Fitch
> Wired Informatics
> 265 Franklin St Ste 1702
> Boston, MA 02110
> http://wiredinformatics.com
> Britt.Fitch@wiredinformatics.com
>
> On Oct 7, 2014, at 11:29 AM, Kim Ebert wrote:
>
>> Hi Sean,
>>
>> Well of course that makes plenty of sense. Testing different cTakes
>> configurations you would expect different output. In our testing
>> we've found several cases where running with the same configuration
>> outputs different data under different moons. Having consistent
>> results helps us know if we've made improvements to our quality or
>> not. Having output that is in a predictable order makes checking to
>> see if there are differences much cheaper when you are dealing with
>> larger data sets.
>>
>> Kim Ebert
>> 1.801.669.7342
>> Perfect Search Corp
>> http://www.perfectsearchcorp.com/
>>
>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>> Hi Kim,
>>>
>>> One might want to compare the Sentence detector that uses end of line
>>> characters as sentence splitters with one that does not. Such a
>>> change in sentence splitting would not only affect the sentence type
>>> discoveries but also practically every type that follows.
>>>
>>> Another might want to compare a note with "skin cancer" vs. one in
>>> which you replace "skin cancer" with "melanoma" just to see what the
>>> CUI differences might be. There are changes in two words vs. one,
>>> 11 characters vs. 8, a removed adjective(?), and of course changes
>>> in CUIs.
>>>
>>> Of course, if you are just running notes on a new moon and then
>>> again on a full moon ...
>>>
>>> Sean
>>>
>>> -----Original Message-----
>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: cTakes output predictability
>>>
>>> Sean,
>>>
>>> "...being different because of a possibly intentional difference."
>>>
>>> I would like you to elaborate a bit on what would be
>>> intentionally different between the processing of the same document
>>> multiple times. It would help my understanding of cTakes.
>>>
>>> Thanks,
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>> Steve Bethard wrote:
>>>>> I spent some time writing a script for diff-ing CASes
>>>> I urge anyone interested in comparing cTakes CASes / output to use
>>>> this type of approach. Comparison of program output is a
>>>> post-process task, and unless absolutely necessary code to juggle
>>>> data and metadata belongs there. Attempts to force every module
>>>> past, present and future to abide by fixed orderings, enumerations
>>>> etc. is not as simple a task as one might initially think -
>>>> especially if third-party libraries are involved. I won't get into
>>>> problems associated with why one is comparing output (swapped
>>>> module?) and IDs, orders etc. being different because of a possibly
>>>> intentional difference.
>>>>
>>>> In addition to or instead of creating a post-processing script, one
>>>> could write a new "cas-consumer" that writes output in a desired
>>>> format - but this should not require changes to engines.
>>>>
>>>> "If it ain't broke, don't fix it"
>>>>
>>>> Sean
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: cTakes output predictability
>>>>
>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen wrote:
>>>>> Since I started working with cTakes some time ago, I have found it
>>>>> difficult to compare the output between subsequent runs on the
>>>>> same files because annotations are often assigned different IDs,
>>>>> are listed in different order, etc.
>>>> At one point, I spent some time writing a script for diff-ing CASes
>>>> that intended to address some of these kinds of issues. It's still
>>>> here in cTAKES:
>>>>
>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>>>
>>>> You might see if you could use or adapt that to your needs.
>>>>
>>>> Steve
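
For readers following the thread, the ordering Sean describes (by ordered text span, then alphabetized TUI, then alphabetized CUI) can be sketched in plain Java as below. This is only a minimal sketch under stated assumptions: ConceptHit and DeterministicOrder are hypothetical names invented for illustration, not cTAKES or UIMA types, and any TUI/CUI strings passed in are placeholders. The idea is to sort a copy of the hits before anything is written out, so that an unordered source collection (a HashSet, say) no longer influences the output order.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one dictionary hit; NOT a cTAKES or UIMA type.
final class ConceptHit {
    final int begin;   // start offset of the text span
    final int end;     // end offset of the text span
    final String tui;  // semantic type identifier
    final String cui;  // concept identifier

    ConceptHit(int begin, int end, String tui, String cui) {
        this.begin = begin;
        this.end = end;
        this.tui = tui;
        this.cui = cui;
    }

    @Override
    public String toString() {
        return begin + "-" + end + " " + tui + " " + cui;
    }
}

// Hypothetical helper: imposes one fixed order on hits before they are written.
final class DeterministicOrder {
    // Text span first, then alphabetized TUI, then alphabetized CUI.
    static final Comparator<ConceptHit> OUTPUT_ORDER =
            Comparator.comparingInt((ConceptHit h) -> h.begin)
                    .thenComparingInt(h -> h.end)
                    .thenComparing(h -> h.tui)
                    .thenComparing(h -> h.cui);

    private DeterministicOrder() {
    }

    // Returns a sorted copy; the caller's collection is left untouched.
    static List<ConceptHit> sorted(Collection<ConceptHit> hits) {
        List<ConceptHit> copy = new ArrayList<>(hits);
        copy.sort(OUTPUT_ORDER);
        return copy;
    }
}
```

Sorting at the point of output, for example inside a custom CAS consumer or a post-processing script, would leave the engines themselves untouched, which is the approach Sean and Britt suggest in the thread.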