ctakes-dev mailing list archives

From "Masanz, James J." <Masanz.Ja...@mayo.edu>
Subject RE: cTakes output predictability
Date Tue, 07 Oct 2014 15:58:39 GMT
FWIW, I agree with Sean that comparing should be a post-processing step and trying to get UIMA
internal IDs to match on subsequent runs is not worth opening the code for.
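
As a rough illustration of the post-processing approach (a sketch only, not code from cTAKES or from this thread): the Python below assumes each run's annotations have already been exported as plain dictionaries. The record layout, the key names "id", "type", "begin" and "end", and the helper names normalize and diff_runs are all hypothetical. It drops the run-specific ID, sorts the remaining records into a stable order, and reports which annotations appear in only one run.

```python
from collections import Counter

# Hypothetical record layout: one dict per annotation with "id", "type",
# "begin", "end" and any other features. These key names are assumptions
# made for illustration; they are not taken from cTAKES or UIMA.
VOLATILE_KEYS = {"id"}  # values expected to change from run to run


def _strip(ann):
    """Return a copy of one annotation without the run-specific keys."""
    return {k: v for k, v in ann.items() if k not in VOLATILE_KEYS}


def normalize(annotations):
    """Drop volatile keys and sort annotations into a stable order."""
    cleaned = [_strip(ann) for ann in annotations]
    return sorted(
        cleaned,
        key=lambda a: (a["begin"], a["end"], a["type"], repr(sorted(a.items()))),
    )


def _key(ann):
    """Hashable fingerprint of an annotation, ignoring volatile keys."""
    return tuple(sorted((k, repr(v)) for k, v in _strip(ann).items()))


def diff_runs(run_a, run_b):
    """Return (only_in_a, only_in_b) as sorted lists of fingerprints."""
    a = Counter(_key(x) for x in run_a)
    b = Counter(_key(x) for x in run_b)
    return sorted((a - b).elements()), sorted((b - a).elements())
```

Two runs that differ only in IDs and ordering then normalize to the same list, and diff_runs returns two empty lists for them. Comparing as a multiset (Counter) rather than a set keeps duplicate annotations from being silently merged.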

-----Original Message-----
From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 10:56 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

I think we may really prefer the first method. Since it doesn't appear
that there are any consequences with moving forward with changing the
code, we would really like to move forward with this approach.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 09:35 AM, britt fitch wrote:
> The option Sean mentioned of writing your own custom consumer (without
> the UIMA id that is causing your issues) should meet these needs I
> believe. 
>
> Britt Fitch
> Wired Informatics
> 265 Franklin St Ste 1702
> Boston, MA 02110
> http://wiredinformatics.com
> Britt.Fitch@wiredinformatics.com
>
> On Oct 7, 2014, at 11:29 AM, Kim Ebert
> <kim.ebert@perfectsearchcorp.com
> <mailto:kim.ebert@perfectsearchcorp.com>> wrote:
>
>> Hi Sean,
>>
>> Well of course that makes plenty of sense. Testing different cTakes
>> configurations you would expect different output. In our testing we've
>> found several cases where running with the same configuration outputs
>> different data under different moons. Having consistent results helps us
>> know if we've made improvements to our quality or not. Having output
>> that is in a predictable order makes checking to see if there are
>> differences much cheaper when you are dealing with larger data sets.
>>
>> Kim Ebert
>> 1.801.669.7342
>> Perfect Search Corp
>> http://www.perfectsearchcorp.com/
>>
>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>> Hi Kim,
>>>
>>> One might want to compare the Sentence detector that uses end of line
>>> characters as sentence splitters with one that does not.  Such a
>>> change in sentence splitting would not only affect the sentence type
>>> discoveries but also practically every type that follows.
>>>
>>> Another might want to compare a note with "skin cancer" vs. one in
>>> which you replace "skin cancer" with "melanoma" just to see what the
>>> CUI differences might be.  There are changes in two words vs. one,
>>> 11 characters vs. 8, a removed adjective(?), and of course changes
>>> in CUIs.
>>>
>>> Of course, if you are just running notes on a new moon and then
>>> again on a full moon ...
>>>
>>> Sean
>>>
>>> -----Original Message-----
>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: cTakes output predictability
>>>
>>> Sean,
>>>
>>> "...being different because of a possibly intentional difference."
>>>
>>> I would like you to elaborate a bit on what would be
>>> intentionally different between the processing of the same document
>>> multiple times. It would help my understanding of cTakes.
>>>
>>> Thanks,
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>> Steve Bethard wrote:
>>>>> I spent some time writing a script for diff-ing CASes
>>>> I urge anyone interested in comparing cTakes CASes / output to use
>>>> this type of approach.  Comparison of program output is a
>>>> post-process task, and unless absolutely necessary, code to juggle
>>>> data and metadata belongs there.  Attempting to force every module,
>>>> past, present and future, to abide by fixed orderings, enumerations,
>>>> etc. is not as simple a task as one might initially think -
>>>> especially if third-party libraries are involved.  I won't get into
>>>> problems associated with why one is comparing output (swapped
>>>> module?) and IDs, orders etc. being different because of a possibly
>>>> intentional difference.
>>>>
>>>> In addition to or instead of creating a post-processing script, one
>>>> could write a new "cas-consumer" that writes output in a desired
>>>> format - but this should not require changes to engines.
>>>>
>>>> "If it ain't broke, don't fix it"
>>>>
>>>> Sean
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: cTakes output predictability
>>>>
>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
>>>> <bruce.tietjen@perfectsearchcorp.com> wrote:
>>>>> Since I started working with cTakes some time ago, I have found it
>>>>> difficult to compare the output between subsequent runs on the same
>>>>> files because annotations are often assigned different IDs, are
>>>>> listed in different order, etc.
>>>> At one point, I spent some time writing a script for diff-ing CASes
>>>> that intended to address some of these kinds of issues. It's still
>>>> here in cTAKES:
>>>>
>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>>>
>>>> You might see if you could use or adapt that to your needs.
>>>>
>>>> Steve
>>
>
