ctakes-dev mailing list archives

From "Chen, Pei" <Pei.C...@childrens.harvard.edu>
Subject RE: cTakes Annotation Comparison
Date Thu, 18 Dec 2014 22:37:07 GMT
Thanks for this-- very useful.
Perhaps Sean Finan can comment more,
but it's also probably worth comparing against an adjudicated, human-annotated gold standard.


-----Original Message-----
From: Bruce Tietjen [mailto:bruce.tietjen@perfectsearchcorp.com] 
Sent: Thursday, December 18, 2014 1:45 PM
To: dev@ctakes.apache.org
Subject: cTakes Annotation Comparison

With the recent release of cTakes 3.2.1, we were very interested in checking for any differences
in annotations between the AggregatePlaintextUMLSProcessor pipeline and the AggregatePlaintextFastUMLSProcessor
pipeline within this release of cTakes with its associated set of UMLS resources.

We chose the SHARE 14-a-b Training data, which consists of 199 documents (Discharge 61,
ECG 54, Echo 42, and Radiology 42), as the basis for the comparison.

We decided to share a summary of the results with the development community.

Documents Processed: 199

Processing Time:
UMLSProcessor        2,439 seconds
FastUMLSProcessor    1,837 seconds

Total Annotations Reported:
UMLSProcessor        20,365 annotations
FastUMLSProcessor     8,284 annotations

Annotation Comparisons:
Annotations common to both sets:                       3,940
Annotations reported only by the UMLSProcessor:       16,425
Annotations reported only by the FastUMLSProcessor:    4,344
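
As a quick sanity check on the figures above, the short Python sketch below confirms that each pipeline's total equals its shared plus unique annotations, and prints what share of each pipeline's output the other pipeline also produced. The variable names are illustrative only and are not part of the compare tool.

```python
# Counts copied from the summary above.
total_umls = 20365   # annotations reported by the UMLSProcessor
total_fast = 8284    # annotations reported by the FastUMLSProcessor
common = 3940        # annotations common to both sets
only_umls = 16425    # reported only by the UMLSProcessor
only_fast = 4344     # reported only by the FastUMLSProcessor

# Each pipeline's total should equal its shared plus unique annotations.
assert common + only_umls == total_umls
assert common + only_fast == total_fast

# Share of each pipeline's output that the other pipeline also produced.
print(f"UMLSProcessor overlap:     {common / total_umls:.1%}")
print(f"FastUMLSProcessor overlap: {common / total_fast:.1%}")
```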

If anyone is interested, the following was our test procedure:

We used the UIMA CPE to process the document set twice, once using the AggregatePlaintextUMLSProcessor
pipeline and once using the AggregatePlaintextFastUMLSProcessor pipeline. We used the WriteCAStoFile
CAS consumer to write the results to output files.

We used a tool we recently developed to analyze and compare the annotations generated by the
two pipelines. The tool compares the two outputs for each file and reports any differences
in the annotations (MedicationMention, SignSymptomMention, ProcedureMention, AnatomicalSiteMention,
DiseaseDisorderMention) between the two output sets. The tool reports the number of 'matches'
and 'misses' between each annotation set. A 'match' is defined as the presence of an identified
source text interval with its associated CUI appearing in both annotation sets. A 'miss' is
defined as the presence of an identified source text interval and its associated CUI in one
annotation set, but no matching identified source text interval and CUI in the other. The
tool also reports the total number of annotations (source text intervals with associated CUIs)
reported in each annotation set. The compare tool is in our GitHub repository at https://github.com/perfectsearch/cTAKES-compare
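
The match/miss definition above can be sketched with plain set operations. This is only a minimal illustration in Python, not the code of the compare tool itself: the Annotation type, the compare function, and the sample offsets and CUIs are all made up for the example, and it assumes each annotation has already been reduced to one (begin, end, CUI) triple.

```python
from typing import Dict, NamedTuple, Set


class Annotation(NamedTuple):
    """Hypothetical record: one source text interval with one CUI."""
    begin: int  # start offset of the source text interval
    end: int    # end offset of the source text interval
    cui: str    # associated UMLS concept identifier


def compare(a: Set[Annotation], b: Set[Annotation]) -> Dict[str, int]:
    """Count matches and misses between two annotation sets.

    A match is the same interval and CUI present in both sets; a miss is
    an interval/CUI pair present in one set with no counterpart in the other.
    """
    return {
        "matches": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "total_a": len(a),
        "total_b": len(b),
    }


# Made-up sample data; the offsets and CUIs are placeholders.
umls = {Annotation(0, 8, "C0000001"), Annotation(10, 15, "C0000002")}
fast = {Annotation(0, 8, "C0000001"), Annotation(20, 27, "C0000003")}

result = compare(umls, fast)
print(result)
```

Under this definition, the same interval tagged with a different CUI in each set counts as a miss on both sides rather than a partial match.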