ctakes-dev mailing list archives

From David Kincaid <kincaid.d...@gmail.com>
Subject Re: cTakes Annotation Comparison
Date Fri, 19 Dec 2014 13:58:59 GMT
Thanks for this, Bruce! Very interesting work. It confirms what I've seen
in the small, non-systematic tests I've done. Did you happen to
capture the number of false positives yet (annotations made by cTAKES that
are not in the human adjudicated standard)? I've seen a lot of dictionary
hits that are not actually entity mentions, but I haven't had a chance to
do a systematic analysis (we're working on our annotated gold standard
now). One great example is the antibiotic "Today": every time the word
today appears in any text it is annotated as a medication mention, even
though it is almost never being used in that sense.

These results by themselves are quite disappointing to me. Both the
UMLSProcessor and especially the FastUMLSProcessor seem to have pretty poor
recall. It seems like the trade-off for more speed is a ten-fold (or more)
decrease in entity recognition.
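(Some rough arithmetic from the figures quoted below, assuming every
gold-standard annotation should have been found: recall is roughly
2,245/4,591 ≈ 49% for the UMLSProcessor versus 215/4,591 ≈ 5% for the
FastUMLSProcessor, a gap of about 2,245/215 ≈ 10x, while the runtime only
drops from 2,439 to 1,837 seconds, roughly 25% faster.)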

Thanks again for sharing your results with us. I think they are very useful
to the project.

- Dave

On Thu, Dec 18, 2014 at 5:06 PM, Bruce Tietjen <
bruce.tietjen@perfectsearchcorp.com> wrote:
>
> Actually, we are working on a similar tool to compare the pipeline output
> to the human adjudicated standard for the set we tested against. I didn't
> mention it before because the tool isn't complete yet, but initial results
> for the set (excluding annotations marked as "CUI-less") were as follows:
>
> Human adjudicated annotations: 4,591 (excluding CUI-less)
>
> Annotations found matching the human adjudicated standard:
> UMLSProcessor        2,245
> FastUMLSProcessor      215
>
> IMAT Solutions <http://imatsolutions.com>
> Bruce Tietjen
> Senior Software Engineer
> Mobile: 801.634.1547
> bruce.tietjen@imatsolutions.com
>
> On Thu, Dec 18, 2014 at 3:37 PM, Chen, Pei <Pei.Chen@childrens.harvard.edu>
> wrote:
> >
> > Bruce,
> > Thanks for this-- very useful.
> > Perhaps Sean Finan can comment more,
> > but it's also probably worth it to compare to an adjudicated human
> > annotated gold standard.
> >
> > --Pei
> >
> > -----Original Message-----
> > From: Bruce Tietjen [mailto:bruce.tietjen@perfectsearchcorp.com]
> > Sent: Thursday, December 18, 2014 1:45 PM
> > To: dev@ctakes.apache.org
> > Subject: cTakes Annotation Comparison
> >
> > With the recent release of cTAKES 3.2.1, we were very interested in
> > checking for any differences in annotations between using the
> > AggregatePlaintextUMLSProcessor pipeline and the
> > AggregatePlaintextFastUMLSProcessor pipeline within this release of
> > cTAKES with its associated set of UMLS resources.
> >
> > We chose to use the SHARE 14-a-b Training data that consists of 199
> > documents (Discharge 61, ECG 54, Echo 42, and Radiology 42) as the basis
> > for the comparison.
> >
> > We decided to share a summary of the results with the development
> > community.
> >
> > Documents Processed: 199
> >
> > Processing Time:
> > UMLSProcessor        2,439 seconds
> > FastUMLSProcessor    1,837 seconds
> >
> > Total Annotations Reported:
> > UMLSProcessor        20,365 annotations
> > FastUMLSProcessor     8,284 annotations
> >
> >
> > Annotation Comparisons:
> > Annotations common to both sets:                       3,940
> > Annotations reported only by the UMLSProcessor:       16,425
> > Annotations reported only by the FastUMLSProcessor:    4,344
> >
> >
> > If anyone is interested, the following was our test procedure:
> >
> > We used the UIMA CPE to process the document set twice, once using the
> > AggregatePlaintextUMLSProcessor pipeline and once using the
> > AggregatePlaintextFastUMLSProcessor pipeline. We used the WriteCAStoFile
> > CAS consumer to write the results to output files.
> >
> > We used a tool we recently developed to analyze and compare the
> > annotations generated by the two pipelines. The tool compares the two
> > outputs for each file and reports any differences in the annotations
> > (MedicationMention, SignSymptomMention, ProcedureMention,
> > AnatomicalSiteMention, and
> > DiseaseDisorderMention) between the two output sets. The tool reports the
> > number of 'matches' and 'misses' between each annotation set. A 'match' is
> > defined as the presence of an identified source text interval with its
> > associated CUI appearing in both annotation sets. A 'miss' is defined as
> > the presence of an identified source text interval and its associated CUI
> > in one annotation set, but no matching identified source text interval and
> > CUI in the other. The tool also reports the total number of annotations
> > (source text intervals with associated CUIs) reported in each annotation
> > set. The compare tool is in our GitHub repository at
> > https://github.com/perfectsearch/cTAKES-compare
> >
>
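For anyone who wants to reproduce the comparison described above before the
cTAKES-compare tool is finished, a minimal sketch of the core idea follows.
This is not the tool's actual code: it simply reduces every annotation to a
(begin offset, end offset, CUI) key and computes matches and misses with
plain set operations. The loadAnnotations stub and the directory names are
placeholders; real parsing depends on the WriteCAStoFile output format.

    import java.util.HashSet;
    import java.util.Set;

    /**
     * Sketch of the comparison described above: an annotation "matches" when
     * the same source text interval and CUI appear in both annotation sets.
     */
    public class AnnotationCompareSketch {

        /** An annotation reduced to its source text interval plus CUI. */
        record Key(int begin, int end, String cui) {}

        public static void main(String[] args) {
            Set<Key> umls = loadAnnotations("umls-output");          // placeholder path
            Set<Key> fastUmls = loadAnnotations("fast-umls-output"); // placeholder path

            Set<Key> common = new HashSet<>(umls);
            common.retainAll(fastUmls);                // matches: present in both sets

            Set<Key> onlyUmls = new HashSet<>(umls);
            onlyUmls.removeAll(fastUmls);              // misses: UMLSProcessor only

            Set<Key> onlyFast = new HashSet<>(fastUmls);
            onlyFast.removeAll(umls);                  // misses: FastUMLSProcessor only

            System.out.println("Total UMLSProcessor:      " + umls.size());
            System.out.println("Total FastUMLSProcessor:  " + fastUmls.size());
            System.out.println("Common to both:           " + common.size());
            System.out.println("Only UMLSProcessor:       " + onlyUmls.size());
            System.out.println("Only FastUMLSProcessor:   " + onlyFast.size());
        }

        /** Stub: parse one pipeline's output files into (begin, end, CUI) keys. */
        private static Set<Key> loadAnnotations(String dir) {
            return new HashSet<>(); // real parsing depends on the CAS consumer output
        }
    }

As a sanity check on the reported figures, the common count plus each
pipeline-only count should reproduce that pipeline's total, and it does:
3,940 + 16,425 = 20,365 and 3,940 + 4,344 = 8,284.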
