ctakes-dev mailing list archives

From "Finan, Sean" <Sean.Fi...@childrens.harvard.edu>
Subject RE: The fast dictionary pipeline vs. the regular one
Date Mon, 22 Jun 2015 14:12:34 GMT
Hi all,

I’m glad that there continues to be interest in the fast alternative to the dictionary lookup
and I welcome all testing.

GBM actually is Glioblastoma Multiforme, hence the “M”. The WHO name is the abbreviated
“Glioblastoma”, but as far as I can discern they are not different things.
If you check the Metathesaurus 2011AB, GBM brings up both Glioblastoma C0017636 and Glioblastoma
Multiforme C1621958. The first comes from MeSH and NCI, the second from CSP. If you look
at the definitions they are synonymous: “malignant form of astrocytoma histologically characterized
by pleomorphism of cells, nuclear atypia, microhemorrhage and necrosis; may arise in any region
of the central nervous system, with a predilection for the cerebral hemispheres, basal ganglia,
and commissural pathways.” Mapping to a different CUI in the UMLS does not always mean
that the concepts are truly different. It often means that they came from 2 different source
dictionaries (as in this case). Also check https://en.wikipedia.org/wiki/Glioblastoma_multiforme
But I am a little confused: are you saying that you got only Glioblastoma Multiforme C1621958
and not Glioblastoma C0017636? When I run it I get both returns …
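
To make the point about one term mapping to several CUIs concrete, here is a minimal sketch in Python. It is not cTAKES code and does not reflect how either dictionary pipeline is implemented; the row layout and the names `ROWS`, `build_index` and `concepts_for` are hypothetical. The two rows are only the CUIs, names and sources mentioned in this message.

```python
from collections import defaultdict

# Hypothetical rows: (term, CUI, preferred name, source vocabularies).
# Values are the ones quoted in this message for the 2011AB Metathesaurus.
ROWS = [
    ("gbm", "C0017636", "Glioblastoma", ("MeSH", "NCI")),
    ("gbm", "C1621958", "Glioblastoma Multiforme", ("CSP",)),
]


def build_index(rows):
    """Group concept rows by their lowercase term."""
    index = defaultdict(list)
    for term, cui, name, sources in rows:
        index[term.lower()].append((cui, name, sources))
    return index


def concepts_for(index, term):
    """Return every (CUI, name, sources) entry recorded for a term."""
    return index.get(term.lower(), [])


index = build_index(ROWS)
for cui, name, sources in concepts_for(index, "GBM"):
    print(cui, name, "from", ", ".join(sources))
```

A lookup that returns every entry for a term reports both concepts for "GBM"; one that keeps only a single entry per term would report just one of them, even though the definitions are synonymous.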

Britt is correct (thank you): if you change the default minimum span from 3 to 2 you
will get Cutaneous Mastocytosis C1136033 within “5.5 cm”. The default minimum span is 3 (not
2) to prevent obviously garbage returns such as Cutaneous Mastocytosis for every
“cm”. However, feel free to change it to fit your purposes. 2 characters is the minimum;
you cannot look up 1-character terms with the default dictionary. You can do so with a
custom dictionary if you like, which might be useful if you just have 1 or 2 single-character


From: britt fitch [mailto:britt.fitch@wiredinformatics.com]
Sent: Monday, June 22, 2015 9:24 AM
To: dev@ctakes.apache.org
Subject: Re: The fast dictionary pipeline vs. the regular one

Regarding the miss on “cm” in #2, you might want to check the dictionary XML descriptor
or the uimaFIT wiring, depending on which you are using, for the parameter “minimumSpan”.
If I recall correctly, the default minimum span is 3 characters, but you can reduce it
to 2 if desired.
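
The effect of a minimum span on a term like “cm” can be illustrated with a small Python sketch. This is a toy lookup under stated assumptions, not the cTAKES implementation: the dictionary contents, the whitespace tokenization and the `lookup` function with its `minimum_span` argument are all hypothetical, and only the CUIs come from this thread.

```python
# Toy dictionary: lowercase term -> (CUI, preferred name).
# The entries are taken from this thread; everything else is illustrative.
TOY_DICTIONARY = {
    "cm": ("C1136033", "Cutaneous Mastocytosis"),
    "gbm": ("C0017636", "Glioblastoma"),
}


def lookup(text, minimum_span=3):
    """Return (token, CUI, name) for each whitespace-separated token that is
    in the toy dictionary and is at least minimum_span characters long."""
    hits = []
    for token in text.split():
        if len(token) < minimum_span:
            continue  # too short to be looked up at all
        entry = TOY_DICTIONARY.get(token.lower())
        if entry is not None:
            hits.append((token, entry[0], entry[1]))
    return hits


text = "GBM measuring 5.5 cm X 2.6 cm"
print(lookup(text, minimum_span=3))  # "cm" is skipped
print(lookup(text, minimum_span=2))  # each "cm" is now annotated
```

With a minimum span of 3 only “GBM” is annotated; lowering it to 2 also annotates each “cm” as Cutaneous Mastocytosis.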



Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110

On Jun 21, 2015, at 2:45 PM, Miller, Timothy <Timothy.Miller@childrens.harvard.edu> wrote:

Sean wrote the fast version and may be able to answer your specific questions. But in general,
the fast dictionary does not exactly match the regular one's results: it does not implement an
equivalent search and it uses different indexing methods. We are happy to receive reports of what
seem like bugs, though; any new software is likely to have some. What I will say is that I know
Sean has run some (as yet unpublished) experiments and we believe that in the aggregate the
new system output is at least as high quality as the older one.

From: Oranit Dror [oranit@algotec.co.il]
Sent: Sunday, June 21, 2015 4:37 AM
To: dev@ctakes.apache.org
Subject: The fast dictionary pipeline vs. the regular one


I am using cTAKES 3.2.2 with the regular pipeline. Recently, I tested the fast dictionary
pipeline and indeed it is much faster.
However, I have encountered several quality differences in the returned annotations.
For example:

1. With the fast pipeline, the term "GBM" is annotated as "glioblastoma multiforme",
while in the regular pipeline it is annotated as "glioblastoma".
Note that according to the UMLS DB, the concept of "GBM" is "glioblastoma" and "glioblastoma
multiforme" is mapped to a narrower concept.

2. The word "cm" in a phrase like "5.5 cm X 2.6 cm" is annotated by the regular pipeline
as "Cutaneous Mastocytosis", while in the fast pipeline it is not annotated as a medical
term (as expected and as in UMLS).

Any explanation for the differences?

Thank you,
