ctakes-dev mailing list archives

From "Finan, Sean" <Sean.Fi...@childrens.harvard.edu>
Subject RE: cTAKES false positives, case-insensitivity
Date Wed, 01 Jun 2016 21:26:47 GMT
Hi Tomasz,
The gui doesn't yet have the capability to use different data/ directories.  However, you
can copy the "tiny/" directory contents into "default/" and get almost the same thing - you
probably noticed that the data/ directory also exists in the gui.  A few things were improved
in the gui version of the creator and the "official" ctakes version uses 2011AB, so the numbers
will not be exact.

Sean

-----Original Message-----
From: Tomasz Oliwa [mailto:oliwa@uchicago.edu] 
Sent: Wednesday, June 01, 2016 4:38 PM
To: dev@ctakes.apache.org
Subject: RE: cTAKES false positives, case-insensitivity

Sean,

Thank you for your answers, I really appreciate it.

Using the default setting, the dictionary-gui for me creates a 113 MB customumls2015.script,
with 1022649 rows in CUI_TERMS and distinct CUIs: 356056.

Using the same source of UMLS (2015 level 0 and Snomed) and the older dictionarytool.jar,
I can customize the flags such as -atui ./data/tiny/CtakesAnatTuis.txt or -fd ./data/tiny
and have 536821 rows in CUI_TERMS and distinct CUIs: 225867, which is close to the numbers
of the official cTAKES umls dictionary. 

Can I pass some parameters or change folder names for the dictionary-gui to get similar numbers?

Thanks,
Tomasz
________________________________________
From: Finan, Sean [Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, June 01, 2016 2:40 PM
To: dev@ctakes.apache.org
Subject: RE: cTAKES false positives, case-insensitivity

Hi Tomasz,
The change to lowercase is also done in the dictionary code.
Unless you want to make a database for the previous dictionary lookup module (it looks like
you don't), you shouldn't bother with the old dictionarytool.jar.  Use the newer dictionary-gui
in sandbox instead.
The class there is org.apache.ctakes.dictionary.creator.util.TextTokenizer
In the getTokenizedText(..) method, line 177, just remove the .toLowerCase()

In the ctakes -fast module code you will need to replace the ...dictionary.lookup2.util.FastLookupToken
class and remove the .toLowerCase() from the constructor method, line 45.  You cannot extend that
class as it is immutable.
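
To illustrate why the lowercasing matters on both sides, here is a minimal, self-contained sketch. It is not cTAKES code: the class CaseLookupSketch, its methods, and the one-entry dictionary are hypothetical stand-ins for what happens when .toLowerCase() is applied to, or removed from, both the dictionary terms and the lookup tokens.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in cTAKES.
final class CaseLookupSketch {

    // Mimics the two behaviours discussed above: lowercase the text, or keep it as-is.
    static String normalize(String text, boolean lowerCase) {
        return lowerCase ? text.toLowerCase(Locale.ROOT) : text;
    }

    // Builds a tiny term -> CUI map, normalizing the term the way a dictionary build would.
    static Map<String, String> buildDictionary(boolean lowerCase) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put(normalize("AIDS", lowerCase), "C0001175");
        return dictionary;
    }

    // Looks up a token from the note, normalized the same way as the dictionary terms.
    static String lookup(Map<String, String> dictionary, String token, boolean lowerCase) {
        return dictionary.get(normalize(token, lowerCase));
    }

    public static void main(String[] args) {
        Map<String, String> insensitive = buildDictionary(true);
        Map<String, String> sensitive = buildDictionary(false);
        System.out.println(lookup(insensitive, "aids", true));  // lowercased on both sides
        System.out.println(lookup(sensitive, "aids", false));   // case kept on both sides
        System.out.println(lookup(sensitive, "AIDS", false));   // case kept on both sides
    }
}
```

With lowercasing on both sides, "aids" from "hearing aids" matches the AIDS entry. With case preserved on both sides, only "AIDS" matches. If only one side were changed, a mixed-case dictionary term would no longer match a lowercased token, which is presumably why both places need the edit.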

Sean

-----Original Message-----
From: Tomasz Oliwa [mailto:oliwa@uchicago.edu]
Sent: Wednesday, June 01, 2016 3:20 PM
To: dev@ctakes.apache.org
Subject: RE: cTAKES false positives, case-insensitivity

Another idea would be to create the dictionary without lowercasing the concept text and rare
word in CUI_TERMS, but keep them as they are from the UMLS.

Do you happen to know which class / line is responsible for the lowercasing in the dictionarytool.jar?
I would like to try this.

Regards,
Tomasz

________________________________________
From: Tomasz Oliwa [oliwa@uchicago.edu]
Sent: Wednesday, June 01, 2016 11:07 AM
To: dev@ctakes.apache.org
Subject: RE: cTAKES false positives, case-insensitivity

Thank you all for the suggestions.

Sean, by "make the AE case-sensitive" do you mean writing an annotator that simply removes
an annotation based on some criteria like case and semantic type? Or does cTAKES have such
a switch already available?

________________________________________
From: Finan, Sean [Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, June 01, 2016 10:56 AM
To: dev@ctakes.apache.org
Subject: RE: cTAKES false positives, case-insensitivity

Oh - I should mention:
Increasing the minimum required span can cause unwanted false negatives.  A minimum of 5
will get rid of things like "arm" and "foot".  You could make your own AE that changes this
by getting rid of only disease/disorder with character count < 5.  That would probably
be better.  Also maybe meds with count < 5.  You can even make the AE case-sensitive in
case that helps.
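
A minimal sketch of that filtering rule, in plain Java so it runs without cTAKES or UIMA. Everything here is hypothetical: the Mention class, the group labels, and the "keep short mentions only when they are all uppercase" rule are assumptions standing in for one possible reading of "case-sensitive". A real AE would read and remove annotations from the JCas instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; a real annotator would work on the UIMA JCas.
final class ShortMentionFilterSketch {

    // Simplified stand-in for an annotation: covered text plus a semantic group label.
    static final class Mention {
        final String text;
        final String group;

        Mention(String text, String group) {
            this.text = text;
            this.group = group;
        }
    }

    static final int MIN_LENGTH = 5;

    // Assumed case rule: a short mention written entirely in uppercase is treated as an acronym.
    static boolean isAllUpperCase(String text) {
        return text.equals(text.toUpperCase(Locale.ROOT));
    }

    // Drop only short disease/disorder and medication mentions that are not all uppercase.
    static boolean keep(Mention mention) {
        boolean restricted = mention.group.equals("DiseaseDisorder")
                || mention.group.equals("Medication");
        return !restricted
                || mention.text.length() >= MIN_LENGTH
                || isAllUpperCase(mention.text);
    }

    static List<Mention> filter(List<Mention> mentions) {
        List<Mention> kept = new ArrayList<>();
        for (Mention mention : mentions) {
            if (keep(mention)) {
                kept.add(mention);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Mention> mentions = new ArrayList<>();
        mentions.add(new Mention("aids", "DiseaseDisorder"));     // dropped
        mentions.add(new Mention("AIDS", "DiseaseDisorder"));     // kept: all uppercase
        mentions.add(new Mention("arm", "AnatomicalSite"));       // kept: not restricted
        mentions.add(new Mention("leukemia", "DiseaseDisorder")); // kept: long enough
        for (Mention mention : filter(mentions)) {
            System.out.println(mention.text);
        }
    }
}
```

Restricting the length check to the two semantic groups keeps short anatomical terms such as "arm" and "foot", which a global minimum span of 5 would remove.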

Sean

-----Original Message-----
From: Tomasz Oliwa [mailto:oliwa@uchicago.edu]
Sent: Wednesday, June 01, 2016 11:28 AM
To: dev@ctakes.apache.org
Subject: cTAKES false positives, case-insensitivity

Hi,

I have encountered false positives annotated with cTAKES that seem to come from case-insensitivity
of the annotation lookup, such as:

Pt uses hearing aids. -> "aids" is found as DiseaseDisorderMention cui=C0001175, Acquired
Immunodeficiency Syndrome

Pt values are all stable. -> "all" is found as DiseaseDisorderMention cui=C1961102, Precursor
Cell Lymphoblastic Leukemia Lymphoma

Are there ways in cTAKES to approach or to resolve such issues?

How do you deal with such false positives, so that they are not matched?

Regards,
Tomasz
