ctakes-user mailing list archives

From Guy Engelhard <g...@algotec.co.il>
Subject RE: Integrate Custom Dictionary in cTakes.
Date Tue, 14 Jun 2016 05:47:40 GMT
Hello Stuti,

Someone from my team investigated this and wrote the following manual for me. Unfortunately
she has already left and I don't have any more details to give you. We didn't continue with
populating new dictionaries yet. Still on our todo list.

Using customized dictionaries:

1.       Use UMLS MetamorphoSys to extract a customized subset from the UMLS DB (e.g. SNOMEDCT,
NCI) and then use ctakes dictionary tool to construct an HSQL DB based on the extracted UMLS

a.       Note that the size of the dictionary affects the runtime.

She also wrote the following on upgrading the UMLS 2011AB dictionary that comes with cTAKES
to 2015AA (I think it involves constructing the HSQL DB that is needed). Perhaps you can
use this to piece together what needs to be done:

Currently, UMLS 2011AB is the only UMLS dictionary that is available as a ctakes-compatible
HSQL DB. It can be downloaded from


and is placed at:

<CTAKES HOME>\resources\org\apache\ctakes\dictionary\lookup\umls2011ab

e.g. E:\Program Files\apache-ctakes-3.2.2-rc2\resources\org\apache\ctakes\dictionary\lookup\umls2011ab

To generate ctakes-compatible dictionaries that are based on a newer UMLS version (e.g. UMLS
2015AB) or on a specific subset (e.g. only the SNOMEDCT source), use apache-ctakes-dictionary-tool,
a package that is available as a project in my Eclipse workspace.


To generate an updated ctakes-dictionary with only terms from SNOMEDCT source, I did the following:

1.       Extract a SNOMEDCT subset from the latest UMLS version (UMLS_2015AA) (another email,
"UMLS Metamorphosys subset creation", describes how to do this) and save the output at:


2.       Create an empty cTAKES HSQL database. This can be done as follows:

a.       Copy the umls2011ab folder (E:\apache-ctakes-3.2.2-rc2\resources\org\apache\ctakes\dictionary\lookup\umls2011ab)
as a new folder (e.g. E:\NLP\Data\UMLS\umls_scratch).

b.      Clear the read-only attribute on the umls_scratch directory and all its sub-directories
(through the directory's Properties dialog). Also open umls_scratch\umls.properties as a text
file and change "readonly" to false.

c.       Run HSQL manager as administrator (runManagerSwing.bat as administrator from E:\hsqldb-2.3.3\hsqldb\bin)

d.      In the Connect Window, choose "HSQL Database Engine Standalone" and set the following
attributes for the other fields:

-          Driver: org.hsqldb.jdbcDriver

-          URL: jdbc:hsqldb:file:E:\NLP\Data\UMLS\umls_scratch\umls

-          User: SA (leave the password field empty)

e.      Delete the content of the UMLS_MS_2011AB table by executing the following SQL command:


f.        Exit from the HSQL manager.

g.       Now copy the umls_scratch directory as a new directory named umls_2015aa_snomedct
(e.g. E:\NLP\Data\UMLS\umls_2015aa_snomedct).

h.      In the future you can use copies of the emptied umls_scratch directory whenever needed.

3.       Add apache-ctakes-dictionary-tool to Eclipse as a new Java project (File --> New
--> Java Project)



5.       Copy the two files sources.txt and TUIs.txt into E:\NLP\Data\UMLS\umls_2015aa_snomedct.

6.       From Eclipse, run the DictionaryCreator (umls_ms) application of apache-ctakes-dictionary-tool
with the following arguments (apache-ctakes-dictionary-tool --> src --> org.apache.ctakes.dictionarytool
--> DictionaryCreator.java --> Run As --> Run Configurations --> Arguments):

-umls     \\COMPUTER\Data\UMLS\2015AA_snomedct\2015AA\META

-db       jdbc:hsqldb:file:\\COMPUTER\Data\ctakes\umls_2015aa_snomedct\hsql\umls

-tbl      UMLS_MS_2011AB

-tui      \\COMPUTER\Data\ctakes\umls_2015aa_snomedct\TUIs.txt

-src      \\COMPUTER\Data\ctakes\umls_2015aa_snomedct\sources.txt


7.       If you get an error, move everything in E:\NLP\Data\UMLS\umls_2015aa_snomedct other
than sources.txt and TUIs.txt into the directory hsql, or else try removing the directory
hsql from the path given in the -db argument.

8.       Run HSQL manager again, but this time connect to umls_2015aa_snomedct
(jdbc:hsqldb:file:\\COMPUTER\Data\ctakes\umls_2015aa_snomedct\hsql\umls) to check that the
UMLS_MS_2011AB table has been populated correctly.
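
Steps 2.a and 2.b (copying the folder, clearing the read-only attribute, and setting "readonly"
to false) could probably be scripted instead of done by hand. The following is only a rough
Python sketch, not something we actually ran. The function name make_scratch_copy is made up
for illustration, and it assumes the properties file contains a plain "readonly=true" line,
which is how I understand HSQLDB .properties files to look.

```python
import os
import shutil
import stat
from pathlib import Path


def make_scratch_copy(template_dir, scratch_dir, properties_name="umls.properties"):
    """Copy a dictionary folder, clear read-only flags, and set readonly=false.

    Hypothetical helper; it mirrors manual steps 2.a and 2.b only.
    Returns the list of properties files that were rewritten.
    """
    scratch = Path(scratch_dir)
    # Step 2.a: copy the shipped folder (raises if scratch_dir already exists).
    shutil.copytree(Path(template_dir), scratch)

    # Step 2.b, first half: make the copied tree writable.
    scratch.chmod(scratch.stat().st_mode | stat.S_IWRITE)
    for root, dirs, files in os.walk(scratch):
        for name in dirs + files:
            path = Path(root) / name
            path.chmod(path.stat().st_mode | stat.S_IWRITE)

    # Step 2.b, second half: flip the readonly key (assumed "key=value" layout).
    changed = []
    for props in scratch.rglob(properties_name):
        new_lines = []
        for line in props.read_text(encoding="latin-1").splitlines():
            key, sep, _value = line.partition("=")
            if sep and key.strip().lower() == "readonly":
                line = f"{key.strip()}=false"
            new_lines.append(line)
        props.write_text("\n".join(new_lines) + "\n", encoding="latin-1")
        changed.append(props)
    return changed
```

Calling make_scratch_copy(r"<CTAKES HOME>\resources\...\umls2011ab", r"E:\NLP\Data\UMLS\umls_scratch")
with your own paths would leave the original folder untouched. Emptying the table (step 2.e)
still has to be done in HSQL manager.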

From: Stuti Awasthi [mailto:stutiawasthi@hcl.com]
Sent: Tuesday, June 14, 2016 8:20 AM
To: 'user@ctakes.apache.org'
Subject: RE: Integrate Custom Dictionary in cTakes.

Hello Everyone,
I'm still waiting for a response; even some pointers would be helpful.

Thanks & Regards
Stuti Awasthi

From: Stuti Awasthi
Sent: Monday, June 13, 2016 4:24 PM
To: user@ctakes.apache.org
Subject: Integrate Custom Dictionary in cTakes.

Hi All,
I'm using cTAKES 3.2 and would like to use a custom dictionary in place of UMLS to run a few trials.
The documentation says that a new dictionary needs to be in BSV or HSQL format, but I didn't
find more details on this.
I need some help with the following:

·         How to convert my custom dictionary to BSV format (| separated).

·         How to integrate the new custom dictionary into cTAKES, and which configuration files
need to be modified to include it.

My present dictionary looks like this:
CUI         EnglishPreferredName
C1548760        Risk Codes - Aggressive
C1548761        Biohazard - Risk Codes
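
For the first point, a conversion along these lines might be enough. This is only a sketch in
Python: the function name to_bsv_lines is made up, and it assumes each row is a CUI followed by
whitespace and then the name. Whether cTAKES accepts a two-column CUI|text file or also needs a
TUI column is an assumption here and should be checked against the dictionary lookup documentation.

```python
import re


def to_bsv_lines(rows, has_header=True):
    """Turn 'CUI<whitespace>name' rows into 'CUI|name' lines (hypothetical helper)."""
    bsv = []
    header_pending = has_header
    for raw in rows:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if header_pending:
            header_pending = False  # drop the first non-empty line (column titles)
            continue
        parts = re.split(r"\s+", line, maxsplit=1)
        if len(parts) != 2:
            continue  # row has no name column, skip it
        cui, name = parts
        if "|" in name:
            raise ValueError(f"name contains the separator: {name!r}")
        bsv.append(f"{cui}|{name}")
    return bsv


sample = [
    "CUI         EnglishPreferredName",
    "C1548760        Risk Codes - Aggressive",
    "C1548761        Biohazard - Risk Codes",
]
for bsv_line in to_bsv_lines(sample):
    print(bsv_line)
# prints:
# C1548760|Risk Codes - Aggressive
# C1548761|Biohazard - Risk Codes
```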

Thanks in advance.

Stuti Awasthi
