ctakes-dev mailing list archives

From "Finan, Sean" <Sean.Fi...@childrens.harvard.edu>
Subject RE: How to add a new dictionary database to cTAKES
Date Fri, 28 Feb 2014 15:00:25 GMT
Hi Abhishek,

You have some interesting timing ...
I can give you the xml specifications that you require if you send me the format of your dictionary.

Since you are new to the current dictionary module setup, I might also have a simpler solution
for you ...

A couple of days ago I checked a new module into Sandbox called ctakes-dictionary-lookup2
(how novel a name).  It is a complete replacement for the current dictionary lookup module,
but both can sit side-by-side in your local trunk sandbox or build.  It comes with an example
dictionary and an example xml descriptor that tells the module to read a bar-separated value
(BSV) file as a dictionary, storing it (indexed) in memory for fast lookup.
It accepts 2- or 3-column files in the format CUI|Text or CUI|TUI|Text.  It automatically
detects the number of columns, but they must be in that order.  The text fields do not need
to be tokenized: it accepts "Tumor, malignant" as well as "tumor , malignant" because it
performs the tokenization when it reads the file.
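
To make the BSV format above concrete, here is a minimal Python sketch of reading such lines.  It is not the ctakes-dictionary-lookup2 code (that module is Java), and the names Term, tokenize and parse_bsv_line are hypothetical, as are the placeholder CUI and TUI values.  Lowercasing and splitting punctuation into separate tokens are assumptions about what "tokenization" means here; the module's actual rules may differ.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Term(NamedTuple):
    """One dictionary entry (hypothetical structure, for illustration only)."""
    cui: str
    tui: Optional[str]
    tokens: Tuple[str, ...]


def tokenize(text: str) -> Tuple[str, ...]:
    # Assumption: lowercase the text, then treat each run of word characters
    # and each punctuation mark as its own token.
    return tuple(re.findall(r"\w+|[^\w\s]", text.lower()))


def parse_bsv_line(line: str) -> Term:
    """Parse one 'CUI|Text' or 'CUI|TUI|Text' line, detecting the column count."""
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    if len(fields) == 2:
        cui, text = fields
        tui = None
    elif len(fields) == 3:
        cui, tui, text = fields
    else:
        raise ValueError(f"expected 2 or 3 bar-separated columns, got {len(fields)}")
    return Term(cui, tui, tokenize(text))


# Placeholder identifiers, not real UMLS codes.
two_column = parse_bsv_line("C1234567|Tumor, malignant")
three_column = parse_bsv_line("C1234567|T000|tumor , malignant")
```

Under these assumptions both lines produce the same token sequence, which is the point of tokenizing at load time: the dictionary file does not have to be pre-tokenized.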
Because the dictionary is stored in memory, it should not be huge.  If you have a very large
number of terms (>50k), I recommend an hsql db instead.  The new module will take an hsql db
with the fixed field names CUI, TUI, RINDEX, TCOUNT, TEXT, RWORD.  I will explain what those
mean in some documentation that I plan to check into sandbox later today, but I can help you
build an hsql dictionary db ...
Yesterday I checked into sandbox a project named "dictionarytool".  It is source-only, but
I can give you a jar if you want one.  Out-of-the-box it will build various dictionaries from
a UMLS download.  It can build BSV, Hsql (new format) and Hsql (current format) to be used
by the new or current dictionary lookup modules.

This devlist announcement is a little premature on my part.  I will not get usage documentation
into sandbox for a day or two, but I can send you copies as I go if you are in a hurry, or
just give you xml snippets for the current module descriptors.  If you send the format of
your dictionary then that can be done quickly.  I just wanted to let you know that there is
another option wrt dictionary lookup.


-----Original Message-----
From: Abhishek De [mailto:abhishek.de@alumnux.com] 
Sent: Friday, February 28, 2014 6:58 AM
To: dev@ctakes.apache.org
Subject: How to add a new dictionary database to cTAKES



How do I add a new database to the cTAKES pipeline to perform lookup from? How do I specify
what columns to look up and how to annotate the text with the returned hits? I have gone through
the DictionaryLookupAnnotatorDB.xml and LookupDesc_Db.xml files. However, I could not understand
the meanings of the terms like "lookupField", "metaField", "maxPermutationLevel" and "exclusionTags".
If I add a new database, I need to configure this xml file properly. Please guide me regarding
these problems. 

Thanks and Regards, 

Abhishek De 