ctakes-dev mailing list archives

From Nick Nikandish <snika...@emerginghealthit.com>
Subject RE: Bacterium Dictionary
Date Mon, 30 Jun 2014 20:30:36 GMT
Many thanks Sean. This is very useful. I will follow the instructions and create the dictionary.

Thanks,
Nick

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu] 
Sent: Monday, June 30, 2014 4:27 PM
To: dev@ctakes.apache.org
Subject: RE: Bacterium Dictionary

Hi Nick,

I'm pasting (below) from a howto.txt  in the dictionarytool/doc/ directory.

You will want to do the following:
1.  Download / Install the UMLS dictionary source from NIH  (takes a while)
2.  Create a file named "bacterium.tui" containing the single line "T007"
3.  Decide what ctakes dictionary module to use, the default or newer.  Using the default may be faster for you.
4.  Build the dictionary creator tool.  I can send a prebuilt jar if you have problems.
5.  java -cp DictionaryTool.jar org.apache.ctakes.dictionarytool.DictionaryCreator -fw -umls pathToUmlsRoot -tui pathToBacterium.tui -ol sanityCheck.bsv
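
As a rough illustration of steps 2 and 5, the Python sketch below writes the one-line TUI file and assembles the command line without running it. The helper names (write_tui_file, build_creator_command) and the placeholder paths are hypothetical; only the jar name, class name, and flags are taken from the steps above.

```python
import os
import tempfile

CREATOR_CLASS = "org.apache.ctakes.dictionarytool.DictionaryCreator"


def write_tui_file(path, tuis=("T007",)):
    """Write one TUI per line (step 2). T007 is the Bacterium semantic type."""
    with open(path, "w", encoding="utf-8") as handle:
        for tui in tuis:
            handle.write(tui + "\n")
    return path


def build_creator_command(umls_root, tui_path, flat_file_out,
                          jar="DictionaryTool.jar"):
    """Assemble the step 5 command as an argument list (hypothetical helper)."""
    return ["java", "-cp", jar, CREATOR_CLASS,
            "-fw", "-umls", umls_root, "-tui", tui_path, "-ol", flat_file_out]


if __name__ == "__main__":
    tui_path = write_tui_file(os.path.join(tempfile.gettempdir(), "bacterium.tui"))
    # pathToUmlsRoot is a placeholder; substitute your own UMLS location.
    print(" ".join(build_creator_command("pathToUmlsRoot", tui_path, "sanityCheck.bsv")))
```

The argument list could be handed to subprocess.run once the jar is built and the UMLS path is real.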

After running #5 with the path to your umls installation and file with "T007" you should have
a bar-separated-value file named sanityCheck.bsv containing all the bacteria entry CUIs and
Text.  If it looks good, then you can use it directly or create a hsql database:
1.  copy resource/cacheddbtemplate/* to yourDbLocation/  You can also use the memdbtemplate.
2.  rename the * template files to suit your need (nick_bacteria)
3.  java -cp DictionaryTool.jar org.apache.ctakes.dictionarytool.DictionaryCreator -fw -umls pathToUmlsRoot -tui pathToBacterium.tui -db nick_bacteria -tbl nick_bacteria
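
Before loading a database, the flat file from step 5 can be checked with a few lines of Python. This is only a sketch: it assumes each line has the form CUI|Text, as the tool's help text describes, and the function name and sample rows (including the CUI values) are made up for illustration.

```python
def read_bsv(lines):
    """Parse bar-separated lines into (cui, text) pairs, skipping blank lines."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        # Split on the first bar only, in case the text itself contains a bar.
        cui, _, text = line.partition("|")
        entries.append((cui, text))
    return entries


if __name__ == "__main__":
    # Hypothetical sample rows; real CUIs come from your UMLS installation.
    sample = ["C0000001|example bacterium one\n", "C0000002|example bacterium two\n"]
    for cui, text in read_bsv(sample):
        print(cui, "->", text)
```

To check the real output, pass an open file handle instead of the sample list, e.g. read_bsv(open("sanityCheck.bsv", encoding="utf-8")).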

Gotta run,
Sean



>java -cp DictionaryTool.jar org.apache.ctakes.dictionarytool.DictionaryCreator

Dictionary Creator: Creates a flat file Cui|Text or Database Dictionary from UMLS and Orangebook
Database Dictionary can be indexed by each Text's First Word or Rarest Word (for the dictionary)
Minimal Usage: DictionaryCreator -umls pathToUmlsRoot -ol pathToFlatFileOutput

-fw             Create First Word Index
-umls           Umls Root Directory
-ob             Orangebook Path
-fd             Format Data Directory
-tui            Input Tui List Path
-src            Source Type List Path
-ol             Output Cui and Term List Path
-db             Output Database Url
-tbl            Output Database Table

The UMLS Root Directory must be specified
One form of output must be specified using either -ol or -db and -tbl
The default index type for databases is Rare Word Index
If an Orangebook Path is not specified then (orangebook) medication terms are not written
If a Format Data Directory is not specified then the default is used: ./data/default
If an Input Tui List Path is not specified then the cTakes Tuis are used: ./data/default/CtakesAllTuis.txt
If a Source Type List Path is not specified then Snomed is used: ./data/default/CtakesSources.txt

Important: Dictionary entries are appended to the output file or database.  
Running the same command twice will result in a database with all terms existing twice.
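
Because output is appended, an accidental second run can be spotted in a flat file by counting repeated lines. The following Python sketch is one way to do that; the function name and sample rows are hypothetical, and it only covers the -ol flat file, not a database.

```python
from collections import Counter


def find_duplicate_lines(lines):
    """Return the non-blank lines that occur more than once, with their counts."""
    counts = Counter(line.rstrip("\n") for line in lines if line.strip())
    return {line: n for line, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical rows simulating a file written by two identical runs.
    sample = ["C0000001|example one\n", "C0000002|example two\n"] * 2
    for line, n in find_duplicate_lines(sample).items():
        print(n, "x", line)
```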

The data/default/ directory does include non-default possibilities, such as files listing
only single cTakes groups:
e.g. CtakesAnatTuis.txt
and all UMLS groups:
UmlsAllTuis.txt
that can be used with the option -tui ./data/default/UmlsAllTuis.txt

There is also a file with all UMLS sources:
UmlsAllSources.txt
that can be used with the option -src ./data/default/UmlsAllSources.txt

Remember that if you want to output to a database you must specify both the url and table
name:
-db jdbc:hsqldb:file:pathToMyDatabase -tbl myTableName

Also remember that hsqldb requires the entire url to be lowercase.
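
The two reminders above (both -db and -tbl are needed, and the url must be lowercase) can be checked before launching the tool. This Python sketch is an illustration only; check_db_args is a hypothetical helper and not part of the dictionary tool.

```python
def check_db_args(db_url, table):
    """Return a list of problems with the -db / -tbl pair (empty if none found)."""
    problems = []
    if not db_url or not table:
        problems.append("both -db and -tbl must be given")
    if db_url and db_url != db_url.lower():
        problems.append("hsqldb url must be entirely lowercase")
    return problems


if __name__ == "__main__":
    # Placeholder url and table name; substitute your own.
    for problem in check_db_args("jdbc:hsqldb:file:pathToMyDatabase", "myTableName"):
        print("warning:", problem)
```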

"Format Data" refers to the data that is used to format the end-result dictionary by trimming
or expanding the umls entries.
It is recommended that the defaults are used, but you are welcome to experiment with your
own.


If you are unfamiliar with hsqldb, there are two template / starting point databases in the
resource/ directory.
cacheddbtemplate/ contains a template for a disk-cached dictionary, and memdbtemplate one
for a fully in-memory dictionary.
Using an in-memory dictionary is orders of magnitude faster than using a disk-cached one, but
it is not a good idea for very large (.5GB?) databases.


There are a few other toys that can be found by perusing the source, such as a tool that creates
a mapping of codes for like terms in different dictionaries:
ICD10|ICD9|RXNORM|SNOMEDCT
Usage: java -cp DictionaryTool.jar org.apache.ctakes.dictionarytool.CodeMapCreator -umls pathToUmlsRoot -ol pathToFlatFileOutput

Some of these extra utilities may be experimental or unfinished, so user beware.



At this time the code could use some javadocs and unit tests, plus a little cleanup.  I'm
very busy, so volunteer work is appreciated.

Enjoy


> -----Original Message-----
> From: Nick Nikandish [mailto:snikandi@emerginghealthit.com]
> Sent: Monday, June 30, 2014 3:50 PM
> To: dev@ctakes.apache.org
> Subject: RE: Bacterium Dictionary
> 
> Hi Sean,
> 
> Thanks for the info. I have written an application for clinical text 
> using Ctakes in which one of the annotators retrieves and identifies 
> bacteria in clinical texts, but it uses a small library that I 
> created. Therefore I would like to check those texts against a 
> comprehensive library like UMLS. I have a UMLS account, but I was 
> wondering how to get Ctakes to use that library. It would be great 
> if there were some documentation on building a separate dictionary using the dictionary creator.
> 
> 
> Thanks again,
> Nick
> 
> -----Original Message-----
> From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
> Sent: Monday, June 30, 2014 3:37 PM
> To: dev@ctakes.apache.org
> Subject: RE: Bacterium Dictionary
> 
> Hi Nick,
> There are ~26,000 T007 Bacterium (falls under Living Being) entries in 
> UMLS 2013aa.  They aren't in the cTakes dictionary, but you can build 
> a separate bacteria dictionary using the dictionary creator tool in 
> cTakes sandbox.  It can create dictionaries formatted for use with 
> both available cTakes-dictionary-lookup modules.  I have a full 
> living beings dictionary, if you want to somehow confirm your umls 
> license then I could pull out the bacteria for you.
> Sean
> 
> > -----Original Message-----
> > From: Pei Chen [mailto:chenpei@apache.org]
> > Sent: Monday, June 30, 2014 12:50 PM
> > To: dev@ctakes.apache.org
> > Subject: Re: Bacterium Dictionary
> >
> > Nick,
> > I am not sure how complete it is, but I believe the UMLS has the 
> > semantic type of
> >
> > Bacterium
> > <https://uts.nlm.nih.gov//semanticnetwork.html#Bacterium;0;0;2014AA>
> >
> >  [T007]
> >   It's most likely not included in the default cTAKES dictionaries though...
> >
> > Thanks,
> > Pei
> >
> >
> > On Mon, Jun 30, 2014 at 10:31 AM, Nick Nikandish < 
> > snikandi@emerginghealthit.com> wrote:
> >
> > >  Hi there,
> > >
> > >
> > >
> > > I was wondering if Ctakes has any Bacterium Dictionary? I need to 
> > > extract information for bacteria like “Enterococcus Faecium”, 
> > > > “Pseudomonas Aeruginosa”, etc., and I was wondering if I can do 
> > > it by using Ctakes annotators?
> > >
> > >
> > >
> > > Thanks,
> > >
> > >
> > >
> > > *Nick Nikandish*
> > >
> > > *Product Development Software Engineer*
> > >
> > > Clinical Research Informatics
> > >
> > >
> > >
> > > *Emerging Health*
> > >
> > > *Montefiore Information Technology*
> > >
> > > 6 Executive Blvd. Suite 290, Yonkers, NY 10701
> > >
> > > 914-457-6792 Office
> > >
> > > snikandi@montefiore.org
> > >
> > > www.emerginghealthit.com
> > >
> > > www.montefiore.org
> > >
> > >
> > >
> > > [image: logo-montefiore-it]
> > >
> > >
> > >