ctakes-dev mailing list archives

From "AndyMC@apache.org (forwarding)" <mcmurry.a...@gmail.com>
Subject building a *real sample dictionary* without UMLS login
Date Fri, 02 Oct 2015 07:43:30 GMT
Greetings ctakes-dev! 

I have been polishing MedGen (UMLS) dictionaries for over a year now and I am confident in
saying "this is solid". 
As a reminder, the medgen-mysql package contains a large subset of the UMLS that can be downloaded
without UMLS login, greatly simplifying the creation of an example dictionary. 
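
For illustration only, here is a minimal Python sketch of how pipe-delimited concept rows might be turned into a simple term-to-CUI lookup. The three-column layout (CUI, SAB, STR), the function name build_dictionary, and the sample rows are all hypothetical placeholders; the actual files have their own column order, so check their README before adapting this.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column layout, for illustration only. The real files
# define their own columns; consult their README before adapting this.
COLUMNS = ["CUI", "SAB", "STR"]


def build_dictionary(lines, columns=COLUMNS):
    """Map each lowercased term (STR) to the set of CUIs that use it."""
    reader = csv.DictReader(
        lines, fieldnames=columns, delimiter="|", quoting=csv.QUOTE_NONE
    )
    term_to_cuis = defaultdict(set)
    for row in reader:
        term = (row["STR"] or "").strip().lower()
        if term:  # skip rows with an empty or missing term
            term_to_cuis[term].add(row["CUI"])
    return dict(term_to_cuis)


# Made-up rows with placeholder CUIs; not real data.
SAMPLE = io.StringIO(
    "C0000001|MSH|Example Disease\n"
    "C0000001|SNOMEDCT_US|example disease\n"
    "C0000002|NCI|Example Drug\n"
)
sample_dictionary = build_dictionary(SAMPLE)
```

Lowercasing the term is just one possible normalization; a real dictionary lookup would likely need more than this.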

QUESTION: 
Would you like me to integrate this into cTAKES to simplify installation for new users, and
if so, what would be your preferred method?

Source Vocabularies (SAB)
+-------------+--------+--------------------------+
| SourceVocab | cnt    | Description              |
+-------------+--------+--------------------------+
| MSH         | 245435 | Medical Subject Headings |
| SNOMEDCT_US | 156105 | SNOMED Clinical Terms    |
| NCI         | 136888 | NCI Cancer Terms         |
| ...         |  ...   | ...                      |
+-------------+--------+--------------------------+

Semantic Types (STY)
+-------------------------------------------+--------+
| SemanticType                              | cnt    |
+-------------------------------------------+--------+
| Pharmacologic Substance                   | 102511 |
| Finding                                   |  90413 |
| Organic Chemical                          |  81329 |
| Disease or Syndrome                       |  47223 |
| Neoplastic Process                        |  16151 |
| Amino Acid, Peptide, or Protein           |   9383 |
| Congenital Abnormality                    |   6536 |
| Pathologic Function                       |   5655 |
| Steroid                                   |   3919 |
| Sign or Symptom                           |   2909 |
| ...                                       |   ...  |
+-------------------------------------------+--------+
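
Counts like the ones in the two tables above can be produced with a GROUP BY query. The sketch below uses an in-memory SQLite database so it runs anywhere; the table name concepts, its columns, the helper count_by, and the sample rows are assumptions made for illustration and are not the medgen-mysql schema. Whether the counts above are per distinct CUI or per row is not stated here, so the use of COUNT(DISTINCT CUI) is also an assumption.

```python
import sqlite3


def count_by(conn, column):
    """Count distinct CUIs per value of `column`, largest group first."""
    # Column names cannot be bound as SQL parameters, so whitelist them.
    if column not in ("SAB", "STY"):
        raise ValueError(f"unsupported column: {column}")
    query = (
        f"SELECT {column}, COUNT(DISTINCT CUI) AS cnt "
        f"FROM concepts GROUP BY {column} ORDER BY cnt DESC, {column}"
    )
    return conn.execute(query).fetchall()


# Hypothetical schema and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concepts (CUI TEXT, SAB TEXT, STY TEXT)")
conn.executemany(
    "INSERT INTO concepts VALUES (?, ?, ?)",
    [
        ("C0000001", "MSH", "Finding"),
        ("C0000002", "MSH", "Finding"),
        ("C0000003", "NCI", "Steroid"),
    ],
)

sab_counts = count_by(conn, "SAB")
sty_counts = count_by(conn, "STY")
```

The same query shape should carry over to MySQL once the real table and column names are substituted.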


What would you like to see?
AndyMC@apache.org	


On Nov 12, 2014, at 6:14 AM, "Dligach, Dmitriy" <Dmitriy.Dligach@childrens.harvard.edu>
wrote:

> Andy, thank you for this resource!
> 
> Do you have an estimate of what percentage of UMLS concepts were left out?
> 
> Dima
> 
> 
> 
> 
> On Nov 11, 2014, at 16:02, andy mcmurry <mcmurry.andy@gmail.com> wrote:
> 
>> Hello!
>> 
>> https://bitbucket.org/invitae/medgen-mysql (Apache Licensed ASL2)
>> 
>> We just released a new library containing a huge chunk of UMLS concepts
>> which are available without registering accounts/username/passwords.
>> LEGALLY. Yes, really!
>> 
>> The subset is from NCBI and it contains *thousands of concepts from SNOMED
>> and other vocabularies*.
>> 
>> The code is essentially:
>> 1. a list of wget targets on various NCBI FTP site mirrors
>> 2. a Makefile for building the databases of interest
>> 
>> Our legal team has approved distribution for Open Access work, ASL2
>> LICENSE.
>> 
>> I recommend we use this opportunity to make this the default distribution
>> for cTAKES UMLS connections, because it removes the need for so much
>> painful credentialing and back-and-forth agreements with the US National
>> Library of Medicine.
>> 
>> Cheers!
>> --Andy
>> 
>> 
>> On Wed, Sep 10, 2014 at 12:13 PM, Masanz, James J. <Masanz.James@mayo.edu>
>> wrote:
>> 
>>> 
>>> I would love to see the install be as simple as apt-get install, ending up
>>> with a working dictionary that has more than a handful of entries to
>>> get new users started.
>>> 
>>> Regards,
>>> James Masanz
>>> 
>>> -----Original Message-----
>>> From: andy mcmurry [mailto:mcmurry.andy@gmail.com]
>>> Sent: Tuesday, September 09, 2014 4:32 PM
>>> To: ctakes-dev@incubator.apache.org
>>> Subject: Recommendation for ctakes default (UMLS) dictionaries
>>> 
>>> Greetings ctakes-dev:
>>> 
>>> UMLS license restrictions have become more relaxed over the years:
>>> much of the UMLS can be downloaded directly from the official NCBI FTP
>>> site.
>>> 
>>> In fact, the NIH (and implicitly the NLM) *have already made the standard
>>> terms public for some medical specialities*.
>>> 
>>> For example, here is the UMLS subset specific to Medical Genetics (MedGen)
>>> and Genetic Testing (GTR), complete with SNOMED CT concept CUIs, names,
>>> etc.:
>>> 
>>> [  ftp://ftp.ncbi.nlm.nih.gov/pub/medgen/README.html  ]
>>> 
>>> My team has developed a JVM-based wrapper (written in Clojure) for
>>> MetaMap 2013AB, which I intend to open source soon. It includes REST
>>> support for invoking MetaMap with any or all of the command line arguments.
>>> It does not integrate with UIMA; it is basically a wrapper around the
>>> binary installation of MetaMap. The emphasis is on publication text, not
>>> clinical text. Still, some services are common (such as LVG).
>>> 
>>> Strangely, the NLM still requires a UMLS license to download the MetaMap
>>> binaries. The MetaMap binary install is better, but customizing
>>> dictionaries (DataFileBuilder) is not as easy as it is with cTAKES and YTEX:
>>> 
>>> [ https://cwiki.apache.org/confluence/display/CTAKES/YTEX+Installation ]
>>> 
>>> *** Hence, there is a real opportunity here to enable Apache cTAKES to
>>> have a stronger default dictionary. ***
>>> 
>>> Imagine if we could
>>> $ apt-get install apache-ctakes
>>> 
>>> and instantly have a working package for SOME problem domain.
>>> In my case (Medical Genetics), the UMLS definitions are already available
>>> and the UMLS license problem becomes a non-issue, at least for many
>>> first-time users.
>>> 
>>> Your thoughts?
>>> AndyMC
>>> 
> 

