ctakes-dev mailing list archives

From "Finan, Sean" <Sean.Fi...@childrens.harvard.edu>
Subject RE: Announcement: UMLS MedGen-MySQL dataset now available as open access download
Date Fri, 14 Nov 2014 14:39:36 GMT
Hi Andy,

Great stuff!  I think that I understand the method, but I have a question about the statement:

>the content is publicly available per the NCBI policy and license for MedGen sources

Does this mean that I, Joe Anybody, could download the content, place some of the content
in a database structured in my own fashion, package the -new- database, and include it in
a cTakes distribution?
Or, does it mean that content downloaded by script is usable as-is and only as-is?  The whole
"if I'd known you were going to do that I wouldn't have given it to you ..."

Thanks,
Sean

________________________________________
From: andy mcmurry [mcmurry.andy@gmail.com]
Sent: Thursday, November 13, 2014 6:59 PM
To: dev@ctakes.apache.org
Subject: Re: Announcement: UMLS MedGen-MySQL dataset now available as open access download

Pei: Yes, specifically:

The source code was released by Invitae under Apache ASL 2.0 per my request
and with full blessing from our legal counsel and software team. I also
reviewed the idea in principle with John Wilbanks of Sage Bionetworks (and
formerly of Creative Commons). This is legit, or I wouldn't have spent tons of
hours doing it.

The raw content is a set of scripts which wget a list of URLs from the NCBI
public FTP repositories. This code DOES NOT redistribute any content
whatsoever, just a list of URLs to download, unzip, and insert into a local
MySQL database. To repeat: I am NOT circulating any content, just URL links
-- you must download the content yourself. And that is the beauty -- all
content is downloaded BY THE USER and the content is publicly available per
the NCBI policy and license for MedGen sources.
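
For readers who want a concrete picture of the download-it-yourself approach
described above, here is a minimal Python sketch. It is not the medgen-mysql
code (which, per this thread, is a list of wget targets plus a Makefile). The
function names local_path, gunzip, and mirror are hypothetical, the only URL
listed is the MedGen README cited elsewhere in this thread, and the MySQL load
step is left out entirely.

```python
import gzip
import shutil
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# Assumption: a stand-in URL list. The real project ships its own list of
# NCBI targets; only the README URL below appears in this thread.
URLS = [
    "ftp://ftp.ncbi.nlm.nih.gov/pub/medgen/README.html",
]


def local_path(url, dest_dir):
    """Map a URL to a file under dest_dir, keeping only the file name."""
    return Path(dest_dir) / Path(urlparse(url).path).name


def gunzip(path):
    """Decompress 'name.gz' to 'name' in the same directory; return the new path."""
    target = path.with_suffix("")
    with gzip.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def mirror(urls, dest_dir, fetch=urlretrieve):
    """Download each URL into dest_dir and gunzip any .gz file.

    The user runs this locally, so nothing is redistributed: the script
    carries only the URLs, never the content itself.
    """
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    results = []
    for url in urls:
        path = local_path(url, dest_dir)
        fetch(url, str(path))  # fetch is swappable so the logic can be tested offline
        if path.suffix == ".gz":
            path = gunzip(path)
        results.append(path)
    return results
```

Calling mirror(URLS, "downloads") would fetch the listed files into a local
directory. Loading them into a database would be a separate step.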


On Thu, Nov 13, 2014 at 11:18 AM, Chen, Pei <Pei.Chen@childrens.harvard.edu>
wrote:

> John- I believe that was the thinking.
> Andy- Just to confirm- Is the raw content of this dataset released under
> ASL2.0?  i.e. can you contribute it as a CSV or similar so that cTAKES may
> re-tokenize it using the same PTB rules, format it for cTAKES' dictionary
> lookup, etc., and then redistribute it under the same license?
>
> > -----Original Message-----
> > From: John Green [mailto:john.travis.green@gmail.com]
> > Sent: Thursday, November 13, 2014 1:55 PM
> > To: dev@ctakes.apache.org
> > Cc: dev@ctakes.apache.org
> > Subject: Re: Announcement: UMLS MedGen-MySQL dataset now available
> > as open access download
> >
> > The old licensed setup would be kept as a packaged option? Much as it is
> > now.... With the unlicensed going out in place of the current "free"
> > dictionary? Am I understanding that right?
> >
> >
> > JG
> > —
> > Sent from Mailbox
> >
> > On Thu, Nov 13, 2014 at 1:40 PM, andy mcmurry
> > <mcmurry.andy@gmail.com>
> > wrote:
> >
> > > I'll crunch the numbers -- in the meantime I can tell you that
> > > phenotypes vary by semantic type. Clinical attributes from SNOMED are
> > > abundant, and many concepts in MeSH are mapped to diseases. There are
> > > also tons of "pharmacological substances".
> > > On Nov 12, 2014 6:19 AM, "Dligach, Dmitriy" <
> > > Dmitriy.Dligach@childrens.harvard.edu> wrote:
> > >> Andy, thank you for this resource!
> > >>
> > >> Do you have an estimate of what percentage of UMLS concepts were left out?
> > >>
> > >> Dima
> > >>
> > >>
> > >>
> > >>
> > >> On Nov 11, 2014, at 16:02, andy mcmurry <mcmurry.andy@gmail.com> wrote:
> > >>
> > >> > Hello!
> > >> >
> > >> > https://bitbucket.org/invitae/medgen-mysql (Apache Licensed ASL2)
> > >> >
> > >> > We just released a new library containing a huge chunk of UMLS
> > >> > concepts which are available without registering
> > >> > accounts/username/passwords. LEGALLY. Yes, really!
> > >> >
> > >> > The subset is from NCBI and it contains *thousands of concepts from
> > >> > SNOMED and other vocabularies*.
> > >> >
> > >> > The code is essentially
> > >> > 1. a list of WGET targets to various NCBI FTP site mirrors
> > >> > 2. Makefile for building the databases of interest
> > >> >
> > >> > Our legal team has approved distribution for Open Access work, ASL2
> > >> > LICENSE.
> > >> >
> > >> > I recommend we use this opportunity to make this the default
> > >> > distribution for CTAKES UMLS connections, because it obviates the
> > >> > need for so much painful credentialing and back and forth
> > >> > agreements with the US National Library of Medicine.
> > >> >
> > >> > Cheers!
> > >> > --Andy
> > >> >
> > >> >
> > >> > On Wed, Sep 10, 2014 at 12:13 PM, Masanz, James J. <Masanz.James@mayo.edu> wrote:
> > >> >
> > >> >>
> > >> >> I would love to see the install be as simple as apt-get install to
> > >> >> end up with some working dictionary that has more than a handful of
> > >> >> entries to get them started.
> > >> >>
> > >> >> Regards,
> > >> >> James Masanz
> > >> >>
> > >> >> -----Original Message-----
> > >> >> From: andy mcmurry [mailto:mcmurry.andy@gmail.com]
> > >> >> Sent: Tuesday, September 09, 2014 4:32 PM
> > >> >> To: ctakes-dev@incubator.apache.org
> > >> >> Subject: Recommendation for ctakes default (UMLS) dictionaries
> > >> >>
> > >> >> Greetings ctakes-dev:
> > >> >>
> > >> >> *UMLS license restrictions have been getting more lax over the
> > >> >> years -- *much of the UMLS can be downloaded directly from the
> > >> >> NCBI official FTP site.
> > >> >>
> > >> >> In fact, the NIH (and implicitly the NLM) *have already made the
> > >> >> standard terms public for some medical specialities*.
> > >> >>
> > >> >> For example: Here is the UMLS subset specific to Medical Genetics
> > >> >> (MedGen) and Genetic Testing (GTR) complete with SNOMED-CT concept
> > >> >> CUI(s) and names, etc.:
> > >> >>
> > >> >> [  ftp://ftp.ncbi.nlm.nih.gov/pub/medgen/README.html  ]
> > >> >>
> > >> >> My team has developed a JVM based wrapper for MetaMap 2013AB which
> > >> >> I intend to open source soon (Clojure).  It includes REST support
> > >> >> for invoking MetaMap with any or all of the command line arguments.
> > >> >> We do not integrate with UIMA, we are basically a wrapper around
> > >> >> the binary installation of MetaMap. The emphasis is on publication
> > >> >> text not clinical text, still, some services are common (such as
> > >> >> LVG).
> > >> >>
> > >> >> Strangely, the NLM still requires UMLS licenses to download
> > >> >> MetaMap execution binaries. The MetaMap binary install is better
> > >> >> but customizing dictionaries (DataFileBuilder) is not as easy to
> > >> >> use as CTAKES with YTEX
> > >> >>
> > >> >> [ https://cwiki.apache.org/confluence/display/CTAKES/YTEX+Installation ]
> > >> >>
> > >> >> *** Hence, there is a real opportunity here to enable Apache
> > >> >> cTAKES to have a stronger default dictionary. ** *
> > >> >>
> > >> >> Imagine if we could
> > >> >> *$ apt-get install apache-ctakes *
> > >> >>
> > >> >> and instantly have a working package for SOME problem domain.
> > >> >> In my case (Medical Genetics) the UMLS definitions are already
> > >> >> available and the UMLS license problem becomes a non issue, at
> > >> >> least for many first time users
> > >> >>
> > >> >> Your thoughts?
> > >> >> AndyMC
> > >> >>
> > >>
> > >>
>
