incubator-ctakes-dev mailing list archives

From Karthik Sarma <>
Subject Re: [VOTE] Apache cTAKES 3.0.0-incubating RC5 release
Date Wed, 23 Jan 2013 19:00:57 GMT
Hi folks,

I hope you'll excuse the entry of a rather junior newcomer, but it seems to
me that there might be some misunderstandings about the nature of the

In particular, unless I am mistaken, the 'binaries' in question are jars
filled with human-readable ASCII files. Developers are therefore free to
peruse these files to see how the model is put together, and could even
make modifications to the files if they desire (though as mentioned
earlier, this would be quite foolish).

The sticking point regarding the training data for the models is almost
certainly that the data consists of medical records protected by HIPAA. For
example, the Mayo data used for the sentence detector model includes Mayo
in-house programmatically de-identified patient records. This kind of data
is generally never released without a data use agreement (DUA). I'm not
familiar with any major de-identified clinical record datasets that are
available without a DUA, as the liability (and, frankly, the moral concern)
deriving from the risk of a third party re-identifying the data is simply
too great.

This being said, anybody who uses cTAKES must have a corpus that could be
used to train new models, since the use of cTAKES requires input data.
Developers who wish to contribute modifications therefore can test model
generation and use on their own data before contributing.

Problems remain with the issue of the performance of any contributed
modifications when training on the 'official' non-distributed datasets, and
it is true that contributors would not be able to test this a priori. I
imagine there are certainly committers with access to the datasets who
could provide feedback, but I suspect this issue is less a licensing issue
and more an issue with the nature of how cTAKES works. All users and
developers need to be cognizant of the applicability of the distributed
models to their own datasets, and I would bet that the models are not
highly performant on most other institutional medical record corpora, and
that any contributions to code involving the models would have a similar

Given this fact, it seems to me that maybe the models should just be
considered the same as, say, an image file distributed with code as a
placeholder. Users are free to replace the placeholder with something that
works for them, and the placeholder is not intended to be something that
will work for anyone or that would make it into any production distribution.

Hopefully at least some of this makes sense and is helpful!


Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical

On Wed, Jan 23, 2013 at 8:20 AM, Benson Margulies <> wrote:

> So, nothing derived from those undisclosable sources can be in the
> source package: period.
> As for the binaries, I am personally uncomfortable if you cannot even
> create a private download of those sources accessible to community
> members. However, I don't know how to translate my personal discomfort
> into policy. I will endeavour to get some advice.
> On Wed, Jan 23, 2013 at 10:36 AM, Masanz, James J.
> <> wrote:
> > One goal is to have a binary that contains all resources, which can be
> > used to install cTAKES on a system that does not have an internet
> > connection.
> > For now we can focus on a first Apache release that doesn't meet that
> > goal, while pursuing the question with legal.
> > If legal says we can't have that kind of binary here, then in the
> > future we can consider if we will host such a binary on a different site.
> >
> > Regards,
> > James Masanz
> >
> >> -----Original Message-----
> >> On Behalf Of Chris Douglas
> >> Sent: Wednesday, January 23, 2013 3:45 AM
> >> Subject: Re: [VOTE] Apache cTAKES 3.0.0-incubating RC5 release
> >>
> >> On Wed, Jan 23, 2013 at 12:47 AM, Jörn Kottmann <>
> >> wrote:
> >> > No, the OpenNLP did not have any discussion about it with legal. We
> >> > just came to the conclusion that it's not worth spending time on these
> >> > issues, when we can instead produce our own training data which is
> >> > compatible with the Apache license.
> >>
> >> Understood. Are the compatible training data synthetic? Would you
> >> recommend a similar course here?
> >>
> >> James, is there a reason the models need to be distributed through
> >> Apache? Your time is your own, but going through legal could delay your
> >> release. -C
> >>
