ctakes-dev mailing list archives

From Pei Chen <chen...@apache.org>
Subject Re: Dependency Parser model data
Date Mon, 16 Mar 2015 04:46:26 GMT
Ephi,
The ClearNLP models in the current cTAKES releases (since 3.1.0 [1]) should
contain much more.  They should contain at least MiPACQ and SHARP training
data.  Could you point us to the documentation so we can update it?  I
believe the breakdown was:


   - Clinical questions: 1,600 sentences, 30,138 tokens.
   - Medpedia articles: 2,796 sentences, 49,922 tokens.
   - MiPACQ clinical notes: 8,040 sentences, 107,663 tokens.
   - MiPACQ pathological notes: 1,225 sentences, 21,581 tokens.
   - Seattle group health clinical notes: 5,020 sentences, 61,124 tokens.
   - Seattle group health pathological notes: 2,294 sentences, 34,384 tokens.
   - SHARP clinical notes: 6,787 sentences, 94,205 tokens.
   - SHARP stratified: 4,316 sentences, 43,037 tokens.
   - SHARP stratified SGH: 4,963 sentences, 49,081 tokens.
   - TEMPREL clinical notes: 19,775 sentences, 266,979 tokens.
   - TEMPREL pathological notes: 4,335 sentences, 78,829 tokens.
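
For anyone who wants to compare the size of the full training set against the
1,600 clinical questions mentioned in the documentation, the figures above can
be tallied with a short script. This is only a sketch: the numbers are copied
from the list as given, and the names CORPORA, total_sentences and
total_tokens are made up for illustration, not taken from cTAKES or ClearNLP.

```python
# Corpus sizes copied from the breakdown above: (name, sentences, tokens).
# CORPORA is a hypothetical name used only for this illustration.
CORPORA = [
    ("Clinical questions", 1600, 30138),
    ("Medpedia articles", 2796, 49922),
    ("MiPACQ clinical notes", 8040, 107663),
    ("MiPACQ pathological notes", 1225, 21581),
    ("Seattle group health clinical notes", 5020, 61124),
    ("Seattle group health pathological notes", 2294, 34384),
    ("SHARP clinical notes", 6787, 94205),
    ("SHARP stratified", 4316, 43037),
    ("SHARP stratified SGH", 4963, 49081),
    ("TEMPREL clinical notes", 19775, 266979),
    ("TEMPREL pathological notes", 4335, 78829),
]

# Sum each column across all listed corpora.
total_sentences = sum(sentences for _, sentences, _ in CORPORA)
total_tokens = sum(tokens for _, _, tokens in CORPORA)

print(f"{len(CORPORA)} corpora: {total_sentences:,} sentences, {total_tokens:,} tokens")

# Share of the total contributed by each corpus, by sentence count.
for name, sentences, _ in CORPORA:
    print(f"{name}: {sentences / total_sentences:.1%} of sentences")
```

The totals only cover the sources listed in this message; they say nothing
about how the data was split or weighted during training.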

There has been some discussion about appending to or augmenting the existing
annotated training data [2].  I think the short answer is that there is
currently no easy way to do it, short of signing DUAs with every single
source institution.

[1] http://svn.apache.org/r1465043
[2] http://mail-archives.apache.org/mod_mbox/ctakes-dev/201412.mbox/%3CE5A9FA5ABBF1CA4085D4F0794852A51E2424117D@CHEXMBX3A.CHBOSTON.ORG%3E


On Sun, Mar 15, 2015 at 11:58 AM, Ephi <ephi79@gmail.com> wrote:

> Hi -
>
> From the documentation, the data used to train the dep parser in cTAKES
> seems to be 1600 clinical questions (from the Mayo clinic?).
>
> Is there a way to retrieve this data in order to retrain the model (while
> adding on additional data) ?
>
> Thanks!
> Ephi
>
