ctakes-dev mailing list archives

From Ephi <eph...@gmail.com>
Subject Re: Dependency Parser model data
Date Tue, 24 Mar 2015 16:29:41 GMT
After some more research, it seems that you can run the CPE to train a
model in cTAKES 2.0, but this doesn't work in cTAKES 3.2.

Many files are missing but, more importantly, it seems that the model format
has changed. In version 2.0 the model format was the basic liblinear/libsvm
format. In 3.2, the format is the slightly modified clearNLP format, which
uses strings for features and labels instead of numbers.

So, assuming that we want to train on our own data, we would like to be able
to convert a .dep file into a feature file in the clearNLP format. We could
then convert that file to the liblinear format and train a model with
liblinear. The resulting model could be converted back to the clearNLP
format and passed to the predictor.
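
A minimal Python sketch of the string-to-number step, to make the idea
concrete. It assumes (this is not confirmed anywhere in this thread) that each
line of the clearNLP-style feature file is a label followed by
whitespace-separated string features, and that every feature is binary. The
function names and the sample feature strings are made up for illustration.
The output uses the standard liblinear layout of "label index:value ..." with
ascending 1-based feature indices.

```python
from typing import Dict, Iterable, List, Tuple


def build_index(items: Iterable[str]) -> Dict[str, int]:
    """Assign 1-based integer ids to strings in first-seen order."""
    index: Dict[str, int] = {}
    for item in items:
        if item not in index:
            index[item] = len(index) + 1
    return index


def to_liblinear(lines: List[str]) -> Tuple[List[str], Dict[str, int], Dict[str, int]]:
    """Convert string-valued instances to liblinear-style numeric lines.

    ASSUMPTION: each input line is "<label> <feature> <feature> ...",
    all tokens are plain strings, and every feature is binary (value 1).
    """
    rows = [line.split() for line in lines if line.strip()]
    label_ids = build_index(row[0] for row in rows)
    feature_ids = build_index(f for row in rows for f in row[1:])

    out = []
    for row in rows:
        # liblinear expects feature indices in ascending order, no repeats.
        ids = sorted({feature_ids[f] for f in row[1:]})
        out.append(" ".join([str(label_ids[row[0]])] + [f"{i}:1" for i in ids]))
    return out, label_ids, feature_ids


if __name__ == "__main__":
    sample = ["NMOD pos=NN lemma=pain", "SBJ pos=NN dist=1"]  # hypothetical data
    converted, labels, features = to_liblinear(sample)
    print("\n".join(converted))
```

The two returned dictionaries are what would be needed to translate the
numeric labels and feature ids of a trained liblinear model back into strings.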

So is this possible? Is there a way to translate a dependency tree file
into a clearNLP file?

Thanks, Ephi

On Mon, Mar 16, 2015 at 6:46 AM, Pei Chen <chenpei@apache.org> wrote:

> Ephi,
> The ClearNLP models in the current cTAKES releases (since 3.1.0 [1]) should
> contain much more. They should contain at least MiPACQ and SHARP training
> data. Could you point us to the documentation so we can update it? I
> believe the breakdown was:
>
>
>    - Clinical questions: 1,600 sentences, 30,138 tokens.
>    - Medpedia articles: 2,796 sentences, 49,922 tokens.
>    - MiPACQ clinical notes: 8,040 sentences, 107,663 tokens.
>    - MiPACQ pathological notes: 1,225 sentences, 21,581 tokens.
>    - Seattle group health clinical notes: 5,020 sentences, 61,124 tokens.
>    - Seattle group health pathological notes: 2,294 sentences, 34,384
>    tokens.
>    - SHARP clinical notes: 6,787 sentences, 94,205 tokens.
>    - SHARP stratified: 4,316 sentences, 43,037 tokens.
>    - SHARP stratified SGH: 4,963 sentences, 49,081 tokens.
>    - TEMPREL clinical notes: 19,775 sentences, 266,979 tokens.
>    - TEMPREL pathological notes: 4,335 sentences, 78,829 tokens.
>
> There has been some discussion of appending/augmenting the existing
> annotated/training data [2]. I think the short answer is that there is
> currently no easy way short of having to sign DUAs from every single
> source institution.
>
> [1] http://svn.apache.org/r1465043
> [2] http://mail-archives.apache.org/mod_mbox/ctakes-dev/201412.mbox/%3CE5A9FA5ABBF1CA4085D4F0794852A51E2424117D@CHEXMBX3A.CHBOSTON.ORG%3E
>
>
> On Sun, Mar 15, 2015 at 11:58 AM, Ephi <ephi79@gmail.com> wrote:
>
> > Hi -
> >
> > From the documentation, the data used to train the dep parser in cTAKES
> > seems to be 1600 clinical questions (from the Mayo clinic?).
> >
> > Is there a way to retrieve this data in order to retrain the model (while
> > adding additional data)?
> >
> > Thanks!
> > Ephi
> >
>
