ctakes-dev mailing list archives

From Ephi <eph...@gmail.com>
Subject Re: Dependency Parser model data
Date Tue, 24 Mar 2015 09:49:34 GMT
Thanks!

***** 1 ******
Regarding the documentation - the documentation for cTAKES 3.2 [1] links to
the Dependency Parser documentation for 3.0 [2]; there doesn't seem to be an
updated page for this component.

The 3.0 page says only that clinques.mod is the main ClearParser model
packaged with cTAKES v1.1 and that it is trained on a corpus of 1600
clinical questions.

***** 2 ******
Regarding training the models myself - I tried following the documentation
but didn't succeed. The documentation [2] states the following:

1. Download and install the C++ version of liblinear from National Taiwan
University; this requires much less memory than the default Java version.
2. Train a model
To create a model using cTAKES POS tags and lemmas with Eclipse:
1. Create a <your-data>.min file from <your-data>.dep (see the section
called "Conversion between formats")
2. Use the UIMA_CPE_GUI---dependency parser launch.
3. Load desc/collection_processing_engine/ClearTrainerPosLemTestCPE.xml
4. Put your filename under "Dependency File"
5. Make sure "Training Mode" is checked
6. Rename the "Dependency Model File" and "Lexicon Directory" according to
what you want.
7. Make sure "Trainer Path" is a valid relative path from
<cTAKES_HOME>/dependency parser to a valid liblinear binary train file.
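
One way to rule out step 7 as the cause is to resolve "Trainer Path" by hand
and confirm it points at an executable file. The sketch below is only an
illustration: the install location, the module directory name and the
relative path in the example call are placeholders I made up, not values
taken from the cTAKES documentation.

```python
import os
from pathlib import Path


def resolve_trainer(ctakes_home, module_dir, trainer_path):
    """Resolve a relative "Trainer Path" against <ctakes_home>/<module_dir>."""
    base = Path(ctakes_home) / module_dir
    return (base / trainer_path).resolve()


def check_trainer(ctakes_home, module_dir, trainer_path):
    """Return (ok, message) saying whether the resolved path is a usable binary."""
    candidate = resolve_trainer(ctakes_home, module_dir, trainer_path)
    if not candidate.is_file():
        return False, f"not a file: {candidate}"
    if not os.access(candidate, os.X_OK):
        return False, f"not executable: {candidate}"
    return True, f"ok: {candidate}"


if __name__ == "__main__":
    # All three arguments are hypothetical placeholders; substitute your own
    # install directory, module directory and relative trainer path.
    ok, message = check_trainer(
        "/opt/ctakes", "ctakes-dependency-parser", "../liblinear/train"
    )
    print(message)
```

If this reports "not a file", the relative path is being resolved from a
different directory than expected, which is worth checking before looking at
the descriptor itself.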


Regarding step 2 - cTAKES 3.2 doesn't seem to have the UIMA_CPE_GUI launch;
there is only bin/runCPE.bat, so I tried running that instead.

Regarding step 3 - when I tried to load
desc\ctakes-dependency-parser\desc\collection_processing_engine\ClearTrainerPosLemTestCPE.xml
I got an error (snapshot attached).

Any ideas?

Thanks, Ephi

[1]
https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+Component+Use+Guide
[2]
https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.0+-+Dependency+Parser+and+Semantic+Role+Labeler

On Mon, Mar 16, 2015 at 6:46 AM, Pei Chen <chenpei@apache.org> wrote:

> Ephi,
> The ClearNLP models in the current cTAKES releases (since 3.1.0 [1]) should
> contain much more.  They should contain at least MiPACQ and SHARP training
> data.  Could you point us to the documentation so we can update it?  I
> believe the breakdown was:
>
>
>    - Clinical questions: 1,600 sentences, 30,138 tokens.
>    - Medpedia articles: 2,796 sentences, 49,922 tokens.
>    - MiPACQ clinical notes: 8,040 sentences, 107,663 tokens.
>    - MiPACQ pathological notes: 1,225 sentences, 21,581 tokens.
>    - Seattle group health clinical notes: 5,020 sentences, 61,124 tokens.
>    - Seattle group health pathological notes: 2,294 sentences, 34,384
>    tokens.
>    - SHARP clinical notes: 6,787 sentences, 94,205 tokens.
>    - SHARP stratified: 4,316 sentences, 43,037 tokens.
>    - SHARP stratified SGH: 4,963 sentences, 49,081 tokens.
>    - TEMPREL clinical notes: 19,775 sentences, 266,979 tokens.
>    - TEMPREL pathological notes: 4,335 sentences, 78,829 tokens.
>
> There have been some discussions on appending/augmenting the existing
> annotated/training data [2].  I think the short answer is that there is
> currently no easy way short of signing DUAs with every single source
> institution.
>
> [1] http://svn.apache.org/r1465043
> [2]
> http://mail-archives.apache.org/mod_mbox/ctakes-dev/201412.mbox/%3CE5A9FA5ABBF1CA4085D4F0794852A51E2424117D@CHEXMBX3A.CHBOSTON.ORG%3E
>
>
> On Sun, Mar 15, 2015 at 11:58 AM, Ephi <ephi79@gmail.com> wrote:
>
> > Hi -
> >
> > From the documentation, the data used to train the dep parser in cTAKES
> > seems to be 1600 clinical questions (from the Mayo clinic?).
> >
> > Is there a way to retrieve this data in order to retrain the model (while
> > adding on additional data) ?
> >
> > Thanks!
> > Ephi
> >
>
