ctakes-dev mailing list archives

From "Finan, Sean" <Sean.Fi...@childrens.harvard.edu>
Subject RE: Ctakes to process 5000K recoreds
Date Tue, 09 Sep 2014 20:24:16 GMT
Hi Nick,

I think that the bottleneck is probably the lookup module itself.  So, I just sent you a secure
email/ftp link.  It contains a build of the new dictionary-lookup-fast module.  Should you
choose to try it, let me know how things turn out.

Sean
________________________________________
From: Nick Nikandish [snikandi@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 4:10 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Thanks, let me try it.
Nick

-----Original Message-----
From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
Sent: Tuesday, September 09, 2014 4:08 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

If you just need the medication names, you can remove these:
 <node>ContextDependentTokenizerAnnotator</node>
 <node>DependencyParser</node>
 <node>AssertionAnnotator</node>

You might be able to get rid of the LvgAnnotator and still get decent results since variations
of word form should not affect medication names. I would try with it and without it on a smaller
set of files and see if you see a difference.

I believe the others are needed by the default configs for medication lookup. For example,
POS is used to get phrase type. Phrases are used to remove verb phrases from the lookup, which
in turn keeps the lookup windows from getting too big.  I'm more familiar with the
other types of named entities (diseases, symptoms, etc.) than with medications.
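
As a rough sketch, applying the removals above to the fixed flow from the original question
would leave something like the following. The enclosing flowConstraints/fixedFlow elements are
the usual UIMA aggregate-descriptor wrapper and are assumed here, since the thread only shows
the node entries. LvgAnnotator is kept but marked as the one to test with and without.

```xml
<flowConstraints>
  <fixedFlow>
    <node>SimpleSegmentAnnotator</node>
    <node>SentenceDetectorAnnotator</node>
    <node>TokenizerAnnotator</node>
    <!-- Optional: compare results with and without LvgAnnotator on a small sample -->
    <node>LvgAnnotator</node>
    <node>POSTagger</node>
    <node>Chunker</node>
    <node>LookupWindowAnnotator</node>
    <node>DictionaryLookupAnnotatorDB</node>
    <node>ExtractionPrepAnnotator</node>
  </fixedFlow>
</flowConstraints>
```

Each node name still has to match a delegate declared elsewhere in the same aggregate descriptor,
so this fragment is not a complete descriptor on its own.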

-----Original Message-----
From: Nick Nikandish [mailto:snikandi@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 3:01 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

James,

Do you have any suggestions for running cTAKES with the minimum set of annotators that can still
return medications in DictionaryLookupAnnotator?
Thanks,
Nick

-----Original Message-----
From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
Sent: Tuesday, September 09, 2014 3:05 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

I suspect that when you take out the simple segment annotator, nothing is getting processed, and
that is why it appears so fast. At least some of the annotators loop through the list of sections/segments,
which is why there is a simple segment annotator - so that there is at least one section/segment
identified. Are you getting any annotations at all?

-----Original Message-----
From: Nick Nikandish [mailto:snikandi@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 2:02 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Pei,
I need the names of the medications for an application that I wrote that uses cTAKES, so
I cache the medications in DictionaryLookupAnnotator (in performLookup()) and use them in my
program. But when I include SimpleSegmentAnnotator, it just takes forever. After taking SimpleSegmentAnnotator
out, no medication names are returned from DictionaryLookupAnnotator in the code. So I was wondering
if there is a way that I could eliminate SimpleSegmentAnnotator but still be able to get
the medication names in that class?

Nick

-----Original Message-----
From: Pei Chen [mailto:chenpei@apache.org]
Sent: Tuesday, September 09, 2014 2:54 PM
To: dev@ctakes.apache.org
Subject: Re: Ctakes to process 5000K recoreds

Nick,
When you say no medication is being annotated, I presume you mean the medication attributes
(i.e. dosage, frequency, etc.) are not being annotated?  I think the DrugNER needs a list
of section names in the config; I think it includes SIMPLE_SEGMENT.  I am very surprised that
SimpleSegmentAnnotator is the bottleneck though; all it does is assume the entire document
is a single section called SIMPLE_SEGMENT.
Have you tried commenting out the DependencyParser if you're not using those features?

--Pei


On Tue, Sep 9, 2014 at 2:45 PM, Nick Nikandish <snikandi@emerginghealthit.com> wrote:
>
> Hi there,
>
> I am using cTAKES to process 5000K free-text records, where each record has several
> medications.
> This is the fixed flow that it goes through:
>
>     <node>SimpleSegmentAnnotator</node>
>     <node>SentenceDetectorAnnotator</node>
>     <node>TokenizerAnnotator</node>
>     <node>LvgAnnotator</node>
>     <node>ContextDependentTokenizerAnnotator</node>
>     <node>POSTagger</node>
>     <node>Chunker</node>
>     <node>LookupWindowAnnotator</node>
>     <node>DictionaryLookupAnnotatorDB</node>
>     <node>DependencyParser</node>
>     <node>AssertionAnnotator</node>
>     <node>ExtractionPrepAnnotator</node>
>
> But it takes a very long time to process that much data (maybe a week or so) when
> I use SimpleSegmentAnnotator.  By eliminating SimpleSegmentAnnotator the process is very fast,
> but no medication is being annotated.  Do you have any suggestions?
>
> Thanks,
> Nick
>
