ctakes-dev mailing list archives

From "Finan, Sean" <Sean.Fi...@childrens.harvard.edu>
Subject RE: Ctakes to process 5000K recoreds
Date Tue, 09 Sep 2014 21:27:58 GMT
Yes, the code is in the sandbox.  
________________________________________
From: Chen, Pei [Pei.Chen@childrens.harvard.edu]
Sent: Tuesday, September 09, 2014 5:26 PM
To: <dev@ctakes.apache.org>
Subject: Re: Ctakes to process 5000K recoreds

Sean-
Aren't the scripts to generate the DB already available in the sandbox area?

> On Sep 9, 2014, at 5:24 PM, "Finan, Sean" <Sean.Finan@childrens.harvard.edu> wrote:
>
> There is a tool to generate a dictionary in the new format using the UMLS MR*** files.
>
> The module can also read directly from a file with bar-separated values: CUI|Text or
> CUI|TUI|Text, which could be useful for small custom dictionaries.
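
A minimal sketch of reading the two bar-separated layouts mentioned above (CUI|Text and CUI|TUI|Text). This is not the dictionary-lookup-fast reader itself; the names `DictionaryEntry` and `parse_bsv_line` are hypothetical, the sample codes are only illustrative, and skipping blank lines is an assumption.

```python
from typing import NamedTuple, Optional


class DictionaryEntry(NamedTuple):
    """One dictionary row; tui is None when the line has only CUI|Text."""
    cui: str
    tui: Optional[str]
    text: str


def parse_bsv_line(line: str) -> Optional[DictionaryEntry]:
    """Parse one bar-separated line; return None for blank lines (an assumption)."""
    line = line.strip()
    if not line:
        return None
    fields = [field.strip() for field in line.split("|")]
    if len(fields) == 2:
        cui, text = fields
        return DictionaryEntry(cui, None, text)
    if len(fields) == 3:
        cui, tui, text = fields
        return DictionaryEntry(cui, tui, text)
    raise ValueError(f"expected CUI|Text or CUI|TUI|Text, got: {line!r}")


# Illustrative rows only; check the codes against your own UMLS release.
sample = ["C0004057|aspirin", "C0020740|T121|ibuprofen", ""]
entries = [e for e in map(parse_bsv_line, sample) if e is not None]
```

Rejecting lines with any other field count keeps a malformed custom dictionary from being silently half-loaded.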
>
> I can send a copy of the dictionary creator jar and instructions tomorrow.
>
> Sean
> ________________________________________
> From: Bruce Tietjen [bruce.tietjen@perfectsearchcorp.com]
> Sent: Tuesday, September 09, 2014 5:17 PM
> To: dev@ctakes.apache.org
> Subject: Re: Ctakes to process 5000K recoreds
>
> Sean,
>
> If that is a script for generating a dictionary for use with
> dictionary-lookup-fast, I would also be very interested in checking it out.
>
> Thanks,
>
> Bruce
>
>
> IMAT Solutions <http://imatsolutions.com>
> Bruce Tietjen
> Senior Software Engineer
> Mobile: 801.634.1547
> bruce.tietjen@imatsolutions.com
>
> On Tue, Sep 9, 2014 at 2:40 PM, Nick Nikandish <
> snikandi@emerginghealthit.com> wrote:
>
>> Great. I will do that. Thanks again.
>>
>> Nick
>>
>> -----Original Message-----
>> From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
>> Sent: Tuesday, September 09, 2014 4:39 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> Just use it with cTakes.  Instead of removing other modules from the
>> pipeline, replace the dictionary-lookup with dictionary-lookup-fast.
>>
>> For
>> desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml,
>> you would modify:
>>
>>    <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
>>      <import location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>
>>    </delegateAnalysisEngine>
>>
>> To be:
>>
>>    <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
>>      <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
>>    </delegateAnalysisEngine>
>>
>>
>> That should be it.  You can then leave the rest of the module
>> specifications alone.
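
The one-line change Sean describes could be scripted roughly as follows. This is only a sketch: it treats the descriptor as plain text and swaps the import location quoted in the message; the function name `swap_lookup_import` is hypothetical, and it assumes the `location` attribute sits on a single line in the real file.

```python
# Import locations exactly as quoted in the message above.
OLD_LOCATION = (
    "../../../ctakes-dictionary-lookup/desc/analysis_engine/"
    "DictionaryLookupAnnotatorUMLS.xml"
)
NEW_LOCATION = (
    "../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/"
    "UmlsLookupAnnotator.xml"
)


def swap_lookup_import(descriptor_text: str) -> str:
    """Return the descriptor text with the dictionary-lookup import replaced."""
    if OLD_LOCATION not in descriptor_text:
        # Fail loudly rather than write back an unchanged descriptor.
        raise ValueError("old dictionary-lookup import not found")
    return descriptor_text.replace(OLD_LOCATION, NEW_LOCATION)


fragment = (
    '<delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">\n'
    f'  <import location="{OLD_LOCATION}"/>\n'
    "</delegateAnalysisEngine>\n"
)
updated = swap_lookup_import(fragment)
```

The delegate key stays `DictionaryLookupAnnotatorDB`, so the `<node>` entries in the fixed flow do not need to change.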
>>
>> Sean
>>
>> ________________________________________
>> From: Nick Nikandish [snikandi@emerginghealthit.com]
>> Sent: Tuesday, September 09, 2014 4:32 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> Hi Sean,
>>
>> Many thanks, I will try it tomorrow. Do you have any special instructions
>> for running that script, or do I have to use it with cTakes?
>>
>> Thanks,
>> Nick
>>
>> -----Original Message-----
>> From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
>> Sent: Tuesday, September 09, 2014 4:24 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> Hi Nick,
>>
>> I think that the bottleneck is probably the lookup module itself.  So, I
>> just sent you a secure email/ftp link.  It contains a build of the new
>> dictionary-lookup-fast module.  Should you choose to try it, let me know
>> how things turn out.
>>
>> Sean
>> ________________________________________
>> From: Nick Nikandish [snikandi@emerginghealthit.com]
>> Sent: Tuesday, September 09, 2014 4:10 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> Thanks, let me try it.
>> Nick
>>
>> -----Original Message-----
>> From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
>> Sent: Tuesday, September 09, 2014 4:08 PM
>> To: 'dev@ctakes.apache.org'
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> If you just need the medication names, you can remove these:
>> <node>ContextDependentTokenizerAnnotator</node>
>> <node>DependencyParser</node>
>> <node>AssertionAnnotator</node>
>>
>> You might be able to get rid of the LvgAnnotator and still get decent
>> results since variations of word form should not affect medication names. I
>> would try with it and without it on a smaller set of files and see if you
>> see a difference.
>>
>> I believe the others are needed by the default configs for medication
>> lookup. For example, POS is used to get phrase type. Phrases are used to
>> remove verb phrases from the lookup and also therefore to keep the lookup
>> windows from getting too big.  I'm more familiar with the other types of
>> named entities (diseases, symptoms, etc) than with medications.
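
The trimmed flow James suggests can be sketched as a simple filter over the node list from the original message. The helper names `trim_flow` and `as_node_xml` are hypothetical; whether LvgAnnotator can also go is left open, as in the message, so it is kept here.

```python
from typing import Iterable, List

# Fixed flow as listed in the original message of this thread.
FIXED_FLOW = [
    "SimpleSegmentAnnotator",
    "SentenceDetectorAnnotator",
    "TokenizerAnnotator",
    "LvgAnnotator",
    "ContextDependentTokenizerAnnotator",
    "POSTagger",
    "Chunker",
    "LookupWindowAnnotator",
    "DictionaryLookupAnnotatorDB",
    "DependencyParser",
    "AssertionAnnotator",
    "ExtractionPrepAnnotator",
]

# Nodes named as removable when only medication names are needed.
NOT_NEEDED_FOR_MEDICATION_NAMES = {
    "ContextDependentTokenizerAnnotator",
    "DependencyParser",
    "AssertionAnnotator",
}


def trim_flow(flow: Iterable[str], to_remove: Iterable[str]) -> List[str]:
    """Return the flow without the given nodes, preserving the original order."""
    removed = set(to_remove)
    return [name for name in flow if name not in removed]


def as_node_xml(flow: Iterable[str]) -> str:
    """Render node names as <node> lines for pasting into a descriptor."""
    return "\n".join(f"<node>{name}</node>" for name in flow)


trimmed = trim_flow(FIXED_FLOW, NOT_NEEDED_FOR_MEDICATION_NAMES)
print(as_node_xml(trimmed))
```

Comparing output on a small sample of records before and after trimming, as suggested above for LvgAnnotator, is a cheap way to confirm nothing needed was dropped.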
>>
>> -----Original Message-----
>> From: Nick Nikandish [mailto:snikandi@emerginghealthit.com]
>> Sent: Tuesday, September 09, 2014 3:01 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> James,
>>
>> Do you have any suggestions for running cTakes with the minimum set of
>> annotators that still returns medications in DictionaryLookupAnnotator?
>> Thanks,
>> Nick
>>
>> -----Original Message-----
>> From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
>> Sent: Tuesday, September 09, 2014 3:05 PM
>> To: 'dev@ctakes.apache.org'
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> I suspect that when you take out the simple segment annotator, nothing is
>> getting processed, and that is why it appears so fast. At least some of the
>> annotators loop through the list of sections/segments, which is why there
>> is a simple segment annotator - so that there is at least one
>> section/segment identified. Are you getting any annotations at all?
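
A toy model of the behaviour James describes, not cTAKES code: a downstream step that loops over segments does no work at all when no segment exists, which would explain a fast run with no annotations. All names here are hypothetical, and the substring match merely stands in for the real dictionary lookup.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (begin, end) character offsets into the document


def simple_segment(text: str) -> List[Segment]:
    """Mark the whole document as a single segment."""
    return [(0, len(text))]


def lookup_terms(text: str, segments: List[Segment], dictionary: List[str]) -> List[str]:
    """Find dictionary terms, but only inside the given segments."""
    found = []
    for begin, end in segments:  # zero segments means zero iterations
        window = text[begin:end].lower()
        for term in dictionary:
            if term in window:
                found.append(term)
    return found


note = "Patient takes aspirin and ibuprofen daily."
terms = ["aspirin", "ibuprofen", "warfarin"]

with_segment = lookup_terms(note, simple_segment(note), terms)
without_segment = lookup_terms(note, [], terms)
```

With the segment step in place the lookup sees the full text; without it the loop body never runs, so the result is empty however many terms the note contains.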
>>
>> -----Original Message-----
>> From: Nick Nikandish [mailto:snikandi@emerginghealthit.com]
>> Sent: Tuesday, September 09, 2014 2:02 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> Pei,
>> I need the names of the medications for an application that I wrote that
>> uses cTakes, so I cache the medications in DictionaryLookupAnnotator (in
>> performLookup()) and use them in my program. But with
>> SimpleSegmentAnnotator in the pipeline it just takes forever. After taking
>> SimpleSegmentAnnotator out, no medication names are returned in
>> DictionaryLookupAnnotator. So I was wondering if
>> there is a way to eliminate SimpleSegmentAnnotator but still
>> get the medication names in that class?
>>
>> Nick
>>
>> -----Original Message-----
>> From: Pei Chen [mailto:chenpei@apache.org]
>> Sent: Tuesday, September 09, 2014 2:54 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: Ctakes to process 5000K recoreds
>>
>> Nick,
>> When you say no medication is being annotated, I presume you mean the
>> medication attributes (i.e. dosage, frequency, etc.) are not being
>> annotated?  I think the DrugNER needs a list of section names in the
>> config; I think it includes SIMPLE_SEGMENT.  I am very surprised that
>> SimpleSegmentAnnotator is the bottleneck though; all it does is assume
>> the entire document is a single section called SIMPLE_SEGMENT.
>> Have you tried commenting out the DependencyParser if you're not using
>> those features?
>>
>> --Pei
>>
>>
>> On Tue, Sep 9, 2014 at 2:45 PM, Nick Nikandish <
>> snikandi@emerginghealthit.com> wrote:
>>>
>>> Hi there,
>>>
>>> I am using cTakes to process 5000K free-text records, where each record
>>> has several medications.
>>> This is the fixed flow that it goes through:
>>> <node>SimpleSegmentAnnotator</node>
>>> <node>SentenceDetectorAnnotator</node>
>>> <node>TokenizerAnnotator</node>
>>> <node>LvgAnnotator</node>
>>> <node>ContextDependentTokenizerAnnotator</node>
>>> <node>POSTagger</node>
>>> <node>Chunker</node>
>>> <node>LookupWindowAnnotator</node>
>>> <node>DictionaryLookupAnnotatorDB</node>
>>> <node>DependencyParser</node>
>>> <node>AssertionAnnotator</node>
>>>
>>> <node>ExtractionPrepAnnotator</node>
>>>
>>> But it takes a very long time to process that much data (maybe a week
>>> or so) when I use SimpleSegmentAnnotator.  By eliminating
>>> SimpleSegmentAnnotator the process is very fast, but no medication is being
>>> annotated.  Do you guys have any suggestions?
>>>
>>> Thanks,
>>> Nick
>>
