ctakes-user mailing list archives

From Jonathan Bates <jonrba...@gmail.com>
Subject Re: cTakes Scalability Problem
Date Wed, 02 Jul 2014 14:50:11 GMT
Disclaimer: I'm not a developer, just a user. To use DBConsumer, I had to
change "int" to "bigint" for the anno_base_id in the SQL tables to prevent
overflow in the annotation index. Speed did increase marginally after the
change, but I don't understand how the change in datatype could have been
the cause... Let us know how it works out!
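
The overflow mentioned above can be illustrated with a little arithmetic. This is only a sketch, assuming that "int" is a signed 32-bit column, "bigint" is signed 64-bit, and anno_base_id receives one new value per stored annotation; the helper name max_avg_annotations is hypothetical and not part of cTAKES.

```python
# Assumption: SQL INT is signed 32-bit and BIGINT is signed 64-bit,
# which holds for common databases but should be checked for yours.
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def max_avg_annotations(notes: int, id_max: int) -> int:
    """Largest whole average number of ids per note that still fits under id_max.

    Assumes one id is consumed per stored annotation (hypothetical model).
    """
    return id_max // notes


notes = 40_000_000  # corpus size mentioned in the thread

# With a 32-bit id, the average budget per note is small.
print(max_avg_annotations(notes, INT32_MAX))
# With a 64-bit id, the budget is far beyond any realistic annotation count.
print(max_avg_annotations(notes, INT64_MAX))
```

Under these assumptions, a 32-bit id leaves room for only about 53 annotations per note on average across 40M notes, which would explain why the wider type was needed.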

On Tue, Jul 1, 2014 at 11:50 AM, Prasanna Bala <balkiprasanna1984@gmail.com>
wrote:

> Hi,
> Thanks for your suggestions. So I have to change the "int" to "bigint" to
> improve the performance.
> I am also looking at UIMA DUCC:
> http://uima.apache.org/doc-uimaducc-whatitam.html
> The problem with Hadoop is that it runs as a batch process, so it cannot
> be used for low-latency real-time systems. But I still want to explore it.
> On Tue, Jul 1, 2014 at 6:20 PM, Jonathan Bates <jonrbates@gmail.com>
> wrote:
>> Hi Prasanna,
>> I am currently using 3.1.2 to process ~40M notes using 14 CPEs with
>> AggregatePlaintextUMLSProcessor+DBConsumer.  So far, ~34M notes have been
>> annotated and stored.  Altogether, I'm seeing 0.054sec/note.  This is with
>> 4.1k rows in v_snomed_fword_lookup.  One modification we had to make was to
>> change anno_base_id datatype from 'int' to 'bigint'.  It would be very
>> interesting to see Hadoop used with ctakes...
>> -Jon
>> On Tue, Jul 1, 2014 at 1:54 AM, Prasanna Bala <
>> balkiprasanna1984@gmail.com> wrote:
>>> Hi,
>>> I have some questions about using third-party libraries with cTakes and
>>> about the run time for processing documents with cTakes. We are able to
>>> run cTakes in batch mode, but we plan to process 1 million clinical
>>> documents. Can anyone tell me if they have tackled scalability with
>>> cTakes? I have an idea to distribute the processing using Hadoop. There
>>> are various libraries available that can use UIMA and distribute the
>>> processing using Hadoop. Since cTakes is also built on UIMA, I think
>>> there should be a way to distribute the processing. Has anyone tried
>>> this? Are there any limitations in distributing processing with cTakes?
>>> Your thoughts, please?
>>> Regards,
>>> Prasanna
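
For a rough sense of scale, the 0.054 sec/note figure quoted in the thread can be turned into a wall-clock estimate. This is a back-of-the-envelope sketch only: it assumes that the figure is the aggregate rate across all 14 CPEs and that throughput stays constant for the whole run; the function name wall_clock_hours is hypothetical.

```python
def wall_clock_hours(notes: int, sec_per_note: float) -> float:
    """Estimated elapsed hours, assuming a constant aggregate rate per note."""
    return notes * sec_per_note / 3600


# 1 million documents, the target size asked about in the thread.
print(round(wall_clock_hours(1_000_000, 0.054), 1))
# ~40 million notes, the corpus size reported in the thread.
print(round(wall_clock_hours(40_000_000, 0.054), 1))
```

On those assumptions, 1 million documents would take roughly 15 hours on a comparable setup and 40 million roughly 600 hours (about 25 days). Actual times depend on hardware, note length, and dictionary size.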