ctakes-notifications mailing list archives

From "Selina Chu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CTAKES-374) Scale out of cTAKES pipeline
Date Tue, 18 Aug 2015 18:10:45 GMT

     [ https://issues.apache.org/jira/browse/CTAKES-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Selina Chu updated CTAKES-374:
    Summary: Scale out of cTAKES pipeline  (was: Scale out of cTAKES pipeline. Finding better
ways to allow cTAKES to be easily run in a distributed fashion.)

> Scale out of cTAKES pipeline
> ----------------------------
>                 Key: CTAKES-374
>                 URL: https://issues.apache.org/jira/browse/CTAKES-374
>             Project: cTAKES
>          Issue Type: New Feature
>    Affects Versions: future enhancement
>            Reporter: Selina Chu
>             Fix For: 3.2.1
> Currently, cTAKES cannot easily be deployed asynchronously: UIMA components are not
> serializable, and therefore neither are cTAKES components. I would like to find better
> ways to run cTAKES in a distributed fashion.
> For example, a long document (e.g. 10+ pages) takes cTAKES a long time to process.
> I would like to see a feature that partitions the input to cTAKES, in a way that does not
> affect cTAKES annotation performance, so that the partitions can be processed on a cluster
> running in distributed mode (e.g. cTAKES on Spark Streaming), and that then recombines the
> results so that the word/phrase token positions are sequentially ordered.
> We have a simple implementation of the ClinicalPipelineFactory with Spark Streaming.
> Our initial attempt partitions the input by paragraph. For example, we are doing
> something like:
> RDD.map(a_single_paragraph.process_in_ctakes())
> I would also like to know whether there are better ways of doing this.
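
The partition-and-recombine idea described above can be sketched roughly as follows. This is a minimal illustration only, not the implementation from the issue: it assumes paragraphs are separated by blank lines, and toy_annotate is a hypothetical whitespace tokenizer standing in for a real cTAKES pipeline call. Each paragraph carries its starting character offset, so annotations produced independently per partition can be shifted back to document coordinates and sorted.

```python
import re
from typing import List, Tuple

# An annotation is (begin, end, covered_text), with character offsets.
Annotation = Tuple[int, int, str]


def split_paragraphs(document: str) -> List[Tuple[int, str]]:
    """Split on blank lines, keeping each paragraph's start offset."""
    paragraphs = []
    pos = 0
    # A capturing group makes re.split return the separators as well,
    # so summing chunk lengths keeps the offsets exact.
    for chunk in re.split(r"(\n\s*\n)", document):
        if chunk.strip():
            paragraphs.append((pos, chunk))
        pos += len(chunk)
    return paragraphs


def toy_annotate(paragraph: str) -> List[Annotation]:
    """Hypothetical stand-in for cTAKES: whitespace tokens, local offsets."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", paragraph)]


def process_partition(item: Tuple[int, str]) -> List[Annotation]:
    """Annotate one paragraph and shift its offsets to document coordinates."""
    offset, paragraph = item
    return [(b + offset, e + offset, t) for b, e, t in toy_annotate(paragraph)]


def annotate_document(document: str) -> List[Annotation]:
    partitions = split_paragraphs(document)
    # With PySpark this step would look roughly like
    # sc.parallelize(partitions).flatMap(process_partition).collect()
    shifted = [ann for part in partitions for ann in process_partition(part)]
    # Sorting restores sequential order regardless of partition completion order.
    return sorted(shifted)


if __name__ == "__main__":
    doc = "Patient denies pain.\n\nNo fever noted."
    for begin, end, text in annotate_document(doc):
        print(begin, end, text)
```

Because every partition is a self-contained (offset, text) pair, the per-paragraph step has no shared state and could be handed to any parallel map. Whether paragraph boundaries are safe split points for every cTAKES annotator is a separate question this sketch does not address.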

This message was sent by Atlassian JIRA
