incubator-ctakes-dev mailing list archives

From "Chen, Pei" <Pei.C...@childrens.harvard.edu>
Subject Re: Multiple processing pipelines for cTAKES
Date Thu, 17 Jan 2013 01:50:13 GMT
Hi Sarma,
I encountered the same issue(s) with LVG with multiple threads in the same JVM process. We've
been scaling out by spawning off multiple pipelines in different processes. 
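The process-level scale-out described above can be sketched in miniature. This is an assumption-laden illustration, not cTAKES code: the pipeline main class name and shard paths are hypothetical, and the point is only the pattern of launching one JVM per data shard.

```java
// Sketch: fan a sharded document set out across separate JVM pipeline
// processes. The main class and shard names below are hypothetical.
import java.util.ArrayList;
import java.util.List;

public class FanOut {
    // Build one java command line per input shard.
    static List<List<String>> buildCommands(List<String> shards, int heapGb) {
        List<List<String>> cmds = new ArrayList<>();
        for (String shard : shards) {
            cmds.add(List.of("java", "-Xmx" + heapGb + "g",
                    "org.example.HypotheticalPipelineMain", "--input", shard));
        }
        return cmds;
    }

    public static void main(String[] args) {
        for (List<String> cmd : buildCommands(List.of("shard0", "shard1"), 3)) {
            System.out.println(String.join(" ", cmd));
            // To actually launch: new ProcessBuilder(cmd).inheritIO().start();
        }
    }
}
```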
However, it would be interesting to identify which components are not thread safe and
take advantage of running multiple pipeline threads in the same process.
Another area for optimization, as you pointed out, is the memory footprint. It would be good if
someone had a chance to profile the memory usage and see whether we could lower the footprint; my
initial hunch is that all of the models are loaded into memory as a cache.
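One quick way to test that hunch is to bracket a suspected model-loading step with heap snapshots. A rough sketch, not cTAKES-specific: loadModels() here is a stand-in allocation for whatever initialization (dictionaries, LVG tables, classifier models) you want to measure, and GC-based numbers are only approximate.

```java
// Sketch: measure the heap cost of a suspected model-loading step.
// loadModels() is a hypothetical stand-in so the example is self-contained.
public class HeapDelta {
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // best-effort hint; makes the numbers less noisy
        return rt.totalMemory() - rt.freeMemory();
    }

    static byte[] loadModels() {
        // Stand-in for real model/dictionary loading: hold ~16 MB.
        return new byte[16 * 1024 * 1024];
    }

    public static void main(String[] args) {
        long before = usedHeap();
        byte[] models = loadModels();
        long after = usedHeap();
        System.out.printf("model load cost ~%d MB (%d bytes held)%n",
                (after - before) >> 20, models.length);
    }
}
```

For a running pipeline, a heap histogram from standard JDK tooling (e.g. jmap) would answer the same question without code changes.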
If you're interested, feel free to open a Jira so the work can be tracked and you can get credit
for the contributions.
-Pei


On Jan 16, 2013, at 5:49 PM, "Karthik Sarma" <ksarma@ksarma.com> wrote:

> Hi folks,
> 
> I know that the official position is that cTAKES is not thread-safe. I'm
> wondering, however, if anyone has looked into using multiple processing
> pipelines (via the processingUnitThreadCount directive in a CPE descriptor)
> and documenting where the thread-safety problems lie.
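For reference, that knob lives on the casProcessors element of a UIMA CPE descriptor. A minimal fragment might look like the following; the pool size and thread count values are illustrative (the CAS pool is generally sized at least as large as the thread count):

```xml
<!-- CPE descriptor fragment: run 4 processing threads in one JVM. -->
<casProcessors casPoolSize="5" processingUnitThreadCount="4">
  <casProcessor deployment="integrated" name="AggregateAnalysisEngine">
    <!-- analysis engine descriptor, error handling, etc. -->
  </casProcessor>
</casProcessors>
```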
> 
> I've given it a bit of a try, and at first glance the biggest issue seems
> to be in the LVG API, which isn't at all thread-safe (they seem to claim
> that it is thread-safe so long as API instances are not shared, but that
> doesn't seem prima facie true, since it throws errors when multiple
> pipelines are used, which *should* each create their own LVG API instance).
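If the not-shared claim does hold, one workaround worth testing is pinning one API instance to each pipeline thread via a ThreadLocal. A sketch of the pattern with a stand-in class; LvgHandle below is hypothetical and merely tags results with the owning thread, where the real code would wrap the actual LVG API object:

```java
// Sketch: one LVG-style handle per pipeline thread via ThreadLocal.
// LvgHandle is a hypothetical stand-in for the real (non-thread-safe) API.
public class PerThreadLvg {
    static class LvgHandle {
        private final long ownerId = Thread.currentThread().getId();
        String normalize(String term) {
            // The real LVG normalization call would go here; we just
            // lowercase and tag with the owning thread's id.
            return term.toLowerCase() + "@" + ownerId;
        }
    }

    // Each thread gets its own lazily created instance; none are shared.
    private static final ThreadLocal<LvgHandle> LVG =
            ThreadLocal.withInitial(LvgHandle::new);

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () ->
                System.out.println(LVG.get().normalize("Tumors"));
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}
```

Whether this actually clears the errors would itself be a useful data point for the Jira.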
> 
> I haven't found any other serious issues, but perhaps you folks might be
> familiar with some.
> 
> There is, of course, the memory issue: cTAKES' memory footprint alone on
> my machine, with a single pipeline and using a MySQL UMLS database, is over
> 2GB; this is presumably the cost of each pipeline, though I can't really
> figure out what all that memory is being used for, since none of the
> in-memory DBs and indexes in use seem to be anywhere near that size.
> 
> It is, of course, possible to split datasets and simply run multiple
> processes, but my feeling is that there must be a lot of unnecessary
> overhead there since all the operations we actually do (other than the CAS
> consumers) are read-only. It seems to me that cTAKES ought to be limited
> only by disk/memory throughput and total CPU capacity because of the nature
> of the load...
> 
> Anyway, if anyone else has thoughts, I'd be interested. This is something
> I'd be interested in taking a stab at resolving, since I've been poking
> around in this direction behind the scenes for some time now. My group has
> access to huge databases but limited computational resources, and I'd like
> to make the most of what we've got!
> 
> Karthik
> 
> 
> --
> Karthik Sarma
> UCLA Medical Scientist Training Program Class of 20??
> Member, UCLA Medical Imaging & Informatics Lab
> Member, CA Delegation to the House of Delegates of the American Medical
> Association
> ksarma@ksarma.com
> gchat: ksarma@gmail.com
> linkedin: www.linkedin.com/in/ksarma
