incubator-ctakes-dev mailing list archives

From Karthik Sarma <ksa...@ksarma.com>
Subject Multiple processing pipelines for cTAKES
Date Wed, 16 Jan 2013 22:48:02 GMT
Hi folks,

I know that the official position is that cTAKES is not thread-safe. I'm
wondering, however, whether anyone has looked into using multiple processing
pipelines (via the processingUnitThreadCount directive in a CPE descriptor)
and documenting where the thread-safety problems lie.
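
For anyone unfamiliar with the directive, here is a minimal sketch of where it
lives. It assumes the usual UIMA CPE layout, in which processingUnitThreadCount
is an attribute of the casProcessors element; the descriptor below is trimmed
to that one element, and the values 4 and 2 are illustrative, not taken from
any cTAKES descriptor. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class CpeThreadCountDemo {
    // Trimmed-down CPE descriptor: collection reader and casProcessor
    // entries are omitted; attribute values are illustrative only.
    static final String DESCRIPTOR =
          "<cpeDescription xmlns=\"http://uima.apache.org/resourceSpecifier\">"
        + "<casProcessors casPoolSize=\"4\" processingUnitThreadCount=\"2\">"
        + "</casProcessors>"
        + "</cpeDescription>";

    // Reads the number of processing pipelines requested by the descriptor.
    static int threadCount(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element processors =
                (Element) doc.getElementsByTagName("casProcessors").item(0);
            return Integer.parseInt(processors.getAttribute("processingUnitThreadCount"));
        } catch (Exception e) {
            throw new IllegalStateException("could not read descriptor", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("pipelines requested: " + threadCount(DESCRIPTOR));
    }
}
```

Raising the attribute above 1 is what asks the CPE for more than one pipeline
instance.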

I've given it a bit of a try, and at first glance the biggest issue seems
to be in the LVG API, which isn't at all thread-safe. (They seem to claim
that it would be thread-safe so long as API instances are not shared, but
that doesn't seem prima facie true, since it throws errors when multiple
pipelines are used, which *should* be creating multiple LVG API instances.)
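
To make the "instances are not shared" condition concrete, here is a minimal
sketch of the one-instance-per-thread pattern using ThreadLocal. It is not LVG
code: FakeLexicalTool is a hypothetical stand-in for a non-thread-safe API
object, and the other names are likewise made up for illustration. If the real
library still fails under this pattern, one plausible explanation (an
assumption, not something verified here) is static or otherwise shared state
behind the instances.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a non-thread-safe API object (NOT the real LVG API).
class FakeLexicalTool {
    // Mutable per-instance scratch state: unsafe if two threads share one instance.
    private final StringBuilder scratch = new StringBuilder();

    String normalize(String term) {
        scratch.setLength(0);
        scratch.append(term.trim().toLowerCase());
        return scratch.toString();
    }
}

class PerThreadToolDemo {
    // Each worker thread lazily creates, then reuses, its own instance.
    private static final ThreadLocal<FakeLexicalTool> TOOL =
        ThreadLocal.withInitial(FakeLexicalTool::new);

    // Runs `tasks` calls on `threads` workers; returns how many distinct
    // tool instances were actually used.
    static int countInstancesUsed(int threads, int tasks) {
        Set<FakeLexicalTool> seen = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                FakeLexicalTool tool = TOOL.get();
                tool.normalize("  Example Term ");
                seen.add(tool); // identity-based: equals/hashCode are not overridden
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("instances used: " + countInstancesUsed(4, 100));
    }
}
```

The number of instances never exceeds the number of worker threads, so no
instance is touched by two threads.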

I haven't found any other serious issues, but perhaps you folks might be
familiar with some.

There is, of course, the memory issue: cTAKES' memory footprint alone on
my machine, with a single pipeline and a MySQL UMLS database, is over
2GB. This is presumably the cost of each pipeline, though I can't
figure out what all that memory is being used for, since none of the
in-memory DBs and indexes used seem to be anywhere near that size.
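
A rough way to see which initialization step accounts for the heap is to sample
used heap before and after it. The sketch below is only an approximation under
stated assumptions: it measures JVM heap, not total process memory, the
garbage collector can skew the numbers, and the 64 MB allocation is a
placeholder for whatever resource is being loaded. HeapUsageDemo is a
hypothetical name. A heap histogram or heap dump (for example via jmap) would
give a more reliable breakdown.

```java
class HeapUsageDemo {
    // Approximate bytes of heap currently in use by this JVM.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();
        // Placeholder for loading a real resource (dictionary, index, model).
        byte[] resource = new byte[64 * 1024 * 1024];
        long after = usedHeapBytes();
        System.out.println("approx. heap growth in MB: "
            + (after - before) / (1024 * 1024)
            + " (resource length " + resource.length + ")");
    }
}
```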

It is, of course, possible to split datasets and simply run multiple
processes, but my feeling is that there must be a lot of unnecessary
overhead there since all the operations we actually do (other than the CAS
consumers) are read-only. It seems to me that cTAKES ought to be limited
only by disk/memory throughput and total CPU capacity because of the nature
of the load...
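
The split-and-run-multiple-processes workaround amounts to something like the
following minimal sketch, which deals documents round-robin into one shard per
process. ShardDemo, shard, and the note file names are hypothetical; how each
shard is then handed to a separate cTAKES process is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ShardDemo {
    // Splits items round-robin into `shards` lists of near-equal size.
    static <T> List<List<T>> shard(List<T> items, int shards) {
        if (shards < 1) {
            throw new IllegalArgumentException("shards must be >= 1");
        }
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < shards; i++) {
            out.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            out.get(i % shards).add(items.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "note1.txt", "note2.txt", "note3.txt", "note4.txt", "note5.txt");
        // One shard per independent process.
        System.out.println(shard(docs, 2));
    }
}
```

Each process started this way loads its own copy of every read-only resource,
which is the overhead described above.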

Anyway, if anyone else has thoughts, I'd love to hear them. This is something
I'd be interested in taking a stab at resolving, since I've been poking
around in this direction behind the scenes for some time now. My group has
access to huge databases but limited computational resources, and I'd like
to make the most of what we've got!

Karthik


--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical
Association
ksarma@ksarma.com
gchat: ksarma@gmail.com
linkedin: www.linkedin.com/in/ksarma
