incubator-ctakes-dev mailing list archives

From Karthik Sarma <>
Subject Re: Multiple processing pipelines for cTAKES
Date Wed, 23 Jan 2013 20:50:29 GMT
I am starting to think that the pipelines don't quite work the way I
thought, but I have not yet had a chance to run down what is going on. I
will keep you posted, or I am happy to work together at your convenience.

On Wednesday, January 23, 2013, Kim Ebert wrote:

> Karthik,
> I was wondering if you have had any success in combining the patches? Was
> the output equivalent?
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
> On 01/17/2013 01:15 PM, Karthik Sarma wrote:
>> Hi,
>> Thanks Kim! I've been working on something similar myself, so I'll just go
>> ahead and combine the patches today and do some preliminary testing on one
>> of my datasets to see if the output is equivalent.
>> Vijay -- that's quite interesting. I'm pretty sure I'm not actually using
>> any Lucene... I'm using DictionaryLookupAnnotatorDB configured for a local
>> tokenized UMLS install (with a SNOMED map table), and I've even gone so far
>> as to comment out everything related to the Lucene RxNorm/Orange Book
>> dictionaries in both that file as well as LookupDesc_Db, even though I'm
>> pretty sure that those dictionaries are tiny. Even so, my footprint is
>> above 2GB. I'll have to take a look to see if Pei is right about the
>> models chewing up all the memory.
>> I suppose that one possibility is that for some reason using the "UMLS"
>> pipeline (with the web API) instead of the DB pipeline (with a local
>> install) has a much smaller memory footprint. I've found that using the
>> UMLS pipeline slows things down considerably for me, presumably because
>> the limiting factor becomes the web API throughput. Running a bunch of
>> them at once would certainly mitigate this factor, but I would think that
>> running a bunch against a local DB would be faster still.
>> Karthik
>> --
>> Karthik Sarma
>> UCLA Medical Scientist Training Program Class of 20??
>> Member, UCLA Medical Imaging & Informatics Lab
>> Member, CA Delegation to the House of Delegates of the American Medical
>> Association
>> gchat:
>> linkedin:
>> On Thu, Jan 17, 2013 at 9:43 AM, Kim Ebert
>> <> wrote:
>>> Hi Sarma and Pei,
>>> It appears LVG is using static variables for basic string functions.
>>> I've attached a patch that may allow multiple instances to be run in
>>> parallel; however, the library is still not thread safe, i.e., you can't
>>> have multiple threads using the same instance.
>>> I haven't done adequate testing to see if this solves the entire problem,
>>> so use at your own risk.
>>> The source code this patch applies to is available here.
>>> LexSysGroup/Projects/lvg/2010/release/lvg2010.tgz
>>> Let me know how this works for you.
>>> Thanks,
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>> On 01/16/2013 06:50 PM, Chen, Pei wrote:
>>>  Hi Sarma,
>>>> I encountered the same issue(s) with LVG with multiple threads in the
>>>> same JVM process. We've been scaling out by spawning off multiple
>>>> pipelines in different processes.
>>>> However, it would be interesting to identify which components are not
>>>> thread safe and take advantage of spawning multiple components in the
>>>> same process.
>>>> Another area for optimization, as you pointed out, is the memory
>>>> footprint. It would be good if someone had a chance to profile the
>>>> memory usage and see if we could lower the footprint; my initial hunch
>>>> is that all of the models are loaded into memory as a cache.
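For what it's worth, a crude heap check around the suspect loading step can confirm that hunch before reaching for a real profiler (VisualVM, jmap, etc.). This sketch uses only java.lang.Runtime; the stand-in allocation marks where the real dictionary/model loading would go:

```java
// Crude before/after heap measurement; a profiler gives better numbers,
// but this is enough to confirm a hunch about model loading.
public class HeapCheck {
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // best-effort; makes the before/after delta less noisy
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();
        // ... load the dictionary / models here ...
        byte[] standIn = new byte[16 * 1024 * 1024]; // stand-in allocation
        long after = usedHeapBytes();
        System.out.printf("delta: %.1f MB%n", (after - before) / 1048576.0);
        if (standIn.length == 0) throw new AssertionError(); // keep standIn live
    }
}
```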
>>>> If you're interested, feel free to open a Jira so it can be tracked
>>>> and you can get credit for the contributions.
>>>> -Pei
>>>> On Jan 16, 2013, at 5:49 PM, "Karthik Sarma"<>
>>>> wrote:
>>>>   Hi folks,
>>>>> I know that the official position is that cTAKES is not thread-safe.
>>>>> I'm wondering, however, if anyone has looked into using multiple
>>>>> processing pipelines (via the processingUnitThreadCount directive in a
>>>>> CPE descriptor) and documenting where the thread safety problems lie.
>>>>> I've given it a bit of a try, and on first glance the biggest issue
>>>>> seems to be in the LVG API, which isn't at all thread-safe (they seem
>>>>> to claim that it would be thread-safe so long as API instances are not
>>>>> shared, but that doesn't seem prima facie true, since it throws errors
>>>>> when multiple pipelines are used, which *should* be creating multiple
>>>>> LVG API instances).
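If the shared-state issue were fixed, the one-instance-per-pipeline-thread idea could be enforced with a ThreadLocal. A sketch, with a hypothetical Normalizer standing in for the LVG API class (this pattern only helps once the library actually keeps its state per-instance):

```java
// Sketch: give each pipeline thread its own instance of a
// non-thread-safe API via ThreadLocal. Normalizer is a made-up stand-in.
public class PerThreadApi {
    static class Normalizer { // hypothetical stand-in for the LVG API class
        String normalize(String term) { return term.toLowerCase(); }
    }

    // One instance per thread, created lazily on first use.
    private static final ThreadLocal<Normalizer> LOCAL =
            ThreadLocal.withInitial(Normalizer::new);

    public static String normalize(String term) {
        return LOCAL.get().normalize(term); // each thread sees only its own instance
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable worker = () -> System.out.println(
                Thread.currentThread().getName() + ": " + normalize("Aspirin"));
        Thread t1 = new Thread(worker), t2 = new Thread(worker);
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}
```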
>>>>> I haven't found any other serious issues, but perhaps you folks
>>>>> might be familiar with some.
>>>>> There is, of course, the memory issue -- cTAKES' memory footprint
>>>>> alone on my machine with a single pipeline and using a MySQL UMLS
>>>>> database is over 2GB; this is presumably the cost of each pipeline,
>>>>> though I can't actually figure out what all that memory is being used
>>>>> for, since none of the in-memory DBs and indexes used seem to be
>>>>> anywhere near that size.
>>>>> It is, of course, possible to split datasets and simply run multiple
>>>>> processes, but my feeling is that there must be a lot of unnecessary
>>>>> overhead there, since all the operations we actually do (other than
>>>>> the CAS consumers) are read-only. It seems to me that cTAKES ought to
>>>>> be limited only by disk/memory throughput and total CPU capacity,
>>>>> because of the nature of the load...
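The same split could be done inside one process once the thread-safety issues are sorted: shard the document list across a fixed thread pool, with the read-only resources (dictionary, models) shared and each worker owning its own non-thread-safe pieces. Everything below (document names, shard count) is illustrative:

```java
// Sketch: round-robin sharding of a document list across N worker
// threads in one process; read-only resources would be shared.
import java.util.*;
import java.util.concurrent.*;

public class ShardedRun {
    // Round-robin split of docs into n shards.
    static List<List<String>> shard(List<String> docs, int n) {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < n; i++) shards.add(new ArrayList<>());
        for (int i = 0; i < docs.size(); i++) shards.get(i % n).add(docs.get(i));
        return shards;
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = Arrays.asList("note1", "note2", "note3", "note4", "note5");
        int n = 2;
        ExecutorService pool = Executors.newFixedThreadPool(n);
        for (List<String> shard : shard(docs, n)) {
            // Each worker would own its own pipeline instance here.
            pool.submit(() -> shard.forEach(d ->
                    System.out.println(Thread.currentThread().getName() + " -> " + d)));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```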
>>>>> Anyway, if anyone else has thoughts, I'd be interested. This is
>>>>> something I'd be interested in taking a stab at resolving, since I've
>>>>> been poking around in this direction behind the scenes for some time
>>>>> now. My group has access to huge databases but limited computational
>>>>> resources, and I'd like to make the most of what we've got!
>>>>> Karthik
>>>>> --
>>>>> Karthik Sarma
>>>>> UCLA Medical Scientist Training Program Class of 20??
>>>>> Member, UCLA Medical Imaging & Informatics Lab
>>>>> Member, CA Delegation to the House of Delegates of the American Medical
>>>>> Association
>>>>> gchat:
>>>>> linkedin:

Sent from Gmail Mobile
