incubator-ctakes-dev mailing list archives

From Karthik Sarma <ksa...@ksarma.com>
Subject Re: Multiple processing pipelines for cTAKES
Date Wed, 23 Jan 2013 20:50:29 GMT
I am starting to think that the pipelines don't quite work the way I
thought, but I have not yet had a chance to run down what is going on. I
will keep you posted, or I'm happy to work together at your convenience.

On Wednesday, January 23, 2013, Kim Ebert wrote:

> Karthik,
>
> I was wondering if you have had any success combining the patches. Was
> the output equivalent?
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
>
> On 01/17/2013 01:15 PM, Karthik Sarma wrote:
>
>> Hi,
>>
>> Thanks Kim! I've been working on something similar myself, so I'll just go
>> ahead and combine the patches today and do some preliminary testing on one
>> of my datasets to see if the output is equivalent.
>>
>> Vijay -- that's quite interesting. I'm pretty sure I'm not actually using
>> any Lucene... I'm using DictionaryLookupAnnotatorDB configured for a local
>> tokenized UMLS install (with a SNOMED map table), and I've even gone so far
>> as to comment out everything related to the Lucene RxNorm/Orange Book
>> dictionaries in both that file and LookupDesc_Db, even though I'm pretty
>> sure that those dictionaries are tiny. Even so, my footprint is above 2GB.
>> I'll have to take a look to see if Pei is right about the models chewing up
>> all the memory.
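>>
>> To check that, I'll probably start with something like the sketch below,
>> which just measures how much heap the aggregate retains after
>> initialization. The descriptor path is from my setup, so point it at
>> whichever aggregate you actually run; gc() is only a hint, so treat the
>> number as a rough estimate.
>>
>> import org.apache.uima.UIMAFramework;
>> import org.apache.uima.analysis_engine.AnalysisEngine;
>> import org.apache.uima.util.XMLInputSource;
>>
>> public class FootprintCheck {
>>     public static void main(String[] args) throws Exception {
>>         Runtime rt = Runtime.getRuntime();
>>         rt.gc();
>>         long before = rt.totalMemory() - rt.freeMemory();
>>         // Initializing the aggregate loads the dictionaries and models.
>>         AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(
>>             UIMAFramework.getXMLParser().parseResourceSpecifier(
>>                 new XMLInputSource(
>>                     "desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml")));
>>         rt.gc();
>>         long after = rt.totalMemory() - rt.freeMemory();
>>         System.out.printf("pipeline retained roughly %d MB%n",
>>             (after - before) >> 20);
>>         ae.destroy();
>>     }
>> }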
>>
>> I suppose that one possibility is that for some reason using the "UMLS"
>> pipeline (with the web API) instead of the DB pipeline (with a local
>> install) has a much smaller memory footprint. I've found that using the
>> UMLS pipeline slows things down considerably for me, presumably because
>> the limiting factor becomes the web API throughput. Running a bunch of
>> them at once would certainly mitigate this factor, but I would think that
>> running a bunch against a local DB would be faster still.
>>
>> Karthik
>>
>>
>>
>>
>>
>> --
>> Karthik Sarma
>> UCLA Medical Scientist Training Program Class of 20??
>> Member, UCLA Medical Imaging & Informatics Lab
>> Member, CA Delegation to the House of Delegates of the American Medical
>> Association
>> ksarma@ksarma.com
>> gchat: ksarma@gmail.com
>> linkedin: www.linkedin.com/in/ksarma
>>
>>
>> On Thu, Jan 17, 2013 at 9:43 AM, Kim Ebert
>> <kim.ebert@perfectsearchcorp.com> wrote:
>>
>>> Hi Sarma and Pei,
>>>
>>> It appears LVG is using static variables for basic string functions.
>>>
>>> I've attached a patch that may allow multiple instances to be run in
>>> parallel; however, the library is still not thread-safe, i.e., you can't
>>> have multiple threads using the same instance.
>>>
>>> I haven't done adequate testing to see if this solves the entire problem,
>>> so use at your own risk.
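>>>
>>> If the patch holds up, the usage pattern would be one LvgCmdApi per
>>> pipeline thread, for example via a ThreadLocal. Here's a minimal sketch;
>>> the flow options and properties path are illustrative, so check the
>>> javadoc for your LVG release:
>>>
>>> import gov.nih.nlm.nls.lvg.Api.LvgCmdApi;
>>>
>>> public class LvgHolder {
>>>     // One LvgCmdApi per thread; even with the patch, a single
>>>     // instance must not be shared across threads.
>>>     private static final ThreadLocal<LvgCmdApi> LVG =
>>>             new ThreadLocal<LvgCmdApi>() {
>>>         @Override
>>>         protected LvgCmdApi initialValue() {
>>>             // Flow options and config path are placeholders.
>>>             return new LvgCmdApi("-f:l:w",
>>>                 "/path/to/data/config/lvg.properties");
>>>         }
>>>     };
>>>
>>>     public static String mutate(String term) throws Exception {
>>>         // Each thread only ever touches its own instance.
>>>         return LVG.get().MutateToString(term);
>>>     }
>>> }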
>>>
>>> The source code this patch applies to is available here:
>>>
>>> http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/2010/release/lvg2010.tgz
>>>
>>> Let me know how this works for you.
>>>
>>> Thanks,
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>>
>>>
>>> On 01/16/2013 06:50 PM, Chen, Pei wrote:
>>>
>>>> Hi Sarma,
>>>> I encountered the same issue(s) with LVG with multiple threads in the
>>>> same JVM process. We've been scaling out by spawning off multiple
>>>> pipelines in different processes.
>>>> However, it would be interesting to identify which components are not
>>>> thread-safe, so that we could take advantage of running multiple
>>>> pipelines in the same process.
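>>>>
>>>> For reference, the process-per-pipeline approach can be as simple as
>>>> launching one JVM per input shard. A rough sketch (paths and descriptor
>>>> names are placeholders; SimpleRunCPE ships with the UIMA examples):
>>>>
>>>> import java.util.ArrayList;
>>>> import java.util.List;
>>>>
>>>> public class ShardLauncher {
>>>>     public static void main(String[] args) throws Exception {
>>>>         List<Process> workers = new ArrayList<Process>();
>>>>         for (int shard = 0; shard < 4; shard++) {
>>>>             // One JVM per shard, each with its own CPE descriptor.
>>>>             workers.add(new ProcessBuilder(
>>>>                     "java", "-Xmx3g",
>>>>                     "-cp", System.getProperty("java.class.path"),
>>>>                     "org.apache.uima.examples.cpe.SimpleRunCPE",
>>>>                     "desc/cpe/ShardedCPE_" + shard + ".xml")
>>>>                 .inheritIO().start());
>>>>         }
>>>>         for (Process p : workers) {
>>>>             p.waitFor(); // wait for all shards to finish
>>>>         }
>>>>     }
>>>> }
>>>>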
>>>> Another area for optimization, as you pointed out, is the memory
>>>> footprint. It would be good if someone had a chance to profile the
>>>> memory usage and see if we could lower the footprint; my initial hunch
>>>> is that all of the models are loaded into memory as a cache.
>>>> If you're interested, feel free to open a Jira so it can be tracked and
>>>> you can get credit for the contributions.
>>>> -Pei
>>>>
>>>>
>>>> On Jan 16, 2013, at 5:49 PM, "Karthik Sarma" <ksarma@ksarma.com>
>>>> wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> I know that the official position is that cTAKES is not thread-safe.
>>>>> I'm wondering, however, if anyone has looked into using multiple
>>>>> processing pipelines (via the processingUnitThreadCount directive in a
>>>>> CPE descriptor) and documenting where the thread-safety problems lie.
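>>>>>
>>>>> For anyone who hasn't tried it: the same knob should also be reachable
>>>>> programmatically, if I'm reading the UIMA API right. A minimal sketch,
>>>>> with the descriptor path as a placeholder:
>>>>>
>>>>> import org.apache.uima.UIMAFramework;
>>>>> import org.apache.uima.collection.CollectionProcessingEngine;
>>>>> import org.apache.uima.collection.metadata.CpeDescription;
>>>>> import org.apache.uima.util.XMLInputSource;
>>>>>
>>>>> public class MultiPipelineRun {
>>>>>     public static void main(String[] args) throws Exception {
>>>>>         CpeDescription desc = UIMAFramework.getXMLParser()
>>>>>             .parseCpeDescription(new XMLInputSource("desc/cpe/MyCPE.xml"));
>>>>>         // Equivalent to processingUnitThreadCount="4" in the descriptor XML.
>>>>>         desc.getCpeCasProcessors().setConcurrentPUCount(4);
>>>>>         CollectionProcessingEngine cpe =
>>>>>             UIMAFramework.produceCollectionProcessingEngine(desc);
>>>>>         // process() is asynchronous; add a StatusCallbackListener
>>>>>         // to be notified when the collection finishes.
>>>>>         cpe.process();
>>>>>     }
>>>>> }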
>>>>>
>>>>> I've given it a bit of a try, and at first glance the biggest issue
>>>>> seems to be in the LVG API, which isn't at all thread-safe (they seem
>>>>> to claim that it would be thread-safe so long as API instances are not
>>>>> shared, but that doesn't seem prima facie true, since it throws errors
>>>>> when multiple pipelines are used, which *should* be creating multiple
>>>>> LVG API instances).
>>>>>
>>>>> I haven't found any other serious issues, but perhaps you folks might
>>>>> be familiar with some.
>>>>>
>>>>> There is, of course, the memory issue -- cTAKES' memory footprint
>>>>> alone on my machine, with a single pipeline and a local MySQL UMLS
>>>>> database, is over 2GB. This is presumably the cost of each pipeline,
>>>>> though I can't actually figure out what all that memory is being used
>>>>> for, since none of the in-memory DBs and indexes seem to be anywhere
>>>>> near that size.
>>>>>
>>>>> It is, of course, possible to split datasets and simply run multiple
>>>>> processes, but my feeling is that there must be a lot of unnecessary
>>>>> overhead there, since all the operations we actually do (other than the
>>>>> CAS consumers) are read-only. It seems to me that cTAKES ought to be
>>>>> limited only by disk/memory throughput and total CPU capacity because
>>>>> of the nature of the load...
>>>>>
>>>>> Anyway, if anyone else has thoughts, I'd be interested. This is
>>>>> something I'd like to take a stab at resolving, since I've been poking
>>>>> around in this direction behind the scenes for some time now. My group
>>>>> has access to huge databases but limited computational resources, and
>>>>> I'd like to make the most of what we've got!
>>>>>
>>>>> Karthik
>>>>>
>>>>>
>>>>> --
>>>>> Karthik Sarma
>>>>> UCLA Medical Scientist Training Program Class of 20??
>>>>> Member, UCLA Medical Imaging & Informatics Lab
>>>>> Member, CA Delegation to the House of Delegates of the American Medical
>>>>> Association
>>>>> ksarma@ksarma.com
>>>>> gchat: ksarma@gmail.com
>>>>> linkedin: www.linkedin.com/in/ksarma
>>>>>
>>>>>

-- 
Sent from Gmail Mobile
