ctakes-dev mailing list archives

From "Finan, Sean" <Sean.Fi...@childrens.harvard.edu>
Subject RE: lvg entries
Date Fri, 18 Apr 2014 18:56:57 GMT
+1 false

-----Original Message-----
From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu] 
Sent: Friday, April 18, 2014 2:54 PM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Thanks for tracking that down Andy.

I am making a pass at uimaFIT-izing the configuration parameters for all the annotators in
the default pipeline, before I create the static factory methods we recently discussed.
Should I go ahead and change this so that the default behavior is false?

Tim
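
A minimal sketch of the behavior in question, assuming a boolean flag that gates whether lemma entries are posted at all. This is not the LvgAnnotator code: the class, method, and constant names are hypothetical, and the only detail taken from the thread is that the README describes PostLemmas as false by default.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of cTAKES.
final class PostLemmasSketch {
    // Assumed default, mirroring the README's statement that PostLemmas is false as distributed.
    static final boolean DEFAULT_POST_LEMMAS = false;

    // Returns the lemma entries that would be posted for one token.
    // When the flag is false, nothing is posted, which keeps the output small.
    static List<String> lemmasToPost(boolean postLemmas, List<String> candidateLemmas) {
        if (!postLemmas) {
            return new ArrayList<>();
        }
        return new ArrayList<>(candidateLemmas);
    }

    public static void main(String[] args) {
        // Made-up candidate list for a single token.
        List<String> candidates = Arrays.asList("node", "nodes", "nodal");
        System.out.println(lemmasToPost(DEFAULT_POST_LEMMAS, candidates).size());
        System.out.println(lemmasToPost(true, candidates).size());
    }
}
```

With the flag left at the assumed default, no lemma entries are produced; setting it to true passes every candidate through.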


On 04/18/2014 12:47 AM, andy mcmurry wrote:
> There is a lot of config handling; maybe PostLemmas is being set to true, or
> configInit() is not setting up the NLM wrapper correctly.
>
> ctakes-lvg *README*
> Note: as distributed, PostLemmas is set to false.  This is done to 
> reduce the size of the CAS.
> Set PostLemmas to true to have org.apache.ctakes.typesystem.type.Lemma
> annotations added to the CAS.
>
> *LvgAnnotator.xml *
> PostLemmas = True
>
> *LvgAnnotator.java*
> if (postLemmas) {
>      lvgResource.getLvgLex()
> }
>
> On Thu, Apr 17, 2014 at 3:23 PM, Masanz, James J. <Masanz.James@mayo.edu> wrote:
>
>> The normalizedForm field is filled in. It is used by dictionary lookup.
>>
>> So, for example, if the dictionary would contain "lymph node" but not 
>> "lymph nodes", a document with text of "lymph nodes" would match the 
>> dictionary entry "lymph node" because "node", being the normalized 
>> form of "nodes", would be used when searching dictionary entries (in 
>> addition to searching dictionary entries for "nodes")
>>
>> -----Original Message-----
>> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
>> Sent: Thursday, April 17, 2014 4:33 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: lvg entries
>>
>> Quick follow-up since I was interested. The current dependency parser 
>> does have the option to use ctakes lemmas or do its own lemmatizing, 
>> but that doesn't use the lemma field, it uses the normalizedForm 
>> field. I'm not sure if that field is actually ever filled in -- on my 
>> example data it is always null.
>>
>> Tim
>>
>> On 04/17/2014 01:57 PM, Masanz, James J. wrote:
>>> Offhand I recall at least one of the dependency parsers used the Lemma
>>> annotations at one point.
>>> Not sure if it still does.
>>>
>>> There is an option for turning off the posting of the lemmas to the CAS.
>>>
>>> Hope that helps
>>>
>>> -----Original Message-----
>>> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
>>> Sent: Thursday, April 17, 2014 11:27 AM
>>> To: dev@ctakes.apache.org
>>> Subject: lvg entries
>>>
>>> The LVG annotator creates an enormous number of "lemmas" for every 
>>> WordToken in the CAS, and I'm wondering what the original purpose 
>>> was? I think this is probably a minor bottleneck for speed but 
>>> mostly a pretty big space hog (at least 50% of the space of xmi files in my tests).
>>>
>>> As of right now I'm not sure if any downstream components are using 
>>> these lemmas, and on a manual inspection the precision seems to be 
>>> pretty abysmal (meaning most of them are nonsensical as lexical 
>>> variants), so as I said, just wondering if we can revisit why cTAKES 
>>> generates so many and whether that component can be optimized.
>>>
>>> Thanks
>>> Tim
>>>
>>>
>>
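
The "lymph nodes" example from the thread can be sketched roughly as follows. This is a toy illustration, not the cTAKES dictionary lookup: the class name, the one-entry dictionary, and the normalization table are all hypothetical stand-ins (a real pipeline would take normalized forms from LVG). As a simplification, it tries the all-surface-form phrase and then the all-normalized phrase, rather than mixing forms token by token.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of cTAKES.
final class NormalizedLookupSketch {
    // Toy dictionary that holds only the singular entry, as in the thread's example.
    static final Set<String> DICTIONARY = new HashSet<>(Arrays.asList("lymph node"));

    // Made-up normalization table standing in for normalized forms produced upstream.
    static final Map<String, String> NORMALIZED = new HashMap<>();
    static {
        NORMALIZED.put("nodes", "node");
    }

    // True if the text matches a dictionary entry by surface form or by normalized form.
    static boolean matches(String text) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        // First try the words exactly as written.
        if (DICTIONARY.contains(String.join(" ", tokens))) {
            return true;
        }
        // Then try again with each token replaced by its normalized form, if one is known.
        String[] normalized = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            normalized[i] = NORMALIZED.getOrDefault(tokens[i], tokens[i]);
        }
        return DICTIONARY.contains(String.join(" ", normalized));
    }

    public static void main(String[] args) {
        System.out.println(matches("lymph nodes"));
        System.out.println(matches("lymph node"));
        System.out.println(matches("liver nodule"));
    }
}
```

Here "lymph nodes" matches only through the normalized form "node", "lymph node" matches directly, and "liver nodule" matches neither way.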

