Date: Thu, 17 Apr 2014 22:16:58 +0000
From: "Masanz, James J."
Subject: RE: lvg entries
To: "'dev@ctakes.apache.org'"

Before the switch to OpenNLP (which was done before the first open source release of cTAKES), I believe the Lemma annotations were used by the POS tagger and/or phrasal parser. As far as I know, that was the original intention of the Lemmas. I believe they were turned off by default for some releases, until someone started to use them (or at least look at maybe using them).

That's all just from memory. We'd have to look through histories to see when things changed.

I don't think the Lemma annotations were ever used for dictionary lookup. That used the (single) output of the normalizer function of the LVG component.

-----Original Message-----
From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
Sent: Thursday, April 17, 2014 3:34 PM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Thanks James. Does it ring a bell to you that the original intention was something like query expansion for a dictionary lookup?
Tim

On 04/17/2014 01:57 PM, Masanz, James J. wrote:
> Offhand I recall at least one of the dependency parsers used the Lemma annotations at one point.
> Not sure if still does.
>
> There is an option for turning off the posting of the lemmas to the cas.
>
> Hope that helps
>
> -----Original Message-----
> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
> Sent: Thursday, April 17, 2014 11:27 AM
> To: dev@ctakes.apache.org
> Subject: lvg entries
>
> The LVG annotator creates an enormous number of "lemmas" for every
> WordToken in the CAS, and I'm wondering what the original purpose was? I
> think this is probably a minor bottleneck for speed but mostly a pretty
> big space hog (at least 50% of the space of xmi files in my tests).
>
> As of right now I'm not sure if any downstream components are using
> these lemmas, and on a manual inspection the precision seems to be
> pretty abysmal (meaning most of them are nonsensical as lexical
> variants), so as I said, just wondering if we can revisit why cTAKES
> generates so many and whether that component can be optimized.
>
> Thanks
> Tim