From: "Miller, Timothy"
To: "dev@ctakes.apache.org"
Subject: lvg entries
Date: Thu, 17 Apr 2014 16:27:22 +0000

The LVG annotator creates an enormous number of "lemmas" for every
WordToken in the CAS, and I'm wondering what the original purpose was? I
think this is probably a minor bottleneck for speed but mostly a pretty
big space hog (at least 50% of the space of xmi files in my tests).

As of right now I'm not sure if any downstream components are using
these lemmas, and on a manual inspection the precision seems to be
pretty abysmal (meaning most of them are nonsensical as lexical
variants), so as I said, just wondering if we can revisit why cTAKES
generates so many and whether that component can be optimized.

Thanks
Tim