Message-ID: <5282B6D3.2070505@childrens.harvard.edu>
Date: Tue, 12 Nov 2013 18:16:35 -0500
From: Tim Miller
Reply-To: dev@ctakes.apache.org
Subject: getContextMap() question

I'm running the default pipeline on some large files and trying to speed up some of the slower annotators. I changed ChunkAdjuster to use uimaFIT selectors, which dramatically improves speed on large files. I also removed the OverlapAnnotator, with its complicated interface and extreme generality, from my pipeline altogether and replaced it with a 3-line static annotator. I think we should consider doing the same for the default pipeline, even if there are good reasons to keep the general-purpose annotator around.

Anyway, now I'm at the dictionary lookup, which I suspect will be the slowest component. One call, to getContextMap(), seems especially slow. It is called for every LookupWindow and, given the span of that window, iterates over all LookupWindows looking for one with the same span, so the total work grows quadratically with the number of windows. In the end you give it a lookup window and it basically gives you the same one back.
Of course, the code is written very generally, so there may be use cases where the types are different, but in the default case it seems a little odd for something that effectively does nothing to take so long.

So, my question is: does anyone know what the engineering goals of this setup are? I think it can be optimized even within the super-general framework it is trying to maintain, but I don't want to break anything by making assumptions that aren't valid.

Thanks
Tim
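
As a rough illustration of the kind of optimization in question (this is not the actual cTAKES code): if the annotations are indexed once by their begin/end offsets, each per-window lookup becomes a hash probe instead of a scan over every window. Span and SpanIndex below are hypothetical stand-ins, not cTAKES or UIMA classes, and the sketch assumes that only exact-span matches are needed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an annotation: a type name plus character offsets. */
final class Span {
    final String type;
    final int begin;
    final int end;

    Span(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

/** Hypothetical index that groups spans by their exact (begin, end) offsets. */
final class SpanIndex {
    private final Map<Long, List<Span>> byOffsets = new HashMap<>();

    /** Builds the index in a single pass over all spans. */
    SpanIndex(Iterable<Span> spans) {
        for (Span s : spans) {
            byOffsets.computeIfAbsent(key(s.begin, s.end), k -> new ArrayList<>()).add(s);
        }
    }

    /** Packs two int offsets into one long so the pair can serve as a map key. */
    static long key(int begin, int end) {
        return ((long) begin << 32) | (end & 0xFFFFFFFFL);
    }

    /** Returns every indexed span with exactly these offsets (expected constant time). */
    List<Span> withSameOffsets(int begin, int end) {
        return byOffsets.getOrDefault(key(begin, end), Collections.<Span>emptyList());
    }

    /** Baseline for comparison: scan every span on each call, as described in the message. */
    static List<Span> linearScan(List<Span> all, int begin, int end) {
        List<Span> matches = new ArrayList<>();
        for (Span s : all) {
            if (s.begin == begin && s.end == end) {
                matches.add(s);
            }
        }
        return matches;
    }
}
```

With this shape, the index is built once per document and then queried once per window, so the total cost is roughly linear in the number of annotations rather than quadratic. Keeping a list per key preserves the general case where annotations of different types share a span.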