From: Jen Seale
Date: Mon, 2 Mar 2015 17:06:32 -0500
Subject: Re: head word identification
To: dev@ctakes.apache.org

You could possibly use norm to normalize the entity text strings. I can't vouch for its accuracy at this point, though.

Jen Seale
Presidential Research Fellow, CUNY Graduate Center
512.705.4030

On Mon, Mar 2, 2015 at 11:29 AM, Dligach, Dmitriy <Dmitriy.Dligach@childrens.harvard.edu> wrote:

> Hello,
>
> Is anybody aware of a reliable way of identifying the head word of a UMLS
> entity? In the general domain, people often use Collins rules, but I'm not
> sure whether they would be applicable to clinical entities.
>
> Until recently I was under the impression that taking the last word of an
> entity would work pretty well, but now that I have looked at the data more
> closely, I am not so sure. E.g. it fails in these cases: "breast, left",
> "ductal carcinoma in situ", "carcinoma, consistent with breast primary".
>
> Dima
>
>
> Dmitriy (Dima) Dligach, Ph.D.
> Boston Children's Hospital and Harvard Medical School
> (617) 651-0397
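
For reference, the last-word heuristic described in the quoted message can be sketched as below. This is only an illustration: the function name `last_word_head` is hypothetical and not part of cTAKES, and the "intended" heads in the comments are my reading of the three examples, not something stated in the thread.

```python
def last_word_head(entity: str) -> str:
    """Naive heuristic: treat the final token of the entity text as its head."""
    # Split on whitespace and strip surrounding punctuation from each token.
    tokens = [t.strip(",;.") for t in entity.split()]
    tokens = [t for t in tokens if t]
    return tokens[-1] if tokens else ""


# The three entity strings quoted in the message, paired with the head a
# reader would probably intend (an assumption, not taken from the thread).
examples = {
    "breast, left": "breast",
    "ductal carcinoma in situ": "carcinoma",
    "carcinoma, consistent with breast primary": "carcinoma",
}

for text, intended in examples.items():
    guess = last_word_head(text)
    status = "ok" if guess == intended else "MISS"
    print(f"{text!r}: guessed {guess!r}, intended {intended!r} [{status}]")
```

All three strings come out as misses, since the head precedes a postposed modifier, a prepositional phrase, or a comma-separated qualifier.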