From: Tim Miller
Date: Tue, 6 Aug 2013 12:15:12 -0400
Subject: Re: Extracting Symptoms
Reply-To: user@ctakes.apache.org

I don't know of anyone who has done exactly what you're asking, but I think it's a really interesting idea. My first thought was that you could try the Finding typeID, which would be one level less granular than the TUIs. But that covers many more TUIs:
T033,T034,T040,T041,T042,T043,T044,T045,T046,T056,T057,T184

That set contains T184, but also the noisier T033, along with many others (though not T047, going by the list above). So that would make your problem worse.

Unfortunately, from what you're saying, it sounds like the UMLS doesn't have the granularity you need to represent only the findings you're interested in.

Are there any examples of the types of things that come up from T033 and T047 that you aren't interested in? I'm wondering whether there's a pattern you could write rules for, so that you can over-generate and then filter with those rules. Just throwing out a simple idea.
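
To make the over-generate-then-filter idea a bit more concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the Mention class, the exclusion patterns, and the sample terms are made up, and none of it is cTAKES code or a verified UMLS assignment. Real rules would have to come from looking at the T033 and T047 terms you don't want.

```python
import re
from dataclasses import dataclass

# TUIs to over-generate from: Sign or Symptom, Finding, Disease or Syndrome.
CANDIDATE_TUIS = {"T184", "T033", "T047"}

# Hypothetical exclusion rules per TUI. These patterns are placeholders;
# real ones would be derived from reviewing the unwanted terms.
EXCLUDE_RULES = {
    "T033": [re.compile(r"\bhistory of\b", re.I), re.compile(r"\bnormal\b", re.I)],
    "T047": [re.compile(r"\bdisease\b", re.I)],
}


@dataclass(frozen=True)
class Mention:
    """A term found in a document, with the TUI it was matched under."""
    text: str
    tui: str


def is_symptom(mention):
    """Keep a mention if its TUI is a candidate and no exclusion rule matches."""
    if mention.tui not in CANDIDATE_TUIS:
        return False
    rules = EXCLUDE_RULES.get(mention.tui, [])
    return not any(rule.search(mention.text) for rule in rules)


def filter_symptoms(mentions):
    """Apply the rules to an over-generated list of mentions."""
    return [m for m in mentions if is_symptom(m)]


if __name__ == "__main__":
    # Illustrative input only, not real cTAKES output.
    sample = [
        Mention("nausea", "T184"),
        Mention("normal weight", "T033"),
        Mention("wheezing", "T033"),
        Mention("heart disease", "T047"),
    ]
    for kept in filter_symptoms(sample):
        print(kept.tui, kept.text)
```

T184 has no exclusion rules here, so everything under it passes through, while the noisier TUIs only contribute terms that survive their rules. Comparing the filtered output against the manually abstracted symptoms would show which rules still need adding.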

Tim


Do you think if you moved to one level more abstract you would get too much?
On 08/06/2013 11:47 AM, Bohne, Jacqueline R wrote:

We are trying to create a cTAKES process that will extract all symptoms from our documents.  In our first attempt, we used the UMLS dictionary and pulled anything with a TUI of T184 (Sign or Symptom).  While this worked, we found that when we compared it to what our Research Coordinators manually abstracted as symptoms, there were quite a few differences.  When we looked into these differences we found a lot of the extra terms were considered either Findings (T033) or Disease or Syndrome (T047) in UMLS.  We would rather not just add these TUIs to our NLP process because then we would end up with many more terms than just symptoms in our results. 

 

Has anyone else tried to create a database of symptoms using NLP?  Or are you aware of a better solution for creating a symptoms database?

 

Thank you for your time!

 

Thanks,

Jacquie Bohne

Research Programmer/Analyst

Marshfield Clinic

