Mailing-List: contact user-help@ctakes.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@ctakes.apache.org
Received-SPF: pass (athena.apache.org: domain of
 Timothy.Miller@childrens.harvard.edu designates 134.174.13.92 as permitted
 sender)
Message-ID: <52012110.7090703@childrens.harvard.edu>
Date: Tue, 6 Aug 2013 12:15:12 -0400
From: Tim Miller <timothy.miller@childrens.harvard.edu>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:17.0) Gecko/20130623 Thunderbird/17.0.7
MIME-Version: 1.0
To: <user@ctakes.apache.org>
Subject: Re: Extracting Symptoms
References: <739EEBD7287586409FB46BBDFEE89B624BCD61CD@MCL-EXMB02.mfldclin.org>
In-Reply-To: 
 <739EEBD7287586409FB46BBDFEE89B624BCD61CD@MCL-EXMB02.mfldclin.org>
Content-Type: multipart/alternative;
	boundary="------------070107000808050306000004"

--------------070107000808050306000004
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit

I don't know of anyone who has done exactly what you're asking, but I 
think it's a really interesting idea. My first thought was that you 
could try the Finding typeID, which is one level less granular than the 
TUIs. But that covers many more TUIs:
T033,T034,T040,T041,T042,T043,T044,T045,T046,T056,T057,T184

That set contains T184, but also the noisier T033 and T047, along with 
many others! So that would make your problem worse.

Unfortunately, from what you're saying, it sounds like the UMLS 
doesn't have the granularity in the places where you need it to 
represent only the findings that you're interested in.

Are there any examples of the types of things that come up from T033 and 
T047 that you aren't interested in? I'm wondering if there's a pattern 
that you could write rules to detect, so that you can over-generate 
and then filter with those rules. Just throwing out a simple idea.
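
To make the over-generate-and-filter idea a bit more concrete, here is a 
rough sketch in Python. Everything in it is an assumption for 
illustration: the Mention class, the two reject rules, and the keyword 
lists are hypothetical and are not part of cTAKES or the UMLS. In 
practice the rules would have to come from looking at the T033 and T047 
terms you actually want to exclude.

```python
from dataclasses import dataclass

# Over-generate: accept Sign or Symptom plus the two noisier types.
CANDIDATE_TUIS = {"T184", "T033", "T047"}


@dataclass(frozen=True)
class Mention:
    """Hypothetical stand-in for one dictionary hit: covered text plus its TUI."""
    text: str
    tui: str


def looks_like_lab_result(m: Mention) -> bool:
    # Hypothetical rule: Findings that read like measurements, not symptoms.
    words = ("level", "count", "test")
    return m.tui == "T033" and any(w in m.text.lower() for w in words)


def looks_like_named_disease(m: Mention) -> bool:
    # Hypothetical rule: T047 terms that name a disease outright.
    return m.tui == "T047" and m.text.lower().endswith(("disease", "syndrome"))


REJECT_RULES = [looks_like_lab_result, looks_like_named_disease]


def filter_symptoms(mentions):
    """Keep candidate-TUI mentions that no reject rule fires on."""
    candidates = [m for m in mentions if m.tui in CANDIDATE_TUIS]
    return [m for m in candidates if not any(rule(m) for rule in REJECT_RULES)]
```

Each rule only looks at the TUI it was written for, so adding a rule for 
T033 noise cannot accidentally throw away T184 hits.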

Tim


Do you think if you moved to one level more abstract you would get too 
much?
On 08/06/2013 11:47 AM, Bohne, Jacqueline R wrote:
>
> We are trying to create a cTAKES process that will extract all 
> symptoms from our documents.  In our first attempt, we used the UMLS 
> dictionary and pulled anything with a TUI of T184 (Sign or Symptom).  
> While this worked, we found that when we compared it to what our 
> Research Coordinators manually abstracted as symptoms, there were 
> quite a few differences.  When we looked into these differences we 
> found a lot of the extra terms were considered either Findings (T033) 
> or Disease or Syndrome (T047) in UMLS.  We would rather not just add 
> these TUIs to our NLP process because then we would end up with many 
> more terms than just symptoms in our results.
>
> Has anyone else tried to create a database of symptoms using NLP?  Or 
> are you aware of a better solution for creating a symptoms database?
>
> Thank you for your time!
>
> Thanks,
>
> Jacquie Bohne
>
> Research Programmer/Analyst
>
> Marshfield Clinic
>


--------------070107000808050306000004
Content-Type: text/html; charset="ISO-8859-1"
Content-Transfer-Encoding: 7bit

<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    I don't know of anyone who has done exactly what you're asking, but I
    think it's a really interesting idea. My first thought was that you
    could try the Finding typeID, which is one level less granular than
    the TUIs. But that covers many more TUIs:<br>
    T033,T034,T040,T041,T042,T043,T044,T045,T046,T056,T057,T184<br>
    <br>
    That set contains T184, but also the noisier T033 and T047, along with
    many others! So that would make your problem worse.<br>
    <br>
    Unfortunately, from what you're saying, it sounds like the UMLS
    doesn't have the granularity in the places where you need it to
    represent only the findings that you're interested in.<br>
    <br>
    Are there any examples of the types of things that come up from T033
    and T047 that you aren't interested in? I'm wondering if there's a
    pattern that you could write rules to detect, so that you can
    over-generate and then filter with those rules. Just throwing out a
    simple idea.<br>
    <br>
    Tim<br>
    <br>
    <br>
    Do you think if you moved to one level more abstract you would get
    too much? <br>
    <div class="moz-cite-prefix">On 08/06/2013 11:47 AM, Bohne,
      Jacqueline R wrote:<br>
    </div>
    <blockquote
cite="mid:739EEBD7287586409FB46BBDFEE89B624BCD61CD@MCL-EXMB02.mfldclin.org"
      type="cite">
      <div class="WordSection1">
        <p class="MsoNormal">We are trying to create a cTAKES process
          that will extract all symptoms from our documents.&nbsp; In our
          first attempt, we used the UMLS dictionary and pulled anything
          with a TUI of T184 (Sign or Symptom).&nbsp; While this worked, we
          found that when we compared it to what our Research
          Coordinators manually abstracted as symptoms, there were quite
          a few differences.&nbsp; When we looked into these differences we
          found a lot of the extra terms were considered either Findings
          (T033) or Disease or Syndrome (T047) in UMLS.&nbsp; We would rather
          not just add these TUIs to our NLP process because then we
          would end up with many more terms than just symptoms in our
          results.&nbsp;
          <o:p></o:p></p>
        <p class="MsoNormal"><o:p>&nbsp;</o:p></p>
        <p class="MsoNormal">Has anyone else tried to create a database
          of symptoms using NLP?&nbsp; Or are you aware of a better solution
          for creating a symptoms database?<o:p></o:p></p>
        <p class="MsoNormal"><o:p>&nbsp;</o:p></p>
        <p class="MsoNormal">Thank you for your time!<o:p></o:p></p>
        <p class="MsoNormal"><o:p>&nbsp;</o:p></p>
        <p class="MsoNormal">Thanks,<o:p></o:p></p>
        <p class="MsoNormal">Jacquie Bohne<o:p></o:p></p>
        <p class="MsoNormal">Research Programmer/Analyst<o:p></o:p></p>
        <p class="MsoNormal">Marshfield Clinic<o:p></o:p></p>
      </div>
    </blockquote>
    <br>
  </body>
</html>

--------------070107000808050306000004--