From: "Dennis Lee Hon Kit"
Delivered-To: mailing list user@ctakes.apache.org
Subject: Re: Concept annotation questions
Date: Tue, 10 Sep 2013 10:19:12 -0700

Hi James,

Thank you for your email. We are currently using cTAKES 3.0 but will upgrade to whichever version you issue the patch for. Thank you for taking the time out of your busy schedule to work on the patch.

Regards,
Dennis

From: Masanz, James J.
Sent: Monday, September 09, 2013 7:44 AM
To: user@ctakes.apache.org
Subject: RE: Concept annotation questions

Which version of cTAKES are you using or planning to use?

cTAKES 3.1 has been approved, and once the apache.org infrastructure team completes some administrative tasks, the process of updating the Apache mirrors with 3.1 should start.

I want to target this patch first at the release that will be most useful for you.

From: user-return-267-Masanz.James=mayo.edu@ctakes.apache.org On Behalf Of Dennis Lee Hon Kit
Sent: Friday, August 30, 2013 1:11 AM
To: user@ctakes.apache.org
Subject: Re: Concept annotation questions

Hi James,

Thank you for your reply.

If you could create the patch for identifying the words used in the matching that would be great.
We understand you have other priorities and will wait until you have time to do it.

Thank you for logging the issue with the incorrect chunking as well.

Regards,
Dennis

-----Original Message-----
From: Masanz, James J.
Sent: Thursday, August 29, 2013 8:38 AM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions

I created JIRA issue CTAKES-231 for this, as the code in trunk and in the cTAKES 3.1 branch also gets the chunking wrong.
https://issues.apache.org/jira/browse/CTAKES-231

Thanks,
-- James

From: user-return-258-Masanz.James=mayo.edu@ctakes.apache.org On Behalf Of Masanz, James J.
Sent: Thursday, August 29, 2013 9:19 AM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions

Hi Dennis,

Thanks for explaining why you are interested in finding out which words in the original text cause a particular concept to be annotated. We are currently working on getting Apache cTAKES 3.1 out. Depending on your timeline, after that is done, perhaps I could create a patch for you that would help with determining which words from the text matched a dictionary entry, rather than just the begin offset of the first word and the end offset of the last word.

As far as the chunking, the fact that “liver” and “and” are being tagged as O-chunks explains why the dictionary lookup component is not finding liver cancer or lung cancer in “cancer of colon, liver and lung”.

I’ll try that sentence with the latest chunker model (which will be in cTAKES 3.1) and see if it assigns correct chunk tags for that sentence.
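To make the O-chunk explanation concrete, here is a minimal Python sketch. It is not cTAKES code: it only assumes, as the explanation above suggests, that dictionary lookup combines words that fall inside the same noun-phrase (NP) window. The chunk lists are the ones Dennis reports in his 28 August message quoted below, and `words`, `np_windows` and `can_pair` are hypothetical helper names.

```python
# Chunk output as reported in this thread: (chunk text, chunk tag).
# Illustrative data only; not produced by running cTAKES here.
CHUNKS_LUNG_AND_LIVER = [
    ("cancer of colon, lung and liver", "NP"),
    ("of", "PP"),
    ("colon, lung and liver", "NP"),
]
CHUNKS_LIVER_AND_LUNG = [
    ("cancer of colon,", "NP"),
    ("of", "PP"),
    ("colon", "NP"),
    ("liver", "O"),
    ("and", "O"),
    ("lung", "NP"),
]


def words(text):
    """Lower-case the words of a chunk and strip trailing punctuation."""
    return [w.strip(",.").lower() for w in text.split() if w.strip(",.")]


def np_windows(chunks):
    """Word lists of the NP chunks, assumed here to be the only lookup windows."""
    return [words(text) for text, tag in chunks if tag == "NP"]


def can_pair(chunks, a, b):
    """True if words a and b occur together in at least one NP window."""
    return any(a in window and b in window for window in np_windows(chunks))


print(can_pair(CHUNKS_LUNG_AND_LIVER, "cancer", "liver"))  # shared NP window
print(can_pair(CHUNKS_LIVER_AND_LUNG, "cancer", "liver"))  # "liver" is an O-chunk
print(can_pair(CHUNKS_LIVER_AND_LUNG, "cancer", "lung"))   # separate NP windows
```

Under that assumption, “cancer” and “liver” can only be combined in the first sentence, which matches the annotations reported for the two examples.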
-- James

From: user-return-257-Masanz.James=mayo.edu@ctakes.apache.org On Behalf Of Dennis Lee Hon Kit
Sent: Wednesday, August 28, 2013 2:33 PM
To: user@ctakes.apache.org
Subject: Re: Concept annotation questions

Hi James & Pei,

Thank you for your replies, and sorry for my late reply as I have been away.

Q1 – The longest span could work and is one of the options we are looking at, but when there are overlaps it can get complicated. In the following example, the longest would work. We can start with 01, ignore 02 and 03 because their start positions fall before the end position of 01, and then continue with 04. But I don’t think it will always be this straightforward, as the begin/end string positions may not always be a good indicator of what exactly in the original text was coded.

00 Invasive ductal carcinoma of the left breast with bone metastases.
01 Invasive ductal carcinoma of the left breast                       408643008|Infiltrating duct carcinoma of breast (disorder)|
02                                       breast with bone             56873002|Bone structure of sternum (body structure)|
03                                       breast with bone metastases  94297009|Secondary malignant neoplasm of female breast (disorder)|
04                                                   bone metastases  94222008|Secondary malignant neoplasm of bone (disorder)|

Q2 – As we are beginners, we are not yet comfortable modifying cTAKES, or even sure where to begin, but that would be an option in the future. Going back to the example of “cancer of liver” and using the begin/end position of the string that was used to identify the concept, the original string would be “cancer of colon, lung and liver.” The CUI that was identified was C0345904, which has 209 (137 unique) descriptions for all languages.
Examples of English terms include:
• CA - Liver cancer
• Cancer of Liver
• cancer of the liver
• Cancer, Hepatic
• CANCER, HEPATOCELLULAR
• Malignant hepatic neoplasm
• Malignant liver tumor
• Malignant liver tumour
• Malignant neoplasm of liver
• malignant neoplasm of liver (diagnosis)
• Malignant neoplasm of liver unspecified
• Malignant neoplasm of liver unspecified (disorder)
• Malignant neoplasm of liver, not specified as primary or secondary
• Malignant neoplasm of liver, NOS
• Malignant neoplasm of liver, unspecified
• malignant neosplasm of the liver
• Malignant tumor of liver
• Malignant tumor of liver (disorder)
• Malignant tumour of liver

It would seem suboptimal to go through each of the descriptions to try to determine which UMLS term was used in the coding. It is important for us to know which part of the string is matched, because something like “Invasive ductal carcinoma of the left breast” will be matched to the SNOMED CT concept “408643008|Infiltrating duct carcinoma of breast (disorder)|”, but we would like to know that “left” was not matched so that we can post-coordinate the expression to indicate the left breast, i.e.: 408643008|Infiltrating duct carcinoma of breast (disorder)|:363698007|Finding site (attribute)|=80248007|Left breast structure (body structure)|. When there are other qualifiers, like severity, chronicity and episodicity, that may be ignored when matching, we would like to capture them at the level of granularity specified in the original text.
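The post-coordination step described above could be sketched as follows in Python. This is only an illustration under stated assumptions: cTAKES 3.0 does not report the matched words (that is what the requested patch is about), so the `matched` list below is hypothetical input, and `STOP_WORDS`, `LATERALITY`, `unmatched_words` and `post_coordinate` are made-up names. The SNOMED CT codes are the ones quoted in this message.

```python
# Words that carry no clinical meaning on their own (illustrative list only).
STOP_WORDS = {"of", "the", "and", "with"}

# Illustrative, breast-specific mapping from a leftover qualifier to a finding site.
LATERALITY = {"left": "80248007|Left breast structure (body structure)|"}


def unmatched_words(span_text, matched_words):
    """Words in the annotated span that the dictionary match did not use."""
    matched = {w.lower() for w in matched_words}
    return [
        w
        for w in span_text.lower().split()
        if w not in matched and w not in STOP_WORDS
    ]


def post_coordinate(focus, attribute, value):
    """Join a focus concept and one attribute=value refinement."""
    return f"{focus}:{attribute}={value}"


span = "Invasive ductal carcinoma of the left breast"
# Hypothetical: the words a patched dictionary lookup might report as matched.
matched = ["Invasive", "ductal", "carcinoma", "of", "breast"]

leftover = unmatched_words(span, matched)
expression = post_coordinate(
    "408643008|Infiltrating duct carcinoma of breast (disorder)|",
    "363698007|Finding site (attribute)|",
    LATERALITY[leftover[0]],
)
print(leftover)
print(expression)
```

Comparing the span against the concept's preferred name instead of the matched words would not work here, since “Invasive ductal” and “Infiltrating duct” differ as strings; that is why the matched words themselves are needed.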
In terms of the chunking, here is what I see for “cancer of colon, lung and liver”:
• NP: cancer of colon, lung and liver
• PP: of
• NP: colon, lung and liver

For “cancer of colon, liver and lung” here is what I see:
• NP: cancer of colon,
• PP: of
• NP: colon
• O: liver
• O: and
• NP: lung

Q3 – To answer Pei’s question, we are not looking at the preferred name from the UMLS, just which term was used.

Regards,
Dennis

From: Chen, Pei
Sent: Thursday, August 22, 2013 12:27 PM
To: user@ctakes.apache.org
Subject: RE: Concept annotation questions

Also,
> 3)… or the exact description that was returned in the UMLS?
I presume you mean to save the preferred name from UMLS? If so, this seems to be a common request; see: https://issues.apache.org/jira/browse/CTAKES-224

--Pei

From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
Sent: Thursday, August 22, 2013 3:24 PM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions

Welcome to the cTAKES community.

Q1 – Some people use the longest span.
Q2 & Q3 – Can you just use the text from the dictionary, “Malignant neoplasm of liver (disorder)”? Alternatively, you could modify cTAKES to save the text of the words that it matches when it is performing dictionary lookup. I would guess there is a term in the UMLS dictionary with the same code as Malignant neoplasm of liver (disorder) that just has the words “cancer of liver”, but there isn’t anything in cTAKES to give that to you just through a configuration change.

For “cancer of colon, liver and lung”, can you look at the chunk tag for liver?
If it’s in a separate noun phrase (NP) from “cancer of colon”, that would account for why cancer is not getting tied to liver in that case (but wouldn’t account for why the chunker is creating it as a separate noun phrase).

-- James

From: user-return-248-Masanz.James=mayo.edu@ctakes.apache.org On Behalf Of Dennis Lee Hon Kit
Sent: Wednesday, August 21, 2013 1:10 PM
To: user@ctakes.apache.org
Subject: Concept annotation questions

Hi Everyone,

We are new to cTAKES, so please bear with our questions. We are using cTAKES to annotate things like encounter diagnoses and referral notes and are especially interested in the SNOMED CT encodings. But we are not sure how to make sense of all the outputs.

Example #1

In the example below, “cancer of colon, lung and liver” has been encoded with SNOMED CT, and additional concepts that do not apply have been removed (e.g., the general “cancer” concept, lung, colon and liver structures, etc.). They have been plotted out by the begin/end positions. If the terms do not align, it’s probably because the email only accepts plain text and a mono-spaced font is not the default.

cancer of colon, lung and liver
cancer of colon, lung and liver   93870000|Malignant neoplasm of liver (disorder)|
cancer of colon, lung             363358000|Malignant tumor of lung (disorder)|
cancer of colon                   363406005|Malignant tumor of colon (disorder)|

Question (1) – We had to do quite a bit of post-processing to remove inactive concepts, subtype concepts, concepts that are part of the defining attributes, etc. Is there a set of guidelines to help sort out the CUI or SNOMED CT codes that have been identified?
Question (2) – How can we determine that “93870000|Malignant neoplasm of liver (disorder)|” refers to “cancer of liver”, as opposed to using the begin/end string, which points to “cancer of colon, lung and liver”? Certainly we can try to do additional parsing, but there are a lot of different scenarios to take into account.

Question (3) – This relates to question 2: are we able to identify the original terms that were used for the concept matching, or the exact description that was returned in the UMLS? While the CUI is helpful, the CUI can refer to tens or even hundreds of descriptions.

________________________________________
Example #2

Switching the position of colon, lung and liver can result in different encodings. Once again, after removing additional concepts not needed (i.e., “cancer” and “colon structure”), we get the following. What happened to liver and lung cancer?

cancer of colon, liver and lung
cancer of colon                   363406005|Malignant tumor of colon (disorder)|
                           lung   39607008|Lung structure (body structure)|

We have more questions but will start with these. Thank you in advance.

Regards,
Dennis
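The longest-span selection discussed in the replies above (start with 01, skip 02 and 03 because they begin before 01 ends, continue with 04) can be sketched in Python as a greedy left-to-right pass. This is a minimal illustration of that procedure, not cTAKES behaviour; the function names are hypothetical, and the offsets are recomputed from the example sentence rather than taken from real annotator output.

```python
TEXT = "Invasive ductal carcinoma of the left breast with bone metastases."

# (covered text, SNOMED CT code) for annotations 01-04 from the thread.
CANDIDATES = [
    ("Invasive ductal carcinoma of the left breast", "408643008"),
    ("breast with bone", "56873002"),
    ("breast with bone metastases", "94297009"),
    ("bone metastases", "94222008"),
]


def with_offsets(text, candidates):
    """Turn (phrase, code) pairs into (begin, end, code) using the first occurrence."""
    spans = []
    for phrase, code in candidates:
        begin = text.find(phrase)
        spans.append((begin, begin + len(phrase), code))
    return spans


def keep_non_overlapping(spans):
    """Greedy pass: sort by begin offset (longer span first on ties) and keep a
    span only if it starts at or after the end of the last span kept."""
    kept = []
    for begin, end, code in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        if not kept or begin >= kept[-1][1]:
            kept.append((begin, end, code))
    return kept


selected = keep_non_overlapping(with_offsets(TEXT, CANDIDATES))
print([code for _, _, code in selected])
```

As Dennis points out, this only resolves overlaps by position; it says nothing about which words inside a kept span were actually matched, so qualifiers such as “left” still need separate handling.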