ctakes-user mailing list archives

From Tim Miller <timothy.mil...@childrens.harvard.edu>
Subject Re: Concept annotation questions
Date Thu, 29 Aug 2013 17:07:58 GMT
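You may be able to use the JCasUtil class from uimaFIT to do something 
like the following:

for (IdentifiedAnnotation i : JCasUtil.select(jcas, IdentifiedAnnotation.class)) {
     // tokens whose offsets fall inside the span of annotation i
     List<BaseToken> tokens = JCasUtil.selectCovered(jcas, BaseToken.class, i);
}

(this is java-ish pseudocode obviously). Every token in the list you get 
back lies within the span of the IdentifiedAnnotation i, so each one can 
be given the same medical type as i. Would that solve your problem?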
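
To make the covering test concrete, here is a minimal, self-contained 
sketch in plain Java that does not depend on UIMA or uimaFIT. The Span 
class, the selectCovered method written here, and the "DiseaseDisorder" 
label are illustrative assumptions, not cTAKES or uimaFIT code; the sketch 
only mimics the idea of keeping tokens whose offsets fall entirely inside 
an annotation's begin/end offsets.

```java
import java.util.ArrayList;
import java.util.List;

public class CoveredTokens {
    /** A labelled text span: begin is inclusive, end is exclusive. */
    public static final class Span {
        public final int begin;
        public final int end;
        public final String label;

        public Span(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    /** Returns the tokens whose offsets lie entirely inside the covering span. */
    public static List<Span> selectCovered(List<Span> tokens, Span covering) {
        List<Span> covered = new ArrayList<>();
        for (Span token : tokens) {
            if (token.begin >= covering.begin && token.end <= covering.end) {
                covered.add(token);
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        // Character offsets into the text "cancer of colon, lung".
        List<Span> tokens = List.of(
            new Span(0, 6, "cancer"),
            new Span(7, 9, "of"),
            new Span(10, 15, "colon"),
            new Span(17, 21, "lung"));
        // Hypothetical concept annotation covering "cancer of colon".
        Span concept = new Span(0, 15, "DiseaseDisorder");
        for (Span token : selectCovered(tokens, concept)) {
            System.out.println(token.label + " -> " + concept.label);
        }
    }
}
```

Because the check works on offsets rather than on covered text, it avoids 
string comparisons between every token and every annotation.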

On 08/29/2013 12:29 PM, samir chabou wrote:
> Hi James and Pei,
> I also need to know the medical type (symptom, drug, procedure, 
> relation) of a given word token. In the type system hierarchy, 
> WordToken is not in the same inheritance tree as 
> IdentifiedAnnotation, so I’m currently iterating over all WordTokens 
> and comparing each WordToken’s covered text to the covered text of 
> the IdentifiedAnnotations. I found this a long process. James, do you 
> think the patch <<I could create a patch for you that would help with 
> determining which words from the text matched a dictionary entry >> 
> that you are planning to create will also cover this requirement? Or 
> can you suggest something better than what I’m currently doing?
> Thanks
> Samir
> ------------------------------------------------------------------------
> *From:* "Masanz, James J." <Masanz.James@mayo.edu>
> *To:* "'user@ctakes.apache.org'" <user@ctakes.apache.org>
> *Sent:* Thursday, August 29, 2013 10:18:40 AM
> *Subject:* RE: Concept annotation questions
> Hi Dennis,
> Thanks for explaining why you are interested in finding out which 
> words in the original text cause a particular concept to be 
> annotated.  We are currently working on getting Apache cTAKES 3.1 
> out.  Depending on your timeline, after that is done, perhaps I could 
> create a patch for you that would help with determining which words 
> from the text matched a dictionary entry, rather than just the begin 
> offset of the first word and the end offset of the last word.
> As for the chunking, the fact that “liver” and “and” are being tagged 
> as O-chunks explains why the dictionary lookup component is not 
> finding liver cancer or lung cancer in “cancer of colon, liver and 
> lung”.
> I’ll try that sentence with the latest chunker model (which will be in 
> cTAKES 3.1) and see if it assigns correct chunk tags for that sentence.
> -- James
> *From:*user-return-257-Masanz.James=mayo.edu@ctakes.apache.org 
> [mailto:user-return-257-Masanz.James=mayo.edu@ctakes.apache.org] *On 
> Behalf Of *Dennis Lee Hon Kit
> *Sent:* Wednesday, August 28, 2013 2:33 PM
> *To:* user@ctakes.apache.org
> *Subject:* Re: Concept annotation questions
> Hi James & Pei,
> Thank you for your replies and sorry for my late reply as I have been 
> away.
> Q1 – The longest span could work and is one of the options we are 
> looking at, but when there are overlaps it can get complicated.  In the 
> following example, the longest would work.  We can start with 01, 
> ignore 02 and 03 because their start positions fall before the end 
> position of 01, and then continue with 04.  But I don’t think it will 
> always be this straightforward, as the begin/end string positions may 
> not always be a good indicator of what exactly in the original text 
> was coded.
> *00 Invasive ductal carcinoma of the left breast with bone metastases.*
> 01 Invasive ductal carcinoma of the left breast 408643008|Infiltrating 
> duct carcinoma of breast (disorder)|
> 02 breast with bone             56873002|Bone structure of sternum 
> (body structure)|
> 03 breast with bone metastases 94297009|Secondary malignant neoplasm 
> of female breast (disorder)|
> 04 bone metastases  94222008|Secondary malignant neoplasm of bone 
> (disorder)|
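
The selection rule described for this example (keep 01, skip 02 and 03 
because they start before 01 ends, then continue with 04) can be sketched 
as a greedy left-to-right filter. This is only an illustration under 
stated assumptions: the Mention class and keepNonOverlapping method are 
hypothetical names, the offsets are character positions counted in 
sentence 00, and the rule is "leftmost start, longest first", which is one 
possible reading of "longest span" and not something cTAKES provides.

```java
import java.util.ArrayList;
import java.util.List;

public class LongestSpanFilter {
    /** A coded mention: begin is inclusive, end is exclusive. */
    public static final class Mention {
        public final int begin;
        public final int end;
        public final String code;

        public Mention(int begin, int end, String code) {
            this.begin = begin;
            this.end = end;
            this.code = code;
        }
    }

    /** Keeps a mention only if it starts at or after the end of the last kept one. */
    public static List<Mention> keepNonOverlapping(List<Mention> mentions) {
        List<Mention> sorted = new ArrayList<>(mentions);
        // Earliest start first; for equal starts, the longer span first.
        sorted.sort((a, b) -> a.begin != b.begin
            ? Integer.compare(a.begin, b.begin)
            : Integer.compare(b.end, a.end));
        List<Mention> kept = new ArrayList<>();
        int lastEnd = Integer.MIN_VALUE;
        for (Mention m : sorted) {
            if (m.begin >= lastEnd) {
                kept.add(m);
                lastEnd = m.end;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Mention> mentions = List.of(
            new Mention(0, 44, "408643008"),   // 01 Invasive ductal carcinoma of the left breast
            new Mention(38, 54, "56873002"),   // 02 breast with bone
            new Mention(38, 65, "94297009"),   // 03 breast with bone metastases
            new Mention(50, 65, "94222008"));  // 04 bone metastases
        for (Mention m : keepNonOverlapping(mentions)) {
            System.out.println(m.code);
        }
    }
}
```

With the four mentions above, the filter keeps 408643008 (01) and 
94222008 (04). As noted, begin/end offsets alone will not always reflect 
which words were actually matched, so this is a heuristic at best.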
> Q2 – As we are beginners, we are not at the level where we are 
> comfortable with modifying cTakes or even know where to begin 
> modifying cTakes but that would be an option in the future.  Going 
> back to the example of “cancer of liver” and using the begin/end 
> position of the string that was used to identify the concept, the 
> original string would be “cancer of colon, lung and liver.”  The CUI 
> that was identified was C0345904, which has 209 (137 unique) 
> descriptions for all languages.  Examples of English terms include:
>   * CA - Liver cancer
>   * Cancer of Liver
>   * cancer of the liver
>   * Cancer, Hepatic
>   * Malignant hepatic neoplasm
>   * Malignant liver tumor
>   * Malignant liver tumour
>   * Malignant neoplasm of liver
>   * malignant neoplasm of liver (diagnosis)
>   * Malignant neoplasm of liver unspecified
>   * Malignant neoplasm of liver unspecified (disorder)
>   * Malignant neoplasm of liver, not specified as primary or secondary
>   * Malignant neoplasm of liver, NOS
>   * Malignant neoplasm of liver, unspecified
>   * malignant neosplasm of the liver
>   * Malignant tumor of liver
>   * Malignant tumor of liver (disorder)
>   * Malignant tumour of liver
> It would seem suboptimal to go through each of the descriptions to try 
> and determine which was the UMLS term that was used in the coding.  It 
> is important for us to know which part of the string is matched 
> because something like “Invasive ductal carcinoma of the left breast” 
> will be matched to the SNOMED CT concept “408643008|Infiltrating duct 
> carcinoma of breast (disorder)|”, but we would like to know that 
> “left” was not matched and would like to post-coordinate the 
> expression to indicate the left breast, i.e.: 408643008|Infiltrating 
> duct carcinoma of breast (disorder)|:363698007|Finding site 
> (attribute)|=80248007|Left breast structure (body structure)|.  When 
> there are other qualifiers like severity, chronicity and episodicity 
> that may be ignored when matching, we would like to capture it at the 
> level of granularity specified in the original text.
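
As a small illustration of the post-coordinated expression quoted above, 
the following sketch assembles it from its three parts: focus concept, 
attribute, and value. The class and method names are hypothetical 
helpers; only the concept IDs, terms, and the 
"focus:attribute=value" layout are taken from the example in this 
message.

```java
public class PostCoordination {
    /** Formats a concept as "id|term|", as written in the example above. */
    public static String concept(String id, String term) {
        return id + "|" + term + "|";
    }

    /** Refines a focus concept with a single attribute=value pair. */
    public static String refine(String focus, String attribute, String value) {
        return focus + ":" + attribute + "=" + value;
    }

    public static void main(String[] args) {
        String focus = concept("408643008", "Infiltrating duct carcinoma of breast (disorder)");
        String attribute = concept("363698007", "Finding site (attribute)");
        String value = concept("80248007", "Left breast structure (body structure)");
        System.out.println(refine(focus, attribute, value));
    }
}
```

Knowing that “left” was not part of the dictionary match is what would 
tell a downstream step that such a refinement is needed at all.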
> In terms of the chunking, here is what I see for “cancer of colon, 
> lung and liver”:
>   * NP: cancer of colon, lung and liver
>   * PP: of
>   * NP: colon, lung and liver
> For “cancer of colon, liver and lung” here is what I see:
>   * NP: cancer of colon,
>   * PP: of
>   * NP: colon
>   * O: liver
>   * O: and
>   * NP: lung
> Q3 – To answer Pei’s question, we are not looking at the preferred 
> name from the UMLS, just which term was used.
> Regards,
> Dennis
> *From:*Chen, Pei <mailto:Pei.Chen@childrens.harvard.edu>
> *Sent:*Thursday, August 22, 2013 12:27 PM
> *To:*user@ctakes.apache.org <mailto:user@ctakes.apache.org>
> *Subject:*RE: Concept annotation questions
> Also,
> >3)… or the exact description that was returned in the UMLS?
> I presume you mean to save the preferred name from UMLS?  If so, this 
> seems to be a common request; see 
> https://issues.apache.org/jira/browse/CTAKES-224
> --Pei
> *From:*Masanz, James J. [mailto:Masanz.James@mayo.edu]
> *Sent:* Thursday, August 22, 2013 3:24 PM
> *To:* 'user@ctakes.apache.org'
> *Subject:* RE: Concept annotation questions
> Welcome to the cTAKES community.
> Q1 – some people use the longest span.
> Q2 & Q3 – can you just use the text from the dictionary, “Malignant 
> neoplasm of liver (disorder)”? Alternatively, you could modify cTAKES 
> to save the text of the words that it matches when it is performing 
> dictionary lookup. I would guess there is a term in the UMLS 
> dictionary with the same code as Malignant neoplasm of liver 
> (disorder) that just has the words “cancer of liver”, but there isn’t 
> anything in cTAKES to give that to you just through a configuration 
> change.
> For “*cancer of colon, liver and lung*”, can you look at the chunk 
> tag for liver?  If it’s in a separate noun phrase (NP) from “cancer of 
> colon”, that would account for why cancer is not getting tied to liver 
> in that case (but wouldn’t account for why the chunker is creating it 
> as a separate noun phrase).
> -- James
> *From:*user-return-248-Masanz.James=mayo.edu@ctakes.apache.org 
> <mailto:user-return-248-Masanz.James=mayo.edu@ctakes.apache.org> 
> [mailto:user-return-248-Masanz.James=mayo.edu@ctakes.apache.org] *On 
> Behalf Of *Dennis Lee Hon Kit
> *Sent:* Wednesday, August 21, 2013 1:10 PM
> *To:* user@ctakes.apache.org <mailto:user@ctakes.apache.org>
> *Subject:* Concept annotation questions
> Hi Everyone,
> We are new to cTakes so please bear with our questions.  We are using 
> cTakes to annotate things like encounter diagnoses and referral notes 
> and are especially interested in the SNOMED CT encodings.  But we 
> are not sure how to make sense of all the outputs.
> *Example #1*
> In the example below, “cancer of colon, lung and liver” has been 
> encoded with SNOMED CT, and additional concepts that do not apply have 
> been removed (e.g., the general “cancer” concept, lung, colon and liver 
> structures, etc.).  They have been plotted out by the begin/end 
> positions.  If the terms do not align, it’s probably because the 
> email only accepts plain text and a mono-spaced font is not the default.
> *cancer of colon, lung and liver*
> cancer of colon, lung and liver 93870000|Malignant neoplasm of liver 
> (disorder)|
> cancer of colon, lung 363358000|Malignant tumor of lung (disorder)|
> cancer of colon 363406005|Malignant tumor of colon (disorder)|
> Question (1) – We had to do quite a bit of post-processing to remove 
> inactive concepts, subtype concepts, concepts that are part of the 
> defining attributes, etc. Is there a set of guidelines to help sort 
> out the CUI or SNOMED CT codes that have been identified?
> Question (2) – How can we determine that “93870000|Malignant neoplasm 
> of liver (disorder)|” refers to “cancer of liver” as opposed to using 
> the begin/end string, which points to “cancer of colon, lung and 
> liver”?  Certainly we can try to do additional parsing but there are a 
> lot of different scenarios to take into account.
> Question (3) – This relates to question 2, are we able to identify the 
> original terms that were used for the concept matching or the exact 
> description that was returned in the UMLS?  While the CUI is helpful, 
> the CUI can refer to tens or even hundreds of descriptions.
> ------------------------------------------------------------------------
> *Example #2*
> Switching the position of colon, lung and liver can result in 
> different encodings.  Once again, after removing additional concepts 
> not needed (i.e., “cancer” and “colon structure”), we get the 
> following.  What happened to liver and lung cancer?
> *cancer of colon, liver and lung*
> cancer of colon 363406005|Malignant tumor of colon (disorder)|
>                            lung 39607008|Lung structure (body structure)|
> We have more questions but will start with these.  Thank you in advance.
> Regards,
> Dennis
