ctakes-user mailing list archives

From samir chabou <samir...@yahoo.com>
Subject Re: Concept annotation questions and keep JCas results in a file
Date Sat, 07 Sep 2013 21:48:07 GMT
Many thanks, Pei, that is helpful to know.
Samir




________________________________
 From: Pei Chen <chenpei@apache.org>
To: user@ctakes.apache.org; samir chabou <samirchb@yahoo.com> 
Sent: Saturday, September 7, 2013 11:38:11 AM
Subject: Re: Concept annotation questions and keep JCas results in a file
 


Samir,
XCAS will eventually be deprecated and replaced by the preferred, more compact XMI format, as this source comment notes:

/*
 *******************************************************************************************
 * N O T E :     The XML format (XCAS) that this Cas Consumer outputs, is eventually
 *               being superseded by the more standardized and compact XMI format.  However
 *               it is used currently as the expected form for remote services, and there is
 *               existing tooling for doing stand-alone component development and debugging
 *               that uses this format to populate an initial CAS.  So it is not
 *               deprecated yet;  it is also being kept for compatibility with older versions.
 *
 *               New code should consider using the XmiWriterCasConsumer where possible,
 *               which uses the current XMI format for XML externalizations of the CAS
 *******************************************************************************************
 */




On Fri, Sep 6, 2013 at 11:34 PM, samir chabou <samirchb@yahoo.com> wrote:

>Hi Richard,
>I had a look at these methods, and they will let me implement my requirement. Do you know
>whether there is a preference for using readXCas/writeXCas over readXmi/writeXmi, or is it
>just a matter of offering different ways to read from and write to different file formats?
>Thanks
>Samir
>
>
>
>
>
>
>________________________________
> From: Richard Eckart de Castilho <rec@apache.org>
>To: user@ctakes.apache.org; samir chabou <samirchb@yahoo.com> 
>Sent: Friday, September 6, 2013 3:29:19 AM
>Subject: Re: Concept annotation questions and keep JCas results in a file
> 
>
>Hi,
>
>you might want to take a look at convenience methods in the recently
>released Apache uimaFIT 2.0.0:
>
>CasIOUtil
>  readXCas(JCas, File)
>  readXmi(JCas, File)
>  writeXCas(JCas, File)
>  writeXmi(JCas, File)
>
>Cheers,
>
>-- Richard
>
>On 06.09.2013, at 06:28, samir chabou <samirchb@yahoo.com> wrote:
>
>> Hi Tim, Pei and James
>> 1) I tried List l = JCasUtil.selectCovered(jcas, BaseToken.class, i) and it meets my
>> requirement perfectly, thanks Tim.
>> 2) Now I need to run NLP on a medical question using the clinical pipeline, and I need to
>> keep the JCas result in a file or some other persistent store because I need to use it
>> later in my processing. Is it possible to do this, and is it possible to reload this JCas
>> later in my processing?
>> 
>> Thanks 
>> Samir
>> From: samir chabou <samirchb@yahoo.com>
>> To: "user@ctakes.apache.org" <user@ctakes.apache.org> 
>> Sent: Thursday, August 29, 2013 2:51:12 PM
>> Subject: Re: Concept annotation questions
>> 
>> Thanks Tim,
>> That looks like a better and cleaner way. So List l = JCasUtil.selectCovered(jcas,
>> BaseToken.class, i) will give me the intersection between the BaseTokens and the
>> IdentifiedAnnotations: if my base token is in the list, then it also belongs to that
>> IdentifiedAnnotation. I'll give it a try some time next week and let you know.
>> Thanks 
>> Samir
>> 
>> 
>> From: Tim Miller <timothy.miller@childrens.harvard.edu>
>> To: user@ctakes.apache.org 
>> Sent: Thursday, August 29, 2013 1:07:58 PM
>> Subject: Re: Concept annotation questions
>> 
>> Samir,
>> You may be able to use the JCasUtil class from uimaFIT to do something like the following:
>> 
>> for each IdentifiedAnnotation i:
>>     List l = JCasUtil.selectCovered(jcas, BaseToken.class, i)
>> 
>> 
>> (This is Java-ish pseudocode, obviously.) The tokens in the list you get back will then
>> all have the same type as the IdentifiedAnnotation i.
>> Would that solve your problem?
>> Tim
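
A minimal stand-alone sketch of the covered-by selection that the pseudocode above relies on, for readers without a UIMA setup at hand. It assumes plain character offsets; the `CoveredTokens` and `Span` classes and the `AnatomicalSite` label are hypothetical stand-ins for illustration, while the real call in uimaFIT is `JCasUtil.selectCovered(jcas, BaseToken.class, i)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-alone illustration of "covered by" selection; not uimaFIT code. */
class CoveredTokens {

    /** Minimal stand-in for an annotation: offsets [begin, end) and a label. */
    static final class Span {
        final int begin;
        final int end;
        final String label;

        Span(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    /** Returns the tokens that lie entirely inside the covering span. */
    static List<Span> selectCovered(List<Span> tokens, Span covering) {
        List<Span> covered = new ArrayList<>();
        for (Span t : tokens) {
            if (t.begin >= covering.begin && t.end <= covering.end) {
                covered.add(t);
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        // Tokens of "cancer of colon" with their character offsets.
        List<Span> tokens = Arrays.asList(
                new Span(0, 6, "cancer"),
                new Span(7, 9, "of"),
                new Span(10, 15, "colon"));
        // Hypothetical annotation over "colon" only, with an assumed type label.
        Span annotation = new Span(10, 15, "AnatomicalSite");
        for (Span t : selectCovered(tokens, annotation)) {
            System.out.println(t.label + " -> " + annotation.label);
        }
    }
}
```

Each token returned for an annotation can then be tagged with that annotation's type, which avoids comparing covered-text strings pair by pair.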
>> 
>> On 08/29/2013 12:29 PM, samir chabou wrote:
>>> Hi James and Pei,
>>> I also need to know the medical type (symptom, drug, procedure, relation) of a given
>>> word token. In the type system hierarchy, WordToken is not in the same inheritance tree
>>> as IdentifiedAnnotation, so I am currently iterating over all WordTokens and comparing
>>> each WordToken's covered text with the covered text of the IdentifiedAnnotations. I find
>>> this a slow process. James, do you think the patch you are planning to create (<<I could
>>> create a patch for you that would help with determining which words from the text matched
>>> a dictionary entry>>) would also cover this requirement? Or can you suggest something
>>> better than what I am currently doing?
>>>  
>>> Thanks
>>> Samir  
>>> 
>>> From: "Masanz, James J." <Masanz.James@mayo.edu>
>>> To: "'user@ctakes.apache.org'" <user@ctakes.apache.org> 
>>> Sent: Thursday, August 29, 2013 10:18:40 AM
>>> Subject: RE: Concept annotation questions
>>> 
>>> Hi Dennis,
>>>  
>>> Thanks for explaining why you are interested in finding out which words in the original
>>> text cause a particular concept to be annotated.  We are currently working on getting
>>> Apache cTAKES 3.1 out.  Depending on your timeline, after that is done, perhaps I could
>>> create a patch for you that would help with determining which words from the text matched
>>> a dictionary entry, rather than just the begin offset of the first word and the end
>>> offset of the last word.
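
The difference between the enclosing span and the individual matched words can be sketched as follows. This is only an illustration under stated assumptions, not cTAKES code and not the patch mentioned above: the `MatchedWords` and `Token` classes are hypothetical, and matching is a simple case-insensitive word comparison within one lookup window.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not cTAKES code: reports which tokens matched a dictionary term. */
class MatchedWords {

    /** A token with character offsets [begin, end) and its text. */
    static final class Token {
        final int begin;
        final int end;
        final String text;

        Token(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /**
     * For each word of the term, finds the first unused token in the window with the same
     * text (ignoring case). Returns the matched tokens in text order, or an empty list if
     * any word of the term is missing from the window.
     */
    static List<Token> match(List<Token> window, List<String> termWords) {
        List<Token> matched = new ArrayList<>();
        for (String word : termWords) {
            Token hit = null;
            for (Token t : window) {
                if (!matched.contains(t) && t.text.equalsIgnoreCase(word)) {
                    hit = t;
                    break;
                }
            }
            if (hit == null) {
                return new ArrayList<>();
            }
            matched.add(hit);
        }
        matched.sort((a, b) -> Integer.compare(a.begin, b.begin));
        return matched;
    }

    /** Joins the texts of the matched tokens with single spaces. */
    static String joinText(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.text);
        }
        return sb.toString();
    }
}
```

For the window "cancer of colon, lung and liver" and the term words "cancer", "of", "liver", the enclosing span runs from the begin offset of "cancer" to the end offset of "liver", which is the whole phrase, while `joinText` over the matched tokens recovers only "cancer of liver".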
>>>  
>>> As far as the chunking goes, the fact that “liver” and “and” are being tagged as
>>> O-chunks explains why the dictionary lookup component is not finding liver cancer or lung
>>> cancer in “cancer of colon, liver and lung”.
>>>  
>>> I’ll try that sentence with the latest chunker model (which will be in cTAKES
3.1) and see if it assigns correct chunk tags for that sentence.
>>>  
>>> -- James
>>>  
>>> From: user-return-257-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-257-Masanz.James=mayo.edu@ctakes.apache.org]
On Behalf Of Dennis Lee Hon Kit
>>> Sent: Wednesday, August 28, 2013 2:33 PM
>>> To: user@ctakes.apache.org
>>> Subject: Re: Concept annotation questions
>>>  
>>> Hi James & Pei,
>>>  
>>> Thank you for your replies and sorry for my late reply as I have been away.
>>>  
>>> Q1 – The longest span could work and is one of the options we are looking at, but when
>>> there are overlaps it can get complicated.  In the following example, the longest span
>>> would work: we can start with 01, ignore 02 and 03 because their start positions fall
>>> before the end position of 01, and then continue with 04.  But I don’t think it will
>>> always be this straightforward, as the begin/end string positions may not always be a
>>> good indicator of exactly what in the original text was coded.
>>>  
>>> 00 Invasive ductal carcinoma of the left breast with bone metastases.
>>> 01 Invasive ductal carcinoma of the left breast                        408643008|Infiltrating duct carcinoma of breast (disorder)|
>>> 02                                       breast with bone              56873002|Bone structure of sternum (body structure)|
>>> 03                                       breast with bone metastases   94297009|Secondary malignant neoplasm of female breast (disorder)|
>>> 04                                                   bone metastases   94222008|Secondary malignant neoplasm of bone (disorder)|
>>>  
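
The selection described for rows 01 to 04 can be sketched as a greedy pass over the spans. This is a minimal sketch under assumptions, not a cTAKES feature: the `LongestSpanFilter` and `Span` names are hypothetical, offsets are character positions in sentence 00, and ties on the start position are broken in favour of the longer span.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of cTAKES: keeps non-overlapping spans, longest first on ties. */
class LongestSpanFilter {

    /** A coded span: character offsets [begin, end) plus the assigned code. */
    static final class Span {
        final int begin;
        final int end;
        final String code;

        Span(int begin, int end, String code) {
            this.begin = begin;
            this.end = end;
            this.code = code;
        }

        int length() {
            return end - begin;
        }
    }

    /**
     * Sorts by begin offset (longer span first when begins are equal), then keeps every
     * span that starts at or after the end of the previously kept span.
     */
    static List<Span> keepLongest(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort((a, b) -> a.begin != b.begin
                ? Integer.compare(a.begin, b.begin)
                : Integer.compare(b.length(), a.length()));
        List<Span> kept = new ArrayList<>();
        int lastEnd = Integer.MIN_VALUE;
        for (Span s : sorted) {
            if (s.begin >= lastEnd) {
                kept.add(s);
                lastEnd = s.end;
            }
        }
        return kept;
    }
}
```

With the four rows above as input (offsets 0-44, 38-54, 38-65 and 50-65), only 01 and 04 survive. Note that this pass favours the earliest-starting span, so a short early span can displace a longer one that overlaps it, which is one way the approach stops being straightforward.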
>>> Q2 – As beginners, we are not yet comfortable modifying cTAKES, or even sure where to
>>> begin, but that would be an option in the future.  Going back to the example of “cancer
>>> of liver”: using the begin/end position of the string that was used to identify the
>>> concept, the original string would be “cancer of colon, lung and liver.”  The CUI that
>>> was identified was C0345904, which has 209 (137 unique) descriptions across all
>>> languages.  Examples of English terms include:
>>>     • CA - Liver cancer
>>>     • Cancer of Liver
>>>     • cancer of the liver
>>>     • Cancer, Hepatic
>>>     • CANCER, HEPATOCELLULAR
>>>     • Malignant hepatic neoplasm
>>>     • Malignant liver tumor
>>>     • Malignant liver tumour
>>>     • Malignant neoplasm of liver
>>>     • malignant neoplasm of liver (diagnosis)
>>>     • Malignant neoplasm of liver unspecified
>>>     • Malignant neoplasm of liver unspecified (disorder)
>>>     • Malignant neoplasm of liver, not specified as primary or secondary
>>>     • Malignant neoplasm of liver, NOS
>>>     • Malignant neoplasm of liver, unspecified
>>>     • malignant neosplasm of the liver
>>>     • Malignant tumor of liver
>>>     • Malignant tumor of liver (disorder)
>>>     • Malignant tumour of liver
>>> It would seem suboptimal to go through each of the descriptions to try to determine
>>> which UMLS term was used in the coding.  It is important for us to know which part of the
>>> string is matched, because something like “Invasive ductal carcinoma of the left breast”
>>> will be matched to the SNOMED CT concept “408643008|Infiltrating duct carcinoma of breast
>>> (disorder)|”, but we would like to know that “left” was not matched so that we can
>>> post-coordinate the expression to indicate the left breast, i.e.:
>>> 408643008|Infiltrating duct carcinoma of breast (disorder)|:363698007|Finding site (attribute)|=80248007|Left breast structure (body structure)|
>>> When other qualifiers such as severity, chronicity and episodicity are ignored during
>>> matching, we would like to capture them at the level of granularity specified in the
>>> original text.
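
Assembling the post-coordinated expression quoted above is plain string composition once the unmatched qualifier is known. The sketch below only reproduces the notation used in this message; the `PostCoordinatedExpression` class and its methods are hypothetical and do not come from cTAKES or from any SNOMED CT library.

```java
/** Hypothetical helper for illustration; not part of cTAKES or any SNOMED CT library. */
class PostCoordinatedExpression {

    /** Renders one concept as id|term|, the notation used in this thread. */
    static String concept(String id, String term) {
        return id + "|" + term + "|";
    }

    /** Joins a focus concept with a single attribute=value refinement. */
    static String refine(String focus, String attribute, String value) {
        return focus + ":" + attribute + "=" + value;
    }

    public static void main(String[] args) {
        String expression = refine(
                concept("408643008", "Infiltrating duct carcinoma of breast (disorder)"),
                concept("363698007", "Finding site (attribute)"),
                concept("80248007", "Left breast structure (body structure)"));
        System.out.println(expression);
    }
}
```

The harder part, which this sketch does not address, is deciding that “left” was left unmatched in the first place.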
>>>  
>>> In terms of the chunking, here is what I see for “cancer of colon, lung and
liver”:
>>>     • NP: cancer of colon, lung and liver
>>>     • PP: of
>>>     • NP: colon, lung and liver
>>> For “cancer of colon, liver and lung” here is what I see:
>>>     • NP: cancer of colon,
>>>     • PP: of
>>>     • NP: colon
>>>     • O: liver
>>>     • O: and
>>>     • NP: lung
>>> Q3 – To answer Pei’s question, we are not looking at the preferred name from
the UMLS, just which term was used.
>>>  
>>> Regards,
>>> Dennis
>>>  
>>> From: Chen, Pei
>>> Sent: Thursday, August 22, 2013 12:27 PM
>>> To: user@ctakes.apache.org
>>> Subject: RE: Concept annotation questions
>>>  
>>> Also,
>>> > 3)… or the exact description that was returned in the UMLS?
>>> I presume you mean to save the preferred name from UMLS?  If so, this seems
to be a common request- see: https://issues.apache.org/jira/browse/CTAKES-224
>>>  
>>> --Pei
>>>  
>>> From: Masanz, James J. [mailto:Masanz.James@mayo.edu] 
>>> Sent: Thursday, August 22, 2013 3:24 PM
>>> To: 'user@ctakes.apache.org'
>>> Subject: RE: Concept annotation questions
>>>  
>>>  
>>> Welcome to the cTAKES community.
>>>  
>>> Q1 – Some people use the longest span.
>>> Q2 & Q3 – Can you just use the text from the dictionary, “Malignant neoplasm of liver
>>> (disorder)”?  Alternatively, you could modify cTAKES to save the text of the words that
>>> it matches when it performs dictionary lookup.  I would guess there is a term in the UMLS
>>> dictionary with the same code as Malignant neoplasm of liver (disorder) that has just the
>>> words “cancer of liver”, but there isn’t anything in cTAKES that gives you that through a
>>> configuration change alone.
>>>  
>>> For “cancer of colon, liver and lung”, can you look at the chunk tag for liver?  If
>>> it’s in a separate noun phrase (NP) from “cancer of colon”, that would account for why
>>> cancer is not getting tied to liver in that case (but it wouldn’t account for why the
>>> chunker is creating it as a separate noun phrase).
>>>  
>>> -- James
>>>  
>>> From: user-return-248-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-248-Masanz.James=mayo.edu@ctakes.apache.org]
On Behalf Of Dennis Lee Hon Kit
>>> Sent: Wednesday, August 21, 2013 1:10 PM
>>> To: user@ctakes.apache.org
>>> Subject: Concept annotation questions
>>>  
>>> Hi Everyone,
>>>  
>>> We are new to cTAKES, so please bear with our questions.  We are using cTAKES to
>>> annotate things like encounter diagnoses and referral notes, and are especially
>>> interested in the SNOMED CT encodings.  But we are not sure how to make sense of all the
>>> outputs.
>>>  
>>> Example #1
>>>  
>>> In the example below, “cancer of colon, lung and liver” has been encoded with SNOMED
>>> CT, and additional concepts that do not apply have been removed (e.g., the general
>>> “cancer” concept, and the lung, colon and liver structures).  The results have been
>>> plotted out by their begin/end positions.  If the terms do not align, it’s probably
>>> because the email only accepts plain text and a mono-spaced font is not the default.
>>>  
>>> cancer of colon, lung and liver
>>> cancer of colon, lung and liver   93870000|Malignant neoplasm of liver (disorder)|
>>> cancer of colon, lung             363358000|Malignant tumor of lung (disorder)|
>>> cancer of colon                   363406005|Malignant tumor of colon (disorder)|
>>>  
>>> Question (1) – We had to do quite a bit of post-processing to remove inactive concepts,
>>> subtype concepts, concepts that are part of the defining attributes, etc.  Is there a set
>>> of guidelines to help sort out the CUI or SNOMED CT codes that have been identified?
>>> Question (2) – How can we determine that “93870000|Malignant neoplasm of
liver (disorder)|” refers to “cancer of liver” as opposed to using the begin/end string,
which points to “cancer of colon, lung and liver”?  Certainly we can try to do additional
parsing but there are a lot of different scenarios to take into account.
>>> Question (3) – This relates to question 2, are we able to identify the original
terms that were used for the concept matching or the exact description that was returned in
the UMLS?  While the CUI is helpful, the CUI can refer to tens or even hundreds of descriptions.
>>>  
>>> Example #2
>>>  
>>> Switching the position of colon, lung and liver can result in different encodings.
>>> Once again, after removing additional concepts not needed (i.e., “cancer” and “colon
>>> structure”), we get the following.  What happened to liver and lung cancer?
>>>  
>>> cancer of colon, liver and lung
>>> cancer of colon                   363406005|Malignant tumor of colon (disorder)|
>>>                            lung   39607008|Lung structure (body structure)|
>>>  
>>> We have more questions but will start with these.  Thank you in advance.
>>>  
>>> Regards,
>>> Dennis
>>> 
>>> 
>> 
>> 
>> 
>> 
>> 
>
>
>