ctakes-dev mailing list archives

From "Savova, Guergana" <Guergana.Sav...@childrens.harvard.edu>
Subject RE: Next cTAKES release (3.1)?
Date Thu, 18 Jul 2013 14:13:42 GMT
We have 5-6 clinical notes that we got from the web (=publicly available to anyone). We can
include them as samples in the 3.1 release. We have been using these notes for demo purposes.
--Guergana

-----Original Message-----
From: Andy McMurry [mailto:mcmurry.andy@gmail.com] 
Sent: Friday, June 28, 2013 10:15 AM
To: dev@ctakes.apache.org
Subject: Re: Next cTAKES release (3.1)?

iDash and others have medical NLP datasets that could be used for ctakes "Getting Started"
examples:
http://idash.ucsd.edu/nlp-and-data-modeling
http://idash.ucsd.edu/nlp/umls-vm

the GOOD: iDash already includes ctakes 
the BAD: iDash references old versions of ctakes and points to caBIG (which is now defunct)
 

Recommendation: we should talk to iDash, create "hello medical world" training examples, and
request that iDash point to the cTakes Apache home page.

Disclaimer: I'm not involved with iDash 

On Jun 27, 2013, at 10:58 PM, Girivaraprasad Nambari <girinambari@gmail.com> wrote:

> Hi Vijay and Andy,
> 
> Thanks for sharing those examples.
> 
> "Trouble is, privacy requires that these examples be made up by hand"
> 
> I agree with this statement; it is a very valid concern.
> 
> In "getting started examples", I think we should have just a couple of
> entries (5-10 small entries), not more than that (with an explicit
> statement like "ONLY EXAMPLE, NOT GOOD FOR REAL USAGE"). I understand
> handcrafting these may not be easy because we are not medical domain
> experts, but I feel it is worth the time, because it brings in a larger
> user community.
> 
> Thank you,
> Giri
> 
> 
> 
> 
> 
> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <mcmurry.andy@gmail.com> wrote:
> 
>> GREAT !
>> 
>> The i2b2 data isn't publicly distributable, though; you still need to
>> request access to it since it is "semi private"
>> 
>> 
>> On Jun 27, 2013, at 9:52 PM, vijay garla <vngarla@gmail.com> wrote:
>> 
>>> We released code on using cTAKES to annotate clinical text and SVMs
>>> that use the annotations to classify clinical text from the CMC 2007
>>> and I2B2 2008 challenges:
>>>
>>> We did the CMC 2007 with cTAKES 2.5:
>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
>>> <https://code.google.com/p/ytex/downloads/list>
>>> 
>>> 
>>> And the i2b2 2008 with the version of cTAKES distributed with the 
>>> first version of ARC:
>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>>> 
>>> These are both publicly available datasets, and represent real-world 
>>> problems (in general I believe when publishing a paper the code 
>>> should be reproducible and made publicly available, but that's a different issue).
>>> 
>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to 
>>> upgrade these samples as well.
>>> 
>>> Best,
>>> 
>>> VJ
>>> 
>>> 
>>> 
>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <mcmurry.andy@gmail.com> wrote:
>>> 
>>>> +1 suggestion for documenting many examples of "getting started" +NLP
>>>> datasets.
>>>> 
>>>> I have at least one we can use that was created by our lead Pathologist:
>>>>
>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>>> 
>>>> We should provide at least one sample for each domain.
>>>> Trouble is, privacy requires that these examples be made up by hand 
>>>> and not copy-pasted from EMR systems.
>>>> 
>>>> --Andy
>>>> 
>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <girinambari@gmail.com> wrote:
>>>> 
>>>>> +1 for this observation Andy!
>>>>> 
>>>>> Lowering time will motivate users to write blogs about features, how-tos,
>>>>> etc., which reduces the core team's workload on documentation.
>>>>> 
>>>>> I have been trying to write a small "how to write standalone client for
>>>>> ctakes" from my experience (I saw at least 4 users post similar questions
>>>>> in the last 2 months), but I am not getting enough time because ctakes
>>>>> depends on a lot of other frameworks (UimaFit, cleartk, UIMA Framework,
>>>>> etc.). Most of my spare time is being spent juggling between these
>>>>> frameworks, posting and browsing those forums, and relating observations
>>>>> to ctakes code. I think we need to have some high-level documentation
>>>>> about these (with links to the corresponding forums).
>>>>> 
>>>>> The above case is for developers (I think this will be the larger user
>>>>> base as ctakes progresses); for users I think the documentation is a lot
>>>>> better, though some improvements need to be made.
>>>>> 
>>>>> As a developer I found the lack of sample training data tough (I am still
>>>>> struggling in this area even though I browsed all the relevant code),
>>>>> though the training classes are there. I understand that there are
>>>>> licensing issues with REAL data, but at least some handmade example
>>>>> sentences, which may not be real, would help developers understand the
>>>>> type/structure of input the TRAINING classes expect. This way people who
>>>>> browse the code can reverse engineer and develop their own models. Sorry
>>>>> if you guys feel this is a novice issue, but I feel most developers will
>>>>> be novices when they adopt a system, and Machine Learning/NLP is an ocean.
>>>>> Some documentation in this area will save a lot of time for us.
>>>>> 
>>>>> I hope there will be some activity in this area from the ctakes core team.
>>>>> 
>>>>> Thank you,
>>>>> Giri
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <mcmurry.andy@gmail.com> wrote:
>>>>> 
>>>>>> ctakes is at a point where we have a LOT of features but it is still hard
>>>>>> to get started.
>>>>>>
>>>>>> Judging from the mailing lists a lot of how cTakes works is not obvious
>>>>>> and requires hand holding.
>>>>>> This is very typical in early FOSS projects.
>>>>>>
>>>>>> Lowering the time to get invested in ctakes gets more users AND better bug
>>>>>> reports, FAQ, etc.
>>>>>> 
>>>>>> thoughts?
>>>>>> --Andy
>>>>>> 
>>>>>> 
>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <Pei.Chen@childrens.harvard.edu> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> I just wanted to gauge the interest in creating the next release of
>>>>>>> cTAKES (3.1), which is currently marked for May in Jira.
>>>>>>>
>>>>>>> There have already been 22/53 issues [1] marked as fixed or closed.
>>>>>>> Plenty of bug fixes and new components including:
>>>>>>> - New CEM Instance Template population
>>>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>>>> - New optional Clear POSTagger
>>>>>>> - New regression testing component
>>>>>>> 
>>>>>>> Should we wait for the Temporal component?
>>>>>>> 
>>>>>>> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 

