ctakes-dev mailing list archives

From "Masanz, James J." <Masanz.Ja...@mayo.edu>
Subject RE: Next cTAKES release (3.1)?
Date Tue, 02 Jul 2013 14:13:23 GMT
I agree with Tim's diagnosis and treatment plan.

-----Original Message-----
From: dev-return-1714-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1714-Masanz.James=mayo.edu@ctakes.apache.org]
On Behalf Of Chen, Pei
Sent: Friday, June 28, 2013 9:00 AM
To: dev@ctakes.apache.org
Subject: RE: Next cTAKES release (3.1)?

I completely agree with making cTAKES easier to use.  I think it is exciting to hear the different
use cases here and to understand where some of the areas that need improvement are (which
we haven't thought about earlier).
I think Tim's suggestions and the 3 concrete actionable items make a lot of sense.  Hopefully
they will attract new users, adopters, and perhaps more committers.

> i) Make the typesystem forefront in documentation -- generate javadocs and
> have as a link on the ctakes frontpage/sidebar
> ii) Similar to the way that we are aiming to have tests in every module, also
> have clearly labeled examples in every module that set up a pipeline, run on
> sample notes (could be the same sample notes from the tests), and do
> something with the results.
> iii) Follow Giri's recommendation to have example training data for people
> who want to take the next step and train their own models

I think Java developers are accustomed to including a library as a dependency/jar, having an
API to pass input, and getting the results via POJOs, so the examples could initially shield
the complexity of wiring a pipeline together, etc.
If we can improve the APIs and how they integrate with other apps, we can add any GUI/CLI
tools on top of this afterwards.

--Pei

> -----Original Message-----
> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
> Sent: Friday, June 28, 2013 8:00 AM
> To: dev@ctakes.apache.org
> Subject: Re: Next cTAKES release (3.1)?
> 
> Very interesting discussion. I think Giri is right about giving example training
> data in the format that our training code can read. While our ultimate goal
> would be to build and release models that are completely domain-
> independent, in the real world it is almost always better to use some
> domain-specific data and we should think more about how to facilitate that.
> 
> As for making it easier to get started, it is not totally clear to me what this
> means or how to do it, so it might be useful to get specific. I think our
> biggest hurdle is
> 
> 1) Prerequisite of understanding UIMA/UIMAFit
> 
> Since UIMAFit is officially becoming part of UIMA that will be easier, and
> hopefully people will just learn the easier (in my opinion) UIMAFit way than
> the standard UIMA way of doing things. Is there something we can be doing
> to make understanding UIMA easier? Or do we just need to say upfront that
> this is a prerequisite and hope that people don't give up due to this thing that
> is out of our control?
> 
> Another hurdle is:
> 
> 2) cTAKES is a multi-purpose developer-aimed tool
> 
> So it's not just a matter of hiding complexity -- at some point people have to
> understand their problem, understand cTAKES' capabilities, and start coding.
> Pei's GUI will help for some common use cases but will not remove the
> requirement that someone at the organization knows cTAKES.
> I think one part of this problem is the fact that the typesystem is not well
> documented. A developer needs to know what the output is (objects from
> the typesystem), how to get them (which modules/pipelines), and what
> information is in them. So maybe on this end my recommendation would be:
> i) Make the typesystem forefront in documentation -- generate javadocs and
> have as a link on the ctakes frontpage/sidebar
> ii) Similar to the way that we are aiming to have tests in every module, also
> have clearly labeled examples in every module that set up a pipeline, run on
> sample notes (could be the same sample notes from the tests), and do
> something with the results.
> iii) Follow Giri's recommendation to have example training data for people
> who want to take the next step and train their own models
> 
> This is quite a bit of developer overhead, so it's worth asking whether you
> agree with my "diagnosis" and "treatment" or whether you think there are
> different problems/solutions that should be higher priority.
> 
> Tim
> 
> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
> > Hi Vijay and Andy,
> >
> > Thanks for sharing those examples.
> >
> > "Trouble is, privacy requires that these examples be made up by hand"
> >
> > Agree with this statement and this is a very valid concern.
> >
> > In "getting started examples", I think we should just have a couple of
> > entries (5-10 small entries), not more than that (with an explicit
> > statement like "ONLY EXAMPLE, NOT GOOD FOR REAL USAGE"). I understand
> > handcrafting these may not be easy because we are not medical domain
> > experts, but I feel it is worth the time, because it brings in a larger
> > user community.
> >
> > Thank you,
> > Giri
> >
> >
> >
> >
> >
> > On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <mcmurry.andy@gmail.com> wrote:
> >
> >> GREAT !
> >>
> >> The i2b2 data though isn't publicly distributable, you still need to
> >> request access to it since it is "semi private"
> >>
> >>
> >> On Jun 27, 2013, at 9:52 PM, vijay garla <vngarla@gmail.com> wrote:
> >>
> >>> We released code on using cTAKES to annotate clinical text and SVMs
> >>> that use the annotations to classify clinical text from the CMC 2007
> >>> and I2B2 2008 challenges:
> >>>
> >>> We did the CMC 2007 with cTAKES 2.5:
> >>>
> >>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
> >>> <https://code.google.com/p/ytex/downloads/list>
> >>>
> >>> And the i2b2 2008 with the version of cTAKES distributed with the
> >>> first version of ARC:
> >>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
> >>>
> >>> These are both publicly available datasets, and represent real-world
> >>> problems (in general I believe when publishing a paper the code
> >>> should be reproducible and made publicly available, but that's a
> >>> different issue).
> >>>
> >>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to
> >>> upgrade these samples as well.
> >>>
> >>> Best,
> >>>
> >>> VJ
> >>>
> >>>
> >>>
> >>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <mcmurry.andy@gmail.com> wrote:
> >>>
> >>>> +1 suggestion for documenting many examples of "getting started" NLP
> >>>> datasets.
> >>>>
> >>>> I have at least one we can use that was created by our lead
> >>>> Pathologist
> >>>>
> >>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
> >>>> We should provide at least one sample for each domain.
> >>>> Trouble is, privacy requires that these examples be made up by hand
> >>>> and not copy-pasted from EMR systems.
> >>>>
> >>>> --Andy
> >>>>
> >>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <girinambari@gmail.com> wrote:
> >>>>
> >>>>> +1 for this observation Andy!
> >>>>>
> >>>>> Lowering the time to get started will motivate users to write blogs
> >>>>> about features, how-tos, etc., which reduces the core team's workload
> >>>>> on documentation.
> >>>>>
> >>>>> I have been trying to write a small "how to write a standalone
> >>>>> client for ctakes" guide from my experience (I saw at least 4 users
> >>>>> post similar questions in the last 2 months), but I am not finding
> >>>>> enough time: because ctakes depends on a lot of other frameworks
> >>>>> (UimaFit, cleartk, UIMA Framework, etc.), most of my spare time is
> >>>>> spent juggling between these frameworks, posting and browsing those
> >>>>> forums, and relating observations to ctakes code. I think we need
> >>>>> some high-level documentation about these (with links to the
> >>>>> corresponding forums).
> >>>>>
> >>>>> The above case is for developers (I think this will be the larger
> >>>>> user base as ctakes progresses); for users I think the documentation
> >>>>> is a lot better, though some improvements are still needed.
> >>>>>
> >>>>> As a developer I found the lack of sample training data tough (I am
> >>>>> still struggling in this area even though I browsed all the relevant
> >>>>> code), though the training classes are there. I understand that there
> >>>>> are licensing issues with REAL data, but we could at least have some
> >>>>> hand-made example sentences, which may not be real but would help
> >>>>> developers understand the type/structure of input the TRAINING
> >>>>> classes expect. This way people who browse the code can reverse
> >>>>> engineer and develop their own models. Sorry if you guys feel this is
> >>>>> a novice issue, but I feel most developers will be novices when they
> >>>>> adopt a system, and Machine Learning/NLP is an ocean. Some
> >>>>> documentation in this area will save a lot of time for us.
> >>>>>
> >>>>> I hope there will be some activity in this area from the ctakes core
> >>>>> team.
> >>>>>
> >>>>> Thank you,
> >>>>> Giri
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <mcmurry.andy@gmail.com> wrote:
> >>>>>
> >>>>>> ctakes is at a point where we have a LOT of features but it is
> >>>>>> still hard to get started.
> >>>>>>
> >>>>>> Judging from the mailing lists, a lot of how cTakes works is not
> >>>>>> obvious and requires hand holding.
> >>>>>> This is very typical in early FOSS projects.
> >>>>>>
> >>>>>> Lowering the time to get invested in ctakes gets more users AND
> >>>>>> better bug reports, FAQ, etc.
> >>>>>>
> >>>>>> thoughts?
> >>>>>> --Andy
> >>>>>>
> >>>>>>
> >>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <Pei.Chen@childrens.harvard.edu> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>> I just wanted to gauge the interest in creating the next release
> >>>>>>> of cTAKES (3.1), which is currently marked for May in Jira.
> >>>>>>> There have already been 22/53 issues [1] marked as fixed or closed.
> >>>>>>> Plenty of bug fixes and new components including:
> >>>>>>> - New CEM Instance Template population
> >>>>>>> - New Dependency Parser/Semantic Role Labeler
> >>>>>>> - New optional Clear POSTagger
> >>>>>>> - New regression testing component
> >>>>>>>
> >>>>>>> Should we wait for the Temporal component?
> >>>>>>>
> >>>>>>> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
> >>>>>>
> >>>>
> >>

