From: "Savova, Guergana"
To: "dev@ctakes.apache.org"
Subject: RE: Next cTAKES release (3.1)?
Date: Thu, 18 Jul 2013 14:24:05 +0000

Actually, MTsamples is what iDASH downloaded for their notes repository.

--Guergana

-----Original Message-----
From: andy mcmurry [mailto:mcmurry.andy@gmail.com]
Sent: Wednesday, July 03, 2013 7:26 PM
To: dev@ctakes.apache.org
Subject: Re: Next cTAKES release (3.1)?

Mtsamples has lots of free public examples already but we aren't using
them yet. This is probably because mtsamples don't have the annotations
we need to use them as training examples.

On Jul 3, 2013 2:46 PM, "Hephaestus Studio" wrote:
> @Andy - Not a doctor yet, but soon! Thanks for the promotion though,
> one more year!
>
> - Apropos meds or clinical type questions: any developer on here can
> feel free to shoot me a quick question via the list anytime. I'd be
> happy to confirm that a drug or anything else makes sense given a
> particular clinical/note context.
>
> - "I wonder if there is some way in which you could guide us in making
> better use of the medical knowledge sources (ontologies) that are
> available." - I'd be happy to brainstorm about using existing
> resources to help in decision making. We use these all the time in
> the clinic.
>
> @ Tim+Andy+Chen - I haven't had a chance to really start chewing into
> the code, though I hope to over the next year; so, what kind of
> examples would be most helpful?
> - Any particular disease processes?
> - Are you all familiar with the ubiquitous SOAP style presentation
> that doctors use to write free notes? The few examples I clicked
> through in the repository that Chen pointed me to are very sparse.
> Would we want gradations? E.g., a scale for "well done" notes to
> "very quick I-don't-care-because-I'm-in-a-rush" notes?
>
> @ Chen - Thank you for the kind words. It's nice to be welcomed by a
> community in which you hope to integrate. And thank you for pointing
> me to the directory with the current sample notes. This was very
> helpful in determining where those are in their development. I know
> that each of your hospitals has a wealth of HIPAA-closed notes, but
> I'll see what I can do to make some "stereotypical" open notes for
> common disease presentations. Again: maybe a scale, not necessarily
> just on brevity but some other metric, whose continuum represented
> various permutations of degrees of something, maybe of difficulty in
> processing?
> Apropos code, Chen: I will help where I can, but where I want to be
> is elbow deep in the code :)
>
> Finally: I haven't had a chance to look into some of the links from
> earlier in this thread regarding open access repositories of free text
> clinical notes: what do you all feel the quality of these resources
> is? Abundant but low quality? Paucity but those that are there are
> high quality?
>
> Bottom line: no problem either answering contextual questions (can
> afib be associated with a lower gi bleed??) and no problem writing
> some notes; the only question would be, before I put in any time: what
> disease/specialty domain? And would we want some system that put them
> on a continuum of some variable, say, brevity or "readability"?
>
> Just thinking before leaping,
>
> Thanks,
> JG
>
> Sent from my iPhone
>
> On Jul 2, 2013, at 21:23, "Chen, Pei" wrote:
>
> > Hi John,
> > Welcome! There are actually many ways to contribute and it's not
> > limited to just code. It's always great to hear new ideas and
> > suggestions on how to improve the software. Therefore, even things
> > like user feedback, documentation, new use cases, essentially
> > anything that will make things better would be awesome!
> >
> > To get started, I would suggest subscribing to the email lists. If
> > you would like to contribute anything, just create a Jira account
> > (anyone should be able to do this), and add/review Jira items (add
> > attachments if you like) and we can even help integrate it.
> >
> > We normally use Jira to keep track of issues:
> > [1] https://issues.apache.org/jira/browse/ctakes
> >
> > Current collection of sample test notes that have been collected
> > over the years:
> > https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-regression-test/testdata/input/plaintext/
> >
> > ________________________________________
> > From: Tim Miller [timothy.miller@childrens.harvard.edu]
> > Sent: Tuesday, July 02, 2013 6:31 PM
> > To: dev@ctakes.apache.org
> > Subject: Re: Next cTAKES release (3.1)?
> >
> > Agreed that you could definitely help out, and that would be a great
> > way to do so. We don't really have "examples" right now, more like
> > just short test sentences for showing simple results and verifying
> > that nothing has been broken by changes. I think regular length fake
> > but realistic notes would be very useful.
> > Tim
> >
> > On 07/02/2013 05:19 PM, John Green wrote:
> >> Hi all,
> >>
> >> I've been following this mailing list for a couple of months. I'm a
> >> third year medical student rounding the bend toward my MD. I used
> >> to be a computer programmer, however, and continue my own projects.
> >> I'm very interested in contributing eventually to cTakes
> >> development. In the meantime, given the current talk of examples,
> >> if any domain specific examples need to be generated, I am domain
> >> knowledgeable enough that I could pound out a few free text notes
> >> made to order.
> >>
> >> Let me know; you all may already have docs on hand willing to do
> >> this, but if not...
> >>
> >> John Green
> >>
> >> Sent from my iPhone
> >>
> >> On Jun 28, 2013, at 8:59, "Chen, Pei" wrote:
> >>
> >>> I completely agree with making cTAKES easier to use. I think it is
> >>> exciting to hear the different use cases here and to understand
> >>> where some of the areas that need improvement are (which we
> >>> haven't thought about earlier).
> >>> I think Tim's suggestions and the 3 concrete actionable items make
> >>> a lot of sense. Hopefully it should attract new users, adopters,
> >>> and perhaps more committers.
> >>>
> >>>> i) Make the typesystem forefront in documentation -- generate
> >>>> javadocs and have as a link on the ctakes frontpage/sidebar
> >>>> ii) Similar to the way that we are aiming to have tests in every
> >>>> module, also have clearly labeled examples in every module that
> >>>> set up a pipeline, run on sample notes (could be the same sample
> >>>> notes from the tests), and do something with the results.
> >>>> iii) Follow Giri's recommendation to have example training data
> >>>> for people who want to take the next step and train their own
> >>>> models
> >>> I think Java developers are accustomed to including a library as a
> >>> dependency/jar, having an API to pass input, and getting the
> >>> results via pojos; so the examples could initially shield the
> >>> complexity of wiring a pipeline together etc.
> >>> If we can improve the APIs and how it gets integrated with other
> >>> apps, we can add any GUI/CLI tools on top of this afterwards.
> >>>
> >>> --Pei
> >>>
> >>>> -----Original Message-----
> >>>> From: Miller, Timothy
> >>>> [mailto:Timothy.Miller@childrens.harvard.edu]
> >>>> Sent: Friday, June 28, 2013 8:00 AM
> >>>> To: dev@ctakes.apache.org
> >>>> Subject: Re: Next cTAKES release (3.1)?
> >>>>
> >>>> Very interesting discussion. I think Giri is right about giving
> >>>> example training data in the format that our training code can
> >>>> read. While our ultimate goal would be to build and release
> >>>> models that are completely domain-independent, in the real world
> >>>> it is almost always better to use some domain-specific data and
> >>>> we should think more about how to facilitate that.
> >>>>
> >>>> As for making it easier to get started, it is not totally clear
> >>>> to me what this means/how to do it so it might be useful to get
> >>>> specific about what this means. I think our biggest hurdle is
> >>>>
> >>>> 1) Prerequisite of understanding UIMA/UIMAFit
> >>>>
> >>>> Since UIMAFit is officially becoming part of UIMA that will be
> >>>> easier, and hopefully people will just learn the easier (in my
> >>>> opinion) UIMAFit way than the standard UIMA way of doing things.
> >>>> Is there something we can be doing to make understanding UIMA
> >>>> easier? Or do we just need to say upfront that this is a
> >>>> prerequisite and hope that people don't give up due to this thing
> >>>> that is out of our control?
> >>>>
> >>>> Another hurdle is:
> >>>>
> >>>> 2) cTAKES is a multi-purpose developer-aimed tool
> >>>>
> >>>> So it's not just a matter of hiding complexity -- at some point
> >>>> people have to understand their problem, understand cTAKES'
> >>>> capabilities, and start coding.
> >>>> Pei's GUI will help for some common use cases but will not remove
> >>>> the requirement that someone at the organization knows cTAKES.
> >>>> I think one part of this problem is the fact that the typesystem
> >>>> is not well documented. A developer needs to know what the output
> >>>> is (objects from the typesystem), how to get them (which
> >>>> modules/pipelines), and what information is in them.
> >>>> So maybe on this end my recommendation would be:
> >>>> i) Make the typesystem forefront in documentation -- generate
> >>>> javadocs and have as a link on the ctakes frontpage/sidebar
> >>>> ii) Similar to the way that we are aiming to have tests in every
> >>>> module, also have clearly labeled examples in every module that
> >>>> set up a pipeline, run on sample notes (could be the same sample
> >>>> notes from the tests), and do something with the results.
> >>>> iii) Follow Giri's recommendation to have example training data
> >>>> for people who want to take the next step and train their own
> >>>> models
> >>>>
> >>>> This is quite a bit of developer overhead, so it's worth asking
> >>>> whether you agree with my "diagnosis" and "treatment" or whether
> >>>> you think there are different problems/solutions that should be
> >>>> higher priority.
> >>>>
> >>>> Tim
> >>>>
> >>>> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
> >>>>> Hi Vijay and Andy,
> >>>>>
> >>>>> Thanks for sharing those examples.
> >>>>>
> >>>>> "Trouble is, privacy requires that these examples be made up by
> >>>>> hand"
> >>>>>
> >>>>> Agree with this statement and this is a very valid concern.
> >>>>>
> >>>>> In "getting started examples", I think we should just have a
> >>>>> couple of entries (5-10 small entries), not more than that (with
> >>>>> an explicit statement like "ONLY EXAMPLE, NOT GOOD FOR REAL
> >>>>> USAGE"). I understand handcrafting these may not be easy because
> >>>>> we are not medical domain experts, but I feel it is worth the
> >>>>> time, because it brings in more user community.
> >>>>>
> >>>>> Thank you,
> >>>>> Giri
> >>>>>
> >>>>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry wrote:
> >>>>>> GREAT !
> >>>>>>
> >>>>>> The i2b2 data though isn't publicly distributable, you still
> >>>>>> need to request access to it since it is "semi private"
> >>>>>>
> >>>>>> On Jun 27, 2013, at 9:52 PM, vijay garla wrote:
> >>>>>>
> >>>>>>> We released code on using cTAKES to annotate clinical text and
> >>>>>>> SVMs that use the annotations to classify clinical text from
> >>>>>>> the CMC 2007 and I2B2 2008 challenges:
> >>>>>>>
> >>>>>>> We did the CMC 2007 with cTAKES 2.5:
> >>>>>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
> >>>>>>>
> >>>>>>> And the i2b2 2008 with the version of cTAKES distributed with
> >>>>>>> the first version of ARC:
> >>>>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
> >>>>>>>
> >>>>>>> These are both publicly available datasets, and represent
> >>>>>>> real-world problems (in general I believe when publishing a
> >>>>>>> paper the code should be reproducible and made publicly
> >>>>>>> available, but that's a different issue).
> >>>>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would
> >>>>>>> like to upgrade these samples as well.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> VJ
> >>>>>>>
> >>>>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry wrote:
> >>>>>>>
> >>>>>>>> +1 suggestion for documenting many examples of "getting
> >>>>>>>> started" + NLP datasets.
> >>>>>>>>
> >>>>>>>> I have at least one we can use that was created by our lead
> >>>>>>>> Pathologist
> >>>>>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
> >>>>>>>> We should provide at least one sample for each domain.
> >>>>>>>> Trouble is, privacy requires that these examples be made up
> >>>>>>>> by hand and not copy-pasted from EMR systems.
> >>>>>>>>
> >>>>>>>> --Andy
> >>>>>>>>
> >>>>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari
> >>>>>>>> <girinambari@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> +1 for this observation Andy!
> >>>>>>>>>
> >>>>>>>>> Lowering the time will motivate users to write blogs about
> >>>>>>>>> features, how-tos, etc., which reduces the core team's
> >>>>>>>>> workload on documentation.
> >>>>>>>>>
> >>>>>>>>> I have been trying to write a small "how to write a
> >>>>>>>>> standalone client for ctakes" from my experience (I saw at
> >>>>>>>>> least 4 users post a similar question in the last 2
> >>>>>>>>> months), but I am not getting enough time because ctakes
> >>>>>>>>> depends on a lot of other frameworks (UimaFit, cleartk, UIMA
> >>>>>>>>> Framework, etc.). Most of my spare time is spent juggling
> >>>>>>>>> between these frameworks, posting and browsing those
> >>>>>>>>> forums, and relating observations to ctakes code. I think we
> >>>>>>>>> need to have some high level documentation about these
> >>>>>>>>> (with links to the corresponding forums).
> >>>>>>>>>
> >>>>>>>>> The above case is for developers (I think this will be the
> >>>>>>>>> larger user base as ctakes progresses); for users I think
> >>>>>>>>> the documentation is a lot better, though some improvements
> >>>>>>>>> need to be made.
> >>>>>>>>>
> >>>>>>>>> As a developer I found the lack of sample training data
> >>>>>>>>> tough (I am still struggling in this area even though I
> >>>>>>>>> browsed all the relevant code), though the training classes
> >>>>>>>>> are there. I understand that there are licensing issues
> >>>>>>>>> with REAL data, but at least some hand-made example
> >>>>>>>>> sentences, which may not be real, would help developers
> >>>>>>>>> understand the type/structure of input the TRAINING classes
> >>>>>>>>> expect.
> >>>>>>>>> This way people who browse the code can reverse engineer
> >>>>>>>>> and develop their own models. Sorry if you guys feel this
> >>>>>>>>> is a novice issue, but I feel most developers will be
> >>>>>>>>> novices when they adopt a system, and Machine Learning/NLP
> >>>>>>>>> is an ocean. Some documentation in this area will save a
> >>>>>>>>> lot of time for us.
> >>>>>>>>>
> >>>>>>>>> I wish there will be some activity in this area from the
> >>>>>>>>> ctakes core team.
> >>>>>>>>>
> >>>>>>>>> Thank you,
> >>>>>>>>> Giri
> >>>>>>>>>
> >>>>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry wrote:
> >>>>>>>>>
> >>>>>>>>>> ctakes is at a point where we have a LOT of features but
> >>>>>>>>>> it is still hard to get started.
> >>>>>>>>>>
> >>>>>>>>>> Judging from the mailing lists a lot of how cTakes works
> >>>>>>>>>> is not obvious and requires hand holding.
> >>>>>>>>>> This is very typical in early FOSS projects.
> >>>>>>>>>>
> >>>>>>>>>> Lowering the time to get invested in ctakes gets more
> >>>>>>>>>> users AND better bug reports, FAQ, etc.
> >>>>>>>>>>
> >>>>>>>>>> thoughts?
> >>>>>>>>>> --Andy
> >>>>>>>>>>
> >>>>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei"
> >>>>>>>>>> <Pei.Chen@childrens.harvard.edu> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>> I just wanted to gauge the interest in creating the next
> >>>>>>>>>>> release of cTAKES (3.1), which is currently marked for
> >>>>>>>>>>> May in Jira.
> >>>>>>>>>>> There have already been 22/53 issues [1] marked as fixed
> >>>>>>>>>>> or closed.
> >>>>>>>>>>> Plenty of bug fixes and new components including:
> >>>>>>>>>>> - New CEM Instance Template population
> >>>>>>>>>>> - New Dependency Parser/Semantic Role Labeler
> >>>>>>>>>>> - New optional Clear POSTagger
> >>>>>>>>>>> - New regression testing component
> >>>>>>>>>>>
> >>>>>>>>>>> Should we wait for the Temporal component?
> >>>>>>>>>>>
> >>>>>>>>>>> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES