Mailing-List: contact dev-help@ctakes.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@ctakes.apache.org
Subject: Re: Combining Knowledge- and Data-driven Methods for
 De-identification of Clinical Narratives
To: dev@ctakes.apache.org
References: 
 <CAPqz87osK7XfwK-fn_tox9hhCzfNG_Ew3mJ-SF0jPF-6rR0K+A@mail.gmail.com>
 <CAPMhGj5kE62JT_kaRhKd1te-0kDtEafxUcvK15VgAFkO72Wk0w@mail.gmail.com>
 <569CB8AF.8030603@averbis.com> <569CF158.7060205@averbis.com>
 <630BB0DF-D699-4E77-A78D-47CADE4534F2@wiredinformatics.com>
 <56A9DBEF.7040704@averbis.com>
 <CAPqz87r4Lwn1XF3nO3SjcAD-5Hq0x68MfBmHttotbSd5ObimyA@mail.gmail.com>
 <56AA2C60.3070604@averbis.com> <56AB322A.60904@averbis.com>
 <56B065C0.90000@averbis.com> <56B1B4AD.4050807@averbis.com>
 <56E172FA.2010705@averbis.com>
 <CAPMhGj5qmfAow5-h9LQXzhi5hOzywdX1bDy_U0CUyOPKyocSyg@mail.gmail.com>
 <56E1D199.3040607@averbis.com>
 <CAPMhGj6z711=pN2nBags64gvz4jiT8H4YV80RCnNvbQybe=jqQ@mail.gmail.com>
 <1837bda172a04603a44a32c4027d6c3c@CHEXMAIL4A.CHBOSTON.ORG>
 <CAPMhGj5PB7MjM44fRGVkpzjczy3jwexavXgcVo9JW45_LTZHwg@mail.gmail.com>
 <1d4284c80fd247608a13b655abae7976@CHEXMAIL4A.CHBOSTON.ORG>
 <CAGAUMatXaMCBBvGyeSx7JHB3+1K3PNciwghjMUM8HHMqnKm1Gg@mail.gmail.com>
From: =?UTF-8?Q?Peter_Kl=c3=bcgl?= <peter.kluegl@averbis.com>
Message-ID: <56E29ED2.9070106@averbis.com>
Date: Fri, 11 Mar 2016 11:32:50 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101
 Thunderbird/38.6.0
MIME-Version: 1.0
In-Reply-To: 
 <CAGAUMatXaMCBBvGyeSx7JHB3+1K3PNciwghjMUM8HHMqnKm1Gg@mail.gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Hi,

Thanks for the notes and links, Andy and Guergana. The software and
articles are very interesting, but as for my personal interest, we have
our own clinical de-identification software at our company
(which works well enough as far as I know). My focus is rather on
helping to translate the contribution from GATE/JAPE to UIMA/Ruta.
Thus, I am concentrating on the existing functionality for now.

What is the final goal of the cTAKES community concerning clinical deid
components? Will both sandbox projects be merged? And what about
statistical approaches?

@Pei: there was a problem with the patch again (I also forgot to add
some files). I have attached a new one.

@Azad: I am just curious which data exactly the rules rely on. I think
I'll find the information in the article.
I assume that the 521 documents were used to develop the rules
and the 269 documents to evaluate them. Did you also correct the rules
using the second set? I need to reread the article :-)

Best,

Peter


On 10.03.2016 at 23:22, andy mcmurry wrote:
> For cross-validation, you can evaluate de-identified notes data from
> the i2b2 challenge:
> https://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-scrubber-deid/data/models/
>
> *Methods for model generation of FeatureSet described here: *
>
> *Improved de-identification of physician notes through integrative modeling
> of both public and private medical text*
> http://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-13-112
>
> Major objective of that study was to help provide external examples to
> cross train / retrain other methods.
>
> hope this helps,
> --Andy
>
>
>
> On Thu, Mar 10, 2016 at 1:27 PM, Savova, Guergana <
> Guergana.Savova@childrens.harvard.edu> wrote:
>
>> You can re-build the models that feed into MIST. I personally would not
>> use the default model that MIST comes with as it is not trained on clinical
>> data. In our previous work we found that hand-annotating about 200 docs for
>> PHI (representative of the sample you are going to run the models on)
>> results in building a pretty good model - in the 90's for p, r and f1.
>> However, even with that high performance, the institution that owns the
>> data might still be reluctant to share as it might pose a violation of
>> HIPAA through some potential PHI leaks. In cTAKES our approach has been to
>> de-couple the de-identification from the NLP/information extraction. If a
>> user has the need for de-identified data, they could choose their method --
>> manual or otherwise -- and then process through cTAKES. Our focus is the
>> NLP/IE space, while de-identification is a blend of that plus policy....
>>
>> --Guergana
>>
>> -----Original Message-----
>> From: Azad Dehghan [mailto:azad.dehghan@gmail.com]
>> Sent: Thursday, March 10, 2016 4:19 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Combining Knowledge- and Data-driven Methods for
>> De-identification of Clinical Narratives
>>
>> Thanks Guergana.
>>
>>> Yes, the current release of cTAKES has a module for the temporal
>>> expressions which includes dates. The normalizer for the temporal
>>> expressions is Steven Bethard's timenorm code.
>> Great.
>>
>>> However, if you do de-identification of dates/temporal expressions,
>>> you run the risk of creating incorrect timelines as many of the relative
>>> temporal expressions (e.g. spring of this year, x-mas time, etc.) are
>>> unlikely to be correctly shifted by any de-identification tool.
>> Indeed, a reason I have not included the dates component.
>>
>>> One de-identification tool is MIST --
>>> http://mist-deid.sourceforge.net/
>> I don't remember them doing well in the community-held evaluation in 2014.
>> Hence, cDeid :)
>>> Guergana Savova, PhD, FACMI
>>> Associate Professor
>>> PI Natural Language Processing Lab
>>> Boston Children's Hospital and Harvard Medical School
>>> 300 Longwood Avenue
>>> Mailstop: BCH3092
>>> Enders 144.1
>>> Boston, MA 02115
>>> Tel: (617) 919-2972
>>> Fax: (617) 730-0817
>>> Harvard Scholar:
>>> http://scholar.harvard.edu/guergana_k_savova/biocv
>>>
>>> -----Original Message-----
>>> From: Azad Dehghan [mailto:azad.dehghan@gmail.com]
>>> Sent: Thursday, March 10, 2016 3:42 PM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: Combining Knowledge- and Data-driven Methods for
>> De-identification of Clinical Narratives
>>>> This means both training data folders? I have access to the data but
>>>> not to the challenge description.
>>>
>>> Yes. Is there any specific information that you are missing?
>>>>
>>>>> It would be good to incorporate/refactor (basically, GATE API needs
>>>>> to be replaced with UIMA API to generate annotation) the two-pass
>>>>> recognition method for cTAKES - which has a wider application on
>>>>> longitudinal data.
>>>>> This method is used on top of a number of NERs.
>>>>
>>>> I'll take a look.
>>>>
>>>> I do not know how much time I can invest this month. Let's see how
>>>> many phases I can translate.
>>>> I added the rules for age. Are there JAPE rules for creating date
>>>> annotations?
>>> No. I believe cTAKES has existing component(s) to capture dates?
>>>
>>>> After all rules are translated, they need some major refactoring.
>>>> JAPE and Ruta are quite different in some aspects.
>>> Ok.
>>>
>>>>
>>>>
>>>>
>>>>
>>>>> Please let me know where I can help. I will be available again in
>> April.
>>>>> Cheers,
>>>>> Azad
>>>>>
>>>>> On 10 March 2016 at 13:13, Peter Klügl <peter.kluegl@averbis.com>
>> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> sorry, I was quite busy last month.
>>>>>>
>>>>>> I added a new patch, which needs to be applied.
>>>>>>
>>>>>> No new rules, but it's possible now to evaluate everything against
>>>>>> the labelled data of the challenge.
>>>>>>
>>>>>> @Azad:
>>>>>> Which documents exactly did you use to develop the rules?
>>>>>> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or
>>>>>> testing-PHI-Gold-fixed?
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> On 03.02.2016 at 09:05, Peter Klügl wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> the last patch fixed almost all problems.
>>>>>>>
>>>>>>> I added another one that adds the csv file for the unit test and
>>>>>>> extends svn-ignore.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>> On 02.02.2016 at 09:16, Peter Klügl wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I added another patch. I forgot to manually add one test file to
>>>>>>>> version control, and there are still duplicate lines.
>>>>>>>> I hope this patch fixes the remaining problems.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>>
>>>>>>>> On 29.01.2016 at 10:34, Peter Klügl wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> The problems were caused by the svn client in my Eclipse. Sorry for
>>>>>>>>> the trouble, I should have looked more closely at the complete patch.
>>>>>>>>>
>>>>>>>>> I attached a new patch created with command-line tools, which looks
>>>>>>>>> correct now.
>>>>>>>>>
>>>>>>>>> Pei, can you apply the new patch?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> On 28.01.2016 at 15:57, Peter Klügl wrote:
>>>>>>>>>> Thanks Pei.
>>>>>>>>>>
>>>>>>>>>> I fear there was again a problem with the patch. All new files
>>>>>>>>>> are missing (and also the svn-ignore settings).
>>>>>>>>>>
>>>>>>>>>> Can you take a look?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>> On 28.01.2016 at 14:43, Pei Chen wrote:
>>>>>>>>>>> patch applied.
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Pei
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <
>>>>>> peter.kluegl@averbis.com> wrote:
>>>>>>>>>>>> Hi Pei,
>>>>>>>>>>>>
>>>>>>>>>>>> can you commit the recent patch for us?
>>>>>>>>>>>>
>>>>>>>>>>>> CTAKES-384-20160120.patch
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> Peter
>>>>>>>>>>>>
>>>>>>>>>>>> On 20.01.2016 at 19:35, Pei Chen wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> Sorry I was swamped recently.
>>>>>>>>>>>>> But yeah, we can even create an extended type system to store
>>>>>>>>>>>>> these items temporarily and add them into the main/core type
>>>>>>>>>>>>> system afterwards.
>>>>>>>>>>>>> There was an existing item to upgrade UIMA, but agreed, it will
>>>>>>>>>>>>> require much more testing. If it works, we can upgrade it in our
>>>>>>>>>>>>> sandbox area or create a branch if necessary.
>>>>>>>>>>>>> —Pei
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <
>>>>>> peter.kluegl@averbis.com> wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> a new patch is attached.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Pei:
>>>>>>>>>>>>>> are there suitable annotation types in the cTAKES type system?
>>>>>>>>>>>>>> Some project in cTAKES uses something like OntologyMatch... I
>>>>>>>>>>>>>> map it to IdentifiedAnnotation right now, but there are many
>>>>>>>>>>>>>> empty features...
>>>>>>>>>>>>>> @Azad:
>>>>>>>>>>>>>> I changed the rules a bit, especially the capitalization, to
>>>>>>>>>>>>>> match how I normally use it in Ruta. The wordlists are compiled
>>>>>>>>>>>>>> to a trie by the maven plugin. I also added the two regexes for
>>>>>>>>>>>>>> url and email, and extended the regex for the url. I also changed
>>>>>>>>>>>>>> the evaluation order of some rules (with @). Feel free to add
>>>>>>>>>>>>>> simple examples to examples.csv for the unit tests.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let me know if you need more information about the changes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you want help with the other rule sets? Or should we
>>>>>>>>>>>>>> split them up?
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 18.01.2016 at 11:04, Peter Klügl wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Great. I will integrate them in the project and in the
>>>>>>>>>>>>>>> next patch.
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 18.01.2016 at 00:58, Azad Dehghan wrote:
>>>>>>>>>>>>>>>> Three NERs translated and uploaded.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> PS. I will validate all NERs once we have them all
>>>>>>>>>>>>>>>> completed.
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <
>>>>>> azad.dehghan@gmail.com> wrote:
>>>>>>>>>>>>>>>>> This is on my todo list for Dec. as well. If there are
>>>>>>>>>>>>>>>>> any more volunteers for translating JAPE to RUTA, please
>>>>>>>>>>>>>>>>> get in touch.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl"
>>>>>>>>>>>>>>>>> <peter.kluegl@averbis.com
>>>>>> wrote:
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I just wanted to mention that I haven't forgotten about it.
>>>>>>>>>>>>>>>>>> Unfortunately, there is just no spare time right now. I hope
>>>>>>>>>>>>>>>>>> I will be able to provide the patches in December.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 06.11.2015 at 16:40, Pei Chen wrote:
>>>>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>>>>> I think the ctakes-examples is probably a good starting
>>>>>>>>>>>>>>>>>>> point at least in terms of maven modules, etc. I think
>>>>>>>>>>>>>>>>>>> it would be good if we use uimaFIT style as the primary
>>>>>>>>>>>>>>>>>>> approach to wiring components together and generate
>>>>>>>>>>>>>>>>>>> desc's as secondary...
>>>>>>>>>>>>>>>>>>> I think the actual components that would be required
>>>>>>>>>>>>>>>>>>> are probably best left up to what is actually required
>>>>>>>>>>>>>>>>>>> for the best-performing c-deid. The output would be
>>>>>>>>>>>>>>>>>>> interesting; I'm not sure if we should treat this as an
>>>>>>>>>>>>>>>>>>> independent preprocessing component or part of a
>>>>>>>>>>>>>>>>>>> pipeline (in which case, we may need to propose a change
>>>>>>>>>>>>>>>>>>> to the type system or perhaps an alternative JCas view.
>>>>>>>>>>>>>>>>>>> You can probably open up that discussion to the dev
>>>>>>>>>>>>>>>>>>> group as you see fit.)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> My 2 cents...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
>>>>>> peter.kluegl@averbis.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an example
>>>>>>>>>>>>>>>>>>>> of how the cTAKES community develops or what a project
>>>>>>>>>>>>>>>>>>>> should look like?
>>>>>>>>>>>>>>>>>>>> I learned that different people set up UIMA projects in
>>>>>>>>>>>>>>>>>>>> quite different manners and I do not want to get
>>>>>>>>>>>>>>>>>>>> inspired by "some sort of out-dated" approach in the
>>>>>>>>>>>>>>>>>>>> cTAKES repo.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Are there restrictions or preferences about the
>>>>>>>>>>>>>>>>>>>> preprocessing components that should be used and the
>>>>>>>>>>>>>>>>>>>> kind of "output" of the project?
>>>>>>>>>>>>>>>>>>>> Components: On which components may the components
>>>>>>>>>>>>>>>>>>>> rely: tokenizer, ... parser, ... dict lookup?
>>>>>>>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a
>>>>>>>>>>>>>>>>>>>> single AE?
>>>>>>>>>>>>>>>>>>>> More comments below.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 03.11.2015 at 16:54, Azad Dehghan wrote:
>>>>>>>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to
>>>>>>>>>>>>>>>>>>>>>> avoid duplicate work and to coordinate the
>>>>>>>>>>>>>>>>>>>>>> efforts ...
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I would like to help with translating JAPE to RUTA.
>>>>>>>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta Workbench
>>>>>>>>>>>>>>>>>>>> if you want, or wait until I set up the project with
>>>>>>>>>>>>>>>>>>>> Ruta integration.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If any questions arise, just ask :-)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Is there a development dataset which was utilized
>>>>>>>>>>>>>>>>>>>>>> for the initial development, and if yes, is it
>>>>>>>>>>>>>>>>>>>>>> possible to contribute it too?
>>>>>>>>>>>>>>>>>>>>> The data set is unfortunately not publicly
>>>>>>>>>>>>>>>>>>>>> available; i2b2
>>>>>>>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php>
>>>>>>>>>>>>>>>>>>>>> typically releases the data sets 12 months after a
>>>>>>>>>>>>>>>>>>>>> given challenge; this is done on an individual basis
>>>>>>>>>>>>>>>>>>>>> and involves a Data Use Agreement.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate
>>>>>>>>>>>>>>>>>>>>> the validation.
>>>>>>>>>>>>>>>>>>>> Ok, I'll investigate whether we already have access to
>>>>>>>>>>>>>>>>>>>> the dataset here.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> My first step would be:
>>>>>>>>>>>>>>>>>>>>>> - set up a maven project
>>>>>>>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with
>>>>>>>>>>>>>>>>>>>>>>   cTAKES components replacing the previous ANNIE
>>>>>>>>>>>>>>>>>>>>>>   preprocessing)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd-party
>>>>>>>>>>>>>>>>>>>>>> lib jars that were included, to ensure
>>>>>>>>>>>>>>>>>>>>>> compatibility. I'll be sure to take a look at that
>>>>>>>>>>>>>>>>>>>>>> over the next few weeks.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> —Pei
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced there
>>>>>>>>>>>>>>>>>>>>> should not be a need to worry about the 3rd-party
>>>>>>>>>>>>>>>>>>>>> libs.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Also, just a thought: we may want to create an
>>>>>>>>>>>>>>>>>>>>> independent component for the Two Pass recognition
>>>>>>>>>>>>>>>>>>>>> (TwoPass.java) as this method has proven useful for
>>>>>>>>>>>>>>>>>>>>> general NER on longitudinal data and is surely useful
>>>>>>>>>>>>>>>>>>>>> independent of the deid component.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>>>>>>>