ctakes-dev mailing list archives

From "Koola, Jejo David" <jejo.d.ko...@vanderbilt.edu>
Subject Re: sentence detector model
Date Mon, 29 Sep 2014 20:49:11 GMT
How about this idea for the training/test set:

1) Start with a document with NO newlines. Perhaps just the entire document is a single paragraph.

2) Then, any sentence detector should be able to parse it correctly.
3) Then, deterministically add newlines to the document: some after punctuation; some after a word; some after a sentence fragment.
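The three steps above could be sketched like this (a hypothetical generator for illustration; the function name, punctuation rule, and word interval are invented, not part of cTAKES):

```python
def add_newlines(text, every_n_words=7):
    """Deterministically re-wrap a single-paragraph document: insert a
    newline after sentence-final punctuation, and also after every Nth
    word, so some breaks land mid-sentence (after a word or a fragment).
    The original punctuation positions remain the gold sentence breaks."""
    pieces = []
    for i, token in enumerate(text.split()):
        pieces.append(token)
        if token[-1] in '.!?':
            pieces.append('\n')            # break after punctuation
        elif (i + 1) % every_n_words == 0:
            pieces.append('\n')            # break mid-sentence
        else:
            pieces.append(' ')
    return ''.join(pieces).rstrip()
```

Because the breaks are deterministic, the same document can be re-wrapped with different intervals to get a mix of newline styles while keeping the gold annotation fixed.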


On Sep 29, 2014, at 3:43 PM, Chen, Pei <Pei.Chen@childrens.harvard.edu> wrote:

> Assuming we have a representative training set, are there any objections if we default
> cTAKES to this SentenceAnnotator + Model?
> For the upcoming release:
> - Consolidate the existing sentence detector and the ytex sentence detector into this new one?

> - Allow a config parameter to still allow an override of a hard break on newline chars.
> That way, we won't have to maintain multiple sentence annotators and it'll be less
> confusing for new users...
> --Pei 
>> -----Original Message-----
>> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
>> Sent: Monday, September 29, 2014 2:47 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: sentence detector model
>> That does sound like it would be useful since MIMIC does have both kinds of
>> linebreak styles in different notes. If I did some annotations on such a
>> dataset would it be re-distributable, say on the physionet website? I believe
>> the ShARe project has a download site there (it is a layer of annotations on
>> MIMIC). Another option would be you posting your raw data there and I
>> could post offset-based annotations on a public repo like github.
>> Tim
>> On 09/29/2014 01:54 PM, Peter Szolovits wrote:
>>> I have a set of about 27K documents from MIMIC (circa 2009) in which I
>>> have replaced the weird PHI markers by synthesized pseudonymous data.
>>> These have natural sentence breaks (typically in the middle of lines),
>>> normal paragraph structure, bulleted lists, etc. Assuming it goes to
>>> people who have signed the MIMIC DUA, I could provide these if you are
>>> interested. --Pete Sz.
>>> On Sep 29, 2014, at 1:37 PM, Miller, Timothy
>>> <Timothy.Miller@childrens.harvard.edu> wrote:
>>>> Some of them are a bit artificial for this task, with notes being
>>>> annotated as one sentence per line and offset punctuation. I think
>>>> maybe the 2008 and 2009 data might have original formatting though,
>>>> with newlines not always breaking sentences. That has certain
>>>> advantages over raw MIMIC for training since the PHI isn't so weirdly
>>>> formatted, but then again it is not a mix of styles (that is, the
>>>> styles of newline always terminates sentence vs. sometimes terminates
>>>> sentence). I think it would still have to be paired with another
>>>> dataset to be a representative sample.
>>>> Tim
>>>> On 09/29/2014 01:24 PM, vijay garla wrote:
>>>>> Why not use the i2b2 corpora?
>>>>> On Monday, September 29, 2014, Dligach, Dmitriy <
>>>>> Dmitriy.Dligach@childrens.harvard.edu> wrote:
>>>>>> Maybe creating a made-up set of sentences would be an option? That
>>>>>> way we could agree on the annotation of concrete cases. Although
>>>>>> this would be more of a unit test than a corpus.
>>>>>> Dima
>>>>>> On Sep 27, 2014, at 12:15, Miller, Timothy <
>>>>>> Timothy.Miller@childrens.harvard.edu> wrote:
>>>>>>> I've just been using the opennlp command line cross validator on
>>>>>>> the small dataset I annotated (along with some eyeballing). It
>>>>>>> would be cool if there was a standard clinical resource available
>>>>>>> for this task, but I hadn't considered it much because the data I
>>>>>>> annotated pulls from multiple datasets and the process of arranging
>>>>>>> with different institutions to make something like that available
>>>>>>> would probably be a nightmare.
>>>>>>> Tim
>>>>>>> Sent from my iPad. Sorry about the typos.
>>>>>>>> On Sep 27, 2014, at 12:16 PM, "Dligach, Dmitriy" <
>>>>>>>> Dmitriy.Dligach@childrens.harvard.edu> wrote:
>>>>>>>> Tim, thanks for working on this!
>>>>>>>> Question: do we have some formal way of evaluating the sentence
>>>>>> detector? Maybe we should come up with some dev set that would
>>>>>> include examples from mimic...
>>>>>>>> Dima
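One common shape for the formal evaluation Dima asks about is precision/recall/F1 over predicted sentence-end character offsets against gold offsets. A minimal sketch (the function name and offsets below are invented for illustration, not an existing cTAKES utility):

```python
def boundary_prf(gold_ends, pred_ends):
    """Score a sentence detector by exact match on sentence-end
    character offsets: returns (precision, recall, F1)."""
    gold_ends, pred_ends = set(gold_ends), set(pred_ends)
    tp = len(gold_ends & pred_ends)                     # correct boundaries
    p = tp / len(pred_ends) if pred_ends else 0.0
    r = tp / len(gold_ends) if gold_ends else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Run against offset-based annotations like the ones Tim proposes posting, this would give a reproducible number independent of which annotator produced the boundaries.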
>>>>>>>>> On Sep 27, 2014, at 8:57, Miller, Timothy <
>>>>>>>>> Timothy.Miller@childrens.harvard.edu> wrote:
>>>>>>>>> I have been working on the sentence detector, training a model
>>>>>>>>> to probabilistically split sentences on newlines rather than
>>>>>>>>> forcing sentence breaks. I have checked in a model to the repo
>>>>>>>>> under ctakes-core-res. I also attached a patch to ctakes-core to
>>>>>>>>> the jira issue:
>>>>>>>>> https://issues.apache.org/jira/browse/CTAKES-41
>>>>>>>>> for people to test. The status of my testing is that it doesn't
>>>>>>>>> seem to break on notes where ctakes worked well before (those
>>>>>>>>> where newlines are always sentence breaks), and is a slight
>>>>>>>>> improvement on notes where newlines may or may not be sentence
>>>>>>>>> breaks. Once the change is checked in we can continue improving
>>>>>>>>> the model by adding more data and features, but the first hurdle
>>>>>>>>> I'd like to get past is making sure it runs well enough on the
>>>>>>>>> type of data that the old model worked well on. Let me know if
>>>>>>>>> you have any questions.
>>>>>>>>> Thanks
>>>>>>>>> Tim
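The probabilistic newline handling Tim describes could, in caricature, be reduced to a hand rule like the one below. This is a stand-in sketch for illustration only, not the trained OpenNLP model checked into ctakes-core-res; a real model would learn weights over features like these rather than apply a fixed rule:

```python
def join_soft_newlines(text):
    """Treat a newline as a sentence break only when the text before it
    ends with terminal punctuation; otherwise treat it as a soft wrap
    and rejoin the pieces with spaces."""
    sentences, buf = [], []
    for line in text.split('\n'):
        buf.append(line.strip())
        if line.rstrip().endswith(('.', '!', '?')):   # hard break
            sentences.append(' '.join(p for p in buf if p))
            buf = []
    if any(buf):  # trailing text with no terminal punctuation
        sentences.append(' '.join(p for p in buf if p))
    return sentences
```

On a note wrapped mid-sentence, e.g. "The patient was seen\nin clinic today.", the rule rejoins the wrapped line instead of forcing a break at the newline, which is the behavior being weighed in this thread.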
