ctakes-dev mailing list archives

From "Miller, Timothy" <Timothy.Mil...@childrens.harvard.edu>
Subject Re: sentence detector model
Date Mon, 29 Sep 2014 18:45:53 GMT
That does sound like it would be useful, since MIMIC does have both kinds
of linebreak styles in different notes. If I did some annotations on
such a dataset, would it be re-distributable, say on the PhysioNet
website? I believe the ShARe project has a download site there (it is a
layer of annotations on MIMIC). Another option would be for you to post your
raw data there, and I could post offset-based annotations on a public
repo like GitHub.
Tim
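
A minimal sketch of what offset-based (standoff) sentence annotations could look like, assuming the offsets are character positions into the unmodified note text. The function names, the example note, and the JSON layout are all hypothetical; this is not the ShARe format or anything cTAKES reads.

```python
import json


def sentences_to_offsets(text, sentences):
    """Find each sentence in `text`, in order, and return (begin, end) character offsets."""
    spans = []
    cursor = 0
    for sent in sentences:
        # str.index raises ValueError if the sentence is not found after the cursor.
        begin = text.index(sent, cursor)
        end = begin + len(sent)
        spans.append((begin, end))
        cursor = end
    return spans


def apply_offsets(text, spans):
    """Recover the sentence strings from the raw text and the standoff offsets."""
    return [text[begin:end] for begin, end in spans]


# Hypothetical note in which a newline falls in the middle of a sentence.
note = "Pt seen in clinic\ntoday. No acute distress.\nFollow up in 2 weeks."
sents = ["Pt seen in clinic\ntoday.", "No acute distress.", "Follow up in 2 weeks."]

spans = sentences_to_offsets(note, sents)
# Only the offsets would be published; the note text itself stays behind the DUA.
standoff = json.dumps({"doc_id": "example-note", "sentences": spans})
```

Anyone holding the same raw text can rebuild the sentences with apply_offsets, so the public file carries no note content.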


On 09/29/2014 01:54 PM, Peter Szolovits wrote:
> I have a set of about 27K documents from MIMIC (circa 2009) in which I have replaced
> the weird PHI markers by synthesized pseudonymous data.  These have natural sentence breaks
> (typically in the middle of lines), normal paragraph structure, bulleted lists, etc.  Assuming
> it goes to people who have signed the MIMIC DUA, I could provide these if you are interested.
>  --Pete Sz.
>
> On Sep 29, 2014, at 1:37 PM, Miller, Timothy <Timothy.Miller@childrens.harvard.edu> wrote:
>
>> Some of them are a bit artificial for this task, with notes being
>> annotated as one sentence per line and offset punctuation. I think maybe
>> the 2008 and 2009 data might have original formatting though, with
>> newlines not always breaking sentences. That has certain advantages over
>> raw MIMIC for training since the PHI isn't so weirdly formatted, but
>> then again is not a mix of styles (that is, the styles of newline always
>> terminates sentence vs. sometimes terminates sentence). I think it would
>> still have to be paired with another dataset to be a representative sample.
>> Tim
>>
>> On 09/29/2014 01:24 PM, vijay garla wrote:
>>> Why not use the i2b2 corpora?
>>>
>>> On Monday, September 29, 2014, Dligach, Dmitriy <
>>> Dmitriy.Dligach@childrens.harvard.edu> wrote:
>>>
>>>> Maybe creating a made-up set of sentences would be an option? That way we
>>>> could agree on the annotation of concrete cases. Although this would be
>>>> more of a unit test than a corpus.
>>>>
>>>> Dima
>>>>
>>>>
>>>>
>>>>
>>>> On Sep 27, 2014, at 12:15, Miller, Timothy <
>>>> Timothy.Miller@childrens.harvard.edu> wrote:
>>>>
>>>>> I've just been using the opennlp command line cross validator on the
>>>>> small dataset I annotated (along with some eyeballing). It would be cool if
>>>>> there was a standard clinical resource available for this task, but I
>>>>> hadn't considered it much because the data I annotated pulls from multiple
>>>>> datasets and the process of arranging with different institutions to make
>>>>> something like that available would probably be a nightmare.
>>>>> Tim
>>>>>
>>>>> Sent from my iPad. Sorry about the typos.
>>>>>
>>>>>> On Sep 27, 2014, at 12:16 PM, "Dligach, Dmitriy" <
>>>>>> Dmitriy.Dligach@childrens.harvard.edu> wrote:
>>>>>> Tim, thanks for working on this!
>>>>>>
>>>>>> Question: do we have some formal way of evaluating the sentence
>>>>>> detector? Maybe we should come up with some dev set that would include
>>>>>> examples from MIMIC...
>>>>>> Dima
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Sep 27, 2014, at 8:57, Miller, Timothy <
>>>>>>> Timothy.Miller@childrens.harvard.edu> wrote:
>>>>>>> I have been working on the sentence detector newline issue, training a
>>>>>>> model to probabilistically split sentences on newlines rather than forcing
>>>>>>> sentence breaks. I have checked in a model to the repo under
>>>>>>> ctakes-core-res. I also attached a patch to ctakes-core to the jira issue:
>>>>>>> https://issues.apache.org/jira/browse/CTAKES-41
>>>>>>>
>>>>>>> for people to test. The status of my testing is that it doesn't seem
>>>>>>> to break on notes where ctakes worked well before (those where newlines are
>>>>>>> always sentence breaks), and is a slight improvement on notes where
>>>>>>> newlines may or may not be sentence breaks. Once the change is checked in
>>>>>>> we can continue improving the model by adding more data and features, but
>>>>>>> the first hurdle I'd like to get past is making sure it runs well enough on
>>>>>>> the type of data that the old model worked well on. Let me know if you have
>>>>>>> any questions.
>>>>>>> Thanks
>>>>>>> Tim
>
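
For the question raised earlier in the thread about a formal way to evaluate the sentence detector, one common option is to score predicted sentence-end offsets against gold ones with precision, recall, and F1. The sketch below assumes exact-match character offsets; the function name and the example offsets are hypothetical, and this is not the scoring code used by the opennlp cross validator.

```python
def boundary_prf(gold_ends, pred_ends):
    """Score predicted sentence-end offsets against gold offsets (exact match)."""
    gold, pred = set(gold_ends), set(pred_ends)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical note: the gold sentences end at offsets 24, 43 and 65.
gold_ends = [24, 43, 65]
# A detector that always breaks on newlines adds a spurious boundary at 17.
pred_ends = [17, 24, 43, 65]

precision, recall, f1 = boundary_prf(gold_ends, pred_ends)
```

Scoring the two note styles separately (newline always ends a sentence vs. newline sometimes ends a sentence) would show whether a new model regresses on the first style while improving on the second.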

