ctakes-dev mailing list archives

From Karthik Sarma <ksa...@ksarma.com>
Subject Re: sentence detector model
Date Mon, 29 Sep 2014 18:28:25 GMT
That sounds like it would be perfect for this task.

On Monday, September 29, 2014, Peter Szolovits <psz@mit.edu> wrote:

> I have a set of about 27K documents from MIMIC (circa 2009) in which I
> have replaced the weird PHI markers by synthesized pseudonymous data.
> These have natural sentence breaks (typically in the middle of lines),
> normal paragraph structure, bulleted lists, etc.  Assuming it goes to
> people who have signed the MIMIC DUA, I could provide these if you are
> interested.  --Pete Sz.
>
> On Sep 29, 2014, at 1:37 PM, Miller, Timothy <Timothy.Miller@childrens.harvard.edu> wrote:
>
> > Some of them are a bit artificial for this task, with notes being
> > annotated as one sentence per line and offset punctuation. I think maybe
> > the 2008 and 2009 data might have original formatting though, with
> > newlines not always breaking sentences. That has certain advantages over
> > raw MIMIC for training since the PHI isn't so weirdly formatted, but
> > then again it is not a mix of styles (that is, newline always
> > terminates a sentence vs. newline sometimes terminates a sentence). I
> > think it would still have to be paired with another dataset to be a
> > representative sample.
> > Tim
> >
> > On 09/29/2014 01:24 PM, vijay garla wrote:
> >> Why not use the i2b2 corpora?
> >>
> >> On Monday, September 29, 2014, Dligach, Dmitriy <Dmitriy.Dligach@childrens.harvard.edu> wrote:
> >>
> >>> Maybe creating a made-up set of sentences would be an option? That
> >>> way we could agree on the annotation of concrete cases. Although this
> >>> would be more of a unit test than a corpus.
> >>>
> >>> Dima
> >>>
> >>>
> >>>
> >>>
> >>> On Sep 27, 2014, at 12:15, Miller, Timothy <Timothy.Miller@childrens.harvard.edu> wrote:
> >>>
> >>>> I've just been using the opennlp command line cross validator on the
> >>>> small dataset I annotated (along with some eyeballing). It would be
> >>>> cool if there were a standard clinical resource available for this
> >>>> task, but I hadn't considered it much because the data I annotated
> >>>> pulls from multiple datasets and the process of arranging with
> >>>> different institutions to make something like that available would
> >>>> probably be a nightmare.
> >>>> Tim
> >>>>
> >>>> Sent from my iPad. Sorry about the typos.
> >>>>
> >>>>> On Sep 27, 2014, at 12:16 PM, "Dligach, Dmitriy" <Dmitriy.Dligach@childrens.harvard.edu> wrote:
> >>>>> Tim, thanks for working on this!
> >>>>>
> >>>>> Question: do we have some formal way of evaluating the sentence
> >>>>> detector? Maybe we should come up with some dev set that would
> >>>>> include examples from MIMIC...
> >>>>> Dima
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Sep 27, 2014, at 8:57, Miller, Timothy <Timothy.Miller@childrens.harvard.edu> wrote:
> >>>>>> I have been working on the sentence detector newline issue,
> >>>>>> training a model to probabilistically split sentences on newlines
> >>>>>> rather than forcing a sentence break at every newline. I have
> >>>>>> checked in a model to the repo under ctakes-core-res. I also
> >>>>>> attached a patch to ctakes-core to the JIRA issue:
> >>>>>> https://issues.apache.org/jira/browse/CTAKES-41
> >>>>>>
> >>>>>> for people to test. The status of my testing is that it doesn't
> >>>>>> seem to break on notes where cTAKES worked well before (those where
> >>>>>> newlines are always sentence breaks), and is a slight improvement
> >>>>>> on notes where newlines may or may not be sentence breaks. Once the
> >>>>>> change is checked in we can continue improving the model by adding
> >>>>>> more data and features, but the first hurdle I'd like to get past
> >>>>>> is making sure it runs well enough on the type of data that the old
> >>>>>> model worked well on. Let me know if you have any questions.
> >>>>>> Thanks
> >>>>>> Tim
> >>>
> >
>
>

--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical
Association
ksarma@ksarma.com
gchat: ksarma@gmail.com
linkedin: www.linkedin.com/in/ksarma
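
One way to make the evaluation question raised in the thread concrete is to
score predicted sentence-end offsets against hand-annotated ones. The sketch
below is only an illustration, not the cTAKES or OpenNLP evaluator: the example
note, the function names, and the "newline always ends a sentence" baseline are
all assumptions made up for this purpose.

```python
from typing import List, Set, Tuple

# Hypothetical made-up note: the first newline falls in the middle of a sentence.
NOTE = "Patient denies chest\npain. No fever.\nPlan: discharge."
GOLD_SENTENCES = ["Patient denies chest\npain.", "No fever.", "Plan: discharge."]


def gold_ends(text: str, sentences: List[str]) -> Set[int]:
    """Exclusive end offset of each annotated sentence, located in order."""
    ends: Set[int] = set()
    cursor = 0
    for sentence in sentences:
        start = text.index(sentence, cursor)
        cursor = start + len(sentence)
        ends.add(cursor)
    return ends


def newline_always_breaks(text: str) -> Set[int]:
    """Baseline splitter: every non-empty line is treated as one sentence."""
    ends: Set[int] = set()
    line_start = 0
    for i, ch in enumerate(text + "\n"):
        if ch == "\n":
            if text[line_start:i].strip():
                ends.add(i)
            line_start = i + 1
    return ends


def boundary_prf(gold: Set[int], predicted: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over sentence-end offsets."""
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = gold_ends(NOTE, GOLD_SENTENCES)
    predicted = newline_always_breaks(NOTE)
    print(boundary_prf(gold, predicted))
```

On this made-up note the baseline finds two of the three annotated boundaries
and adds one false break at the mid-sentence newline, which is the kind of
error a newline-aware model is meant to reduce. Any other splitter that returns
a set of end offsets can be scored the same way.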