ctakes-dev mailing list archives

From "Masanz, James J." <Masanz.Ja...@mayo.edu>
Subject RE: training data for sentence detector
Date Mon, 10 Feb 2014 02:13:43 GMT

That sounds like a good plan.  The data used to train the first cTAKES sentence detector
(prior to Apache cTAKES) included fewer than 8000 sentences from clinical notes.

Also of interest may be this table which shows that GENIA, PTB, and Mayo Clinic data were
all used for that model.

http://jamia.bmj.com/content/17/5/507/T2.expansion.html 

-- James

-----Original Message-----
From: Tim Miller [mailto:timothy.miller@childrens.harvard.edu] 
Sent: Friday, February 07, 2014 4:24 PM
To: dev@ctakes.apache.org
Subject: training data for sentence detector

James,
We were discussing the sentence detector thing in person here the other 
day and Pei had a thought that depending on what sources you were using 
for training the sentence detector, we might be able to do something 
equivalent here in Boston by using SHARP, THYME, MIPACQ data which are 
largely from Mayo and probably similar to what you use, then augmenting 
with the little bit of MIMIC that I annotated. I don't know how that 
compares size-wise to the dataset that you are using. Is it quite large, 
or do you think we will be fine if we use data derived from those other 
projects? What do you think of this plan? Anyone else?
Tim

