ctakes-user mailing list archives

From "Miller, Timothy" <Timothy.Mil...@childrens.harvard.edu>
Subject Re: Paragraph Chunking in cTAKES
Date Wed, 23 Sep 2015 18:37:10 GMT
Hi Lewis,
I'm not sure there is a single correct answer here. But in my experience, certain clinical
datasets with unusual formatting (even without a Tika step) cause problems.

One problem is headers/footers/ASCII tables. They get labeled as "sentences," and the parser
may try to "parse" them, which is annoying but probably not a big deal for any downstream
components -- cTAKES typically isn't claiming to find any relations or clinical entities
in these sentences.

The other major problem is sentence detection. There is a known issue with data in which
line breaks ('\n' and '\r' characters) create sentence breaks by rule, and in certain datasets
that rule is not valid. This _is_ a problem for downstream components, because the dictionary
lookup and most relation extractors work _within_ sentences, so an incorrect sentence break
can lead to a missed entity or relation.
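To make that failure mode concrete, here is a small sketch (not cTAKES code -- just a simplified stand-in for a rule-based sentence detector, with a made-up note and dictionary) of how treating every line break as a sentence boundary loses an entity:

```python
import re

# Simplified stand-in for a rule-based sentence detector that, by rule,
# treats every line break as a sentence boundary.
def split_sentences(text):
    return [s.strip() for s in re.split(r"[\n\r]+", text) if s.strip()]

# A clinical note where a line wrap falls inside a dictionary term.
note = "Patient has a history of atrial\nfibrillation and hypertension."

sentences = split_sentences(note)
# The term "atrial fibrillation" now straddles two "sentences"...
print(sentences)
# ['Patient has a history of atrial', 'fibrillation and hypertension.']

# ...so a dictionary lookup that works within sentence boundaries misses it.
dictionary = {"atrial fibrillation", "hypertension"}
hits = [term for s in sentences for term in dictionary if term in s.lower()]
print(hits)  # ['hypertension'] -- "atrial fibrillation" is lost
```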

I am presenting work at AMIA this year on a new system and some annotations I created for
fixing that problem. We are going to test that system on a new project for both speed and
accuracy, and once we are satisfied with it I think it will eventually become the default
cTAKES sentence detector. With that said, it may not totally solve the problem(s) you're
dealing with. If there is some kind of formatting where page 1 of a scanned document has
the start of a sentence, page 2 has the end of that sentence, and there is header and footer
information in between, we don't have a solution for you. cTAKES probably will not segment
those sentences correctly.

I think there are at least two new types of components/systems that it would be nice to have
someone look into (though I am not sure they would be "interesting research problems" to any
funding agencies):

1) Linguistic vs. non-linguistic information classifier -- segment a given text file into
the parts that should be processed linguistically and those that should not. This could be
a cTAKES/UIMA component.
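As a rough illustration of idea (1) -- not an existing cTAKES/UIMA component -- here is a line-level heuristic that flags narrative prose vs. headers/tables using surface features. The thresholds are illustrative guesses, not tuned values:

```python
# Heuristic sketch: classify each line as "linguistic" (narrative prose)
# or not. All thresholds below are assumptions for illustration only.
def looks_linguistic(line):
    stripped = line.strip()
    if not stripped:
        return False
    tokens = stripped.split()
    if len(tokens) < 4:                        # headers/page numbers are short
        return False
    alpha = sum(ch.isalpha() for ch in stripped)
    if alpha / len(stripped) < 0.6:            # tables/figures are symbol-heavy
        return False
    lowercase_words = sum(t[0].islower() for t in tokens)
    return lowercase_words / len(tokens) > 0.3  # prose isn't ALL-CAPS/Title Case

doc = """\
PAGE 2 OF 7                    MRN: 000123
| BP  | 120/80 | HR | 72 |
The patient was seen today for follow-up of chronic back pain
and reports gradual improvement with physical therapy.
"""

for line in doc.splitlines():
    tag = "PROSE" if looks_linguistic(line) else "SKIP "
    print(tag, line)
```

A real component would presumably use a trained classifier with richer features, but even a heuristic like this shows how the segmentation could work as a preprocessing pass.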

2) Scanned-document preprocessor -- similar to the above, perhaps, but purpose-built to recover
from the kinds of mistakes that occur in scanned documents: headers/footers landing in the
middle of the narrative, odd punctuation characters, etc. It could be that by first finding
non-linguistic information and then carefully excising it you could resolve this problem,
but I don't work with enough of this data to have good intuitions about how hard it
will be.
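A minimal sketch of idea (2), assuming the document is already split into per-page strings and that boilerplate lines repeat verbatim on most pages (real footers with varying page numbers would need pattern matching, which this skips):

```python
from collections import Counter

# Excise header/footer lines that repeat on (nearly) every page, then
# rejoin the text so a sentence split across a page break becomes whole.
def strip_repeated_lines(pages, min_fraction=0.8):
    counts = Counter(line.strip() for page in pages
                     for line in page.splitlines() if line.strip())
    cutoff = min_fraction * len(pages)
    boilerplate = {line for line, n in counts.items() if n >= cutoff}
    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines() if l.strip() not in boilerplate]
        cleaned.append("\n".join(kept).strip())
    # Joining with a space lets a sentence cut at the page break flow
    # into its continuation on the next page.
    return " ".join(cleaned)

pages = [
    "St. Elsewhere Hospital\nThe patient tolerated the\nCONFIDENTIAL",
    "St. Elsewhere Hospital\nprocedure well and was discharged.\nCONFIDENTIAL",
]
print(strip_repeated_lines(pages))
# The patient tolerated the procedure well and was discharged.
```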

Hope that is helpful.

Tim




On 09/23/2015 02:07 PM, Lewis John Mcgibbney wrote:
Hi Folks,

I am looking for some feedback on the accuracy of cTAKES annotations when the input text
does not form proper paragraphs.
Is this known to significantly affect annotation accuracy/performance?
Does anyone have a 'golden' input example of where cTAKES works best for annotation accuracy
and performance?

My situation is as follows: right now I use Apache Tika to parse a multitude of documents,
and I feed the parse results into cTAKES for annotation. Sometimes Tika is not able to form
paragraphs correctly because a paragraph is split across a page boundary.

Another example is when footer information (such as page numbers, DOIs, journal names, etc.)
exists between pages.

Thanks for any feedback.
Lewis

--
Lewis

