incubator-ctakes-dev mailing list archives

From Steven Bethard <steven.beth...@Colorado.EDU>
Subject Re: type system changes needed to read SHARP data
Date Mon, 26 Nov 2012 23:03:54 GMT
A point of clarification: Almost everything we get from the SHARP human annotations is associated
with a span of text by the annotators. And we need to recover those spans of text with our
machine learning models. So in most cases, we need subtypes of Annotation, not subtypes of
TOP. This is perhaps the biggest issue with the current type system: the TOP subtypes contain
most of what we need, but the Annotation subtypes are often too impoverished to capture the
SHARP annotations.

On Nov 26, 2012, at 9:28 PM, "Wu, Stephen T., Ph.D." <> wrote:
>> * I couldn't find an entity type for "Clinical_attribute", "Devices", "Lab",
>> "Phenomena"
> "Devices" and "Phenomena" don't exist yet because they were not part of the
> CEM models.  I need input from someone on CEMs if we're to add these.
> "Clinical_attribute" -- is this what you're looking for:
> org.apache.ctakes.typesystem.type.refsem.Attribute
> It inherits from Element.

But Attribute is a TOP and we need an Annotation here. (An added concern is, does it really
make sense to have a raw Attribute, and not some specific sub-type like BodyLaterality or

> Lab should be at org.apache.ctakes.typesystem.type.refsem.Lab

But Lab is a TOP, and we need an Annotation here.

>> * I couldn't find a modifier type (or alternatively, an Annotation subclass)
>> for the Knowtator annotations "generic_class", "conditional_class",
>> "uncertainty_indicator_class", "distal_or_proximal", "Person",
>> "negation_indicator_class", "historyOf_indicator_class",
>> "superior_or_inferior", "medial_or_lateral", "dorsal_or_ventral",
>> "method_class", "device_class", "allergy_indicator_class", "Route", "Form",
>> "Strength", "Strength number", "Strength unit", "Frequency", "Frequency
>> number", "Frequency unit", "Value", "Value number", "Value unit",
>> "estimated_flag_indicator", "reference_range", "Date", "Status change",
>> "Duration", "Dosage".
> Use the type org.apache.ctakes.typesystem.type.textsem.Modifier with the
> "category" feature.

Should there be constants for each of these categories?

>> * I couldn't find a place for the normalized value of
> "generic_class", --> IdentifiedAnnotation:generic
> "conditional_class",  --> IdentifiedAnnotation:conditional
> "uncertainty_indicator_class", --> IdentifiedAnnotation:uncertainty
> "negation_indicator_class",  --> IdentifiedAnnotation:polarity


> "distal_or_proximal", --> BodyLaterality:value
> "superior_or_inferior", --> BodyLaterality:value
> "dorsal_or_ventral", --> BodyLaterality:value
> "medial_or_lateral", --> BodyLaterality:value
> "device_class", --> ProcedureDevice:value

And then set the Modifier.normalizedForm to BodyLaterality or ProcedureDevice? Ok.

> "Person", --> Entity

But Entity is a TOP, not an Annotation.

>> After working with this data I think we should consider having separate UIMA
>> Annotation sub-types for each of the things that are Modifiers now. For
>> example, if we have a real Severity Annotation for textual mentions of
>> severity, then the CAS makes it easy to select these. We have exactly this use
>> case in relation extractor - we need just the Severity modifiers, excluding
>> all the other modifiers. Basically, I think the principle we should follow in
>> UIMA is:
>> "If you could imagine searching the CAS for something, then that something
>> should have its own Annotation sub-type."
> It's a good point, and a relatively good principle, but we have decided
> against it in the past.  The reason is a countering principle:
> "Do not put locally used (component-specific) types in the CAS."

This principle is not relevant here. The types we're talking about are not used locally within
a single AnalysisEngine. They're read in by the SHARPKnowtatorXMLReader AnalysisEngine and then
used separately in the ModifierExtractorAnnotator, DegreeOfRelationExtractorAnnotator,
EventAnnotator, and TimeAnnotator AnalysisEngines, among others. So they can't be local to a
single AnalysisEngine, and they must be in the CAS.
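
To make the selection argument concrete, here is a minimal sketch in plain Java. It is not UIMA
or cTAKES code: the Modifier and Severity classes, the "severity" category string, and the list
standing in for the CAS index are all hypothetical stand-ins. It only contrasts filtering raw
Modifiers by a category string with asking for a dedicated sub-type.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-ins for CAS annotations; none of these are the real cTAKES types.
public class ModifierSelectionSketch {

    // Hypothetical span-bearing annotation with a free-text "category" feature.
    static class Modifier {
        final int begin;
        final int end;
        final String category;

        Modifier(int begin, int end, String category) {
            this.begin = begin;
            this.end = end;
            this.category = category;
        }
    }

    // Hypothetical dedicated sub-type for textual mentions of severity.
    static class Severity extends Modifier {
        Severity(int begin, int end) {
            super(begin, end, "severity"); // category string is an assumption
        }
    }

    // Option A: every consumer filters raw Modifiers by a category string.
    static List<Modifier> selectByCategory(List<Modifier> index, String category) {
        List<Modifier> result = new ArrayList<>();
        for (Modifier m : index) {
            if (category.equals(m.category)) {
                result.add(m);
            }
        }
        return result;
    }

    // Option B: consumers ask for a type and get back exactly that type.
    static <T extends Modifier> List<T> selectByType(List<Modifier> index, Class<T> type) {
        List<T> result = new ArrayList<>();
        for (Modifier m : index) {
            if (type.isInstance(m)) {
                result.add(type.cast(m));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Modifier> index = new ArrayList<>();
        index.add(new Severity(10, 16));
        index.add(new Modifier(20, 24, "negation_indicator_class"));
        index.add(new Severity(40, 48));

        List<Severity> severities = selectByType(index, Severity.class);
        System.out.println(severities.size() + " severity mentions out of "
                + index.size() + " modifiers");
    }
}
```

Note that both options hold the same number of objects; the difference is only in how a
consumer such as a relation extractor picks out the ones it needs.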

> There is no garbage collection in UIMA (despite things being deleted from
> the index) and extra types will bloat the CAS system, though admittedly it
> is not too terrible a bloat.

I don't see how garbage collection is relevant here. We're going to create exactly the same
number of Modifiers. The only question is whether we create them as raw Modifiers or as Modifier
sub-types. Are you saying there's some significant extra cost to having extra types, even when
the total number of instances across all types is constant?

> Two doubts that could change my mind:
> 1) Do we envision evaluation of the Modifiers/attributes -- apart from the
> Named Entities they're attached to?  If so, we need to preserve this
> information right at the beginning.

That's exactly what I'm talking about with the severity modifiers. We have a severity modifier
extraction annotator, and we *do* need to evaluate its performance by comparing the severity
modifiers it extracts to those in the annotated data. (We need this annotator, just like we
need the UMLS entity annotator, so that our relation extraction annotator can find relations
between severities and UMLS entities.)
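
As a rough illustration of that comparison, the sketch below scores predicted severity spans
against annotated ones in plain Java. The Span class, the score method, and the choice of
exact-span matching are assumptions made for the example; they are not taken from the cTAKES
evaluation code.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical exact-match span evaluation; not the cTAKES evaluation code.
public class SpanEvaluationSketch {

    // A text span identified by its character offsets.
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Span)) {
                return false;
            }
            Span that = (Span) other;
            return begin == that.begin && end == that.end;
        }

        @Override
        public int hashCode() {
            return Objects.hash(begin, end);
        }
    }

    // Returns {precision, recall, F1}, counting a prediction as correct
    // only when its offsets exactly match a gold span.
    static double[] score(Set<Span> gold, Set<Span> predicted) {
        Set<Span> correct = new HashSet<>(predicted);
        correct.retainAll(gold);
        double precision = predicted.isEmpty() ? 0.0 : (double) correct.size() / predicted.size();
        double recall = gold.isEmpty() ? 0.0 : (double) correct.size() / gold.size();
        double sum = precision + recall;
        double f1 = sum == 0.0 ? 0.0 : 2 * precision * recall / sum;
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        Set<Span> gold = new HashSet<>();
        gold.add(new Span(10, 16));
        gold.add(new Span(40, 48));

        Set<Span> predicted = new HashSet<>();
        predicted.add(new Span(10, 16));
        predicted.add(new Span(60, 66));

        double[] prf = score(gold, predicted);
        System.out.println("P=" + prf[0] + " R=" + prf[1] + " F1=" + prf[2]);
    }
}
```

This only works if the gold severity mentions are loaded as spans that can be selected on
their own, separately from the entities they modify.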

The same is essentially true for everything annotated in SHARP. It's all annotated with the
intention of training machine learning models to reproduce those annotations. So we really
do want everything that's in the Knowtator XML annotations to be loaded and accessible to
all our UIMA AnalysisEngines.

> 2) Will these modifiers be reusable downstream?

I'm not sure what you mean here. Are you suggesting that the type system should only have
types for things that external users of cTAKES might need, and that we shouldn't have types
for things that must be passed between different cTAKES AnalysisEngines?

If that's the case, I think this would be a step in a very wrong direction. In UIMA, anything
that has to be passed between AnalysisEngines should be declared in the type system. And the
whole point of having a type system is to ease the passing of this information. So hobbling
the types that we pass between cTAKES annotators just to reduce the size of the type system
for external users doesn't make sense.
