uima-user mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: Deduplicating Annotations With Same coveredText
Date Tue, 22 Apr 2014 21:10:56 GMT
If you plan to run your pipeline in one JVM (rather than scaling it out over
multiple JVMs), you could use an external resource that holds a plain Java
Set<String> of the unique covered texts found so far.  Then, in the annotator
(or annotators) that add new FeatureStructures representing the possibly
duplicate annotations, you can first check the shared resource to see whether
the text has already been annotated and, if so, skip both creating the
additional FeatureStructure and adding it to the indexes.
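
A minimal sketch of the shared set, in plain Java and independent of the UIMA
API.  The class name SeenCoveredText and the method markIfNew are hypothetical,
chosen only for illustration; wiring the object up as a UIMA external resource
is not shown.  A concurrent set is assumed in case several annotators share the
resource from different threads.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical holder for the covered texts seen so far (not part of UIMA).
class SeenCoveredText {
    // Thread-safe set, in case several annotator instances share it.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Records the text and returns true only the first time it is seen.
    boolean markIfNew(String coveredText) {
        return seen.add(coveredText);
    }

    // Forgets everything, e.g. if deduplication should be per document.
    void clear() {
        seen.clear();
    }
}
```

In the annotator, the check would come before the FeatureStructure is created:
call markIfNew(text) on the candidate's covered text, and only when it returns
true create the annotation and add it to the indexes.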

Would that work for your use case?

On 4/21/2014 10:20 PM, Petr Baudis wrote:
>   Hi!
>   I'm facing the task of deduplicating annotations that have the same
> getCoveredText() value (possibly at different sofa locations) - I'd
> like to keep just a single one of each; for example, if I were to make
> a bag-of-words with only a single annotation per word and the number of
> occurrences as a feature.  (Or, in my case, the annotations are scored
> candidate answers in a QA system that I'd like to merge if they are
> textually the same.)
>   Is there a better way than simply loading all annotations of the type
> into a Java map, mass-dropping them from the indexes, then re-adding some
> of them?
>   My idea was to simply index them by coveredText; then, iterating
> sequentially, it's enough to compare getCoveredText() of the current and
> previous annotation to decide whether to merge them. However, it appears
> that coveredText is not supported as a key feature, so I'd have to make an
> explicit copy of it as a separate feature. Is there any other option?
>   Thanks,
> 				Petr "Pasky" Baudis
