uima-user mailing list archives

From Jens Grivolla <j+...@grivolla.net>
Subject big offsets efficiency, and multiple offsets
Date Wed, 04 Dec 2013 14:31:21 GMT
Hi, we're now starting the EUMSSI project, which deals with integrating 
annotation layers coming from audio, video and text analysis.

We're thinking of basing it all on UIMA, having different views with 
separate audio, video, transcribed text, etc. sofas.  In order to align 
the different views we need to have a common offset specification that 
allows us to map e.g. character offsets to the corresponding timestamps.

In order to avoid float timestamps (which would mean we can't derive 
from Annotation) I was thinking of using audio/video frames with e.g. 
100 or 1000 frames/second.  Annotation has begin and end defined as 
signed 32-bit ints, leaving sufficient room for very long documents even 
at 1000 fps (nearly 25 days of media), so I don't think we're going to 
run into any limits there.  Is there anything that could become 
problematic when working with offsets that are probably quite a bit 
larger than what is typically found with character offsets?
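
As a rough check on that headroom, here is a minimal Java sketch of the arithmetic. The class and method names (FrameOffsetRange, maxDuration, toFrame) are made up for illustration and are not part of UIMA; the only assumption taken from the message is that offsets are signed 32-bit ints counting frames at a fixed rate.

```java
import java.time.Duration;

public class FrameOffsetRange {
    // Longest duration addressable by a signed 32-bit frame offset
    // at the given (assumed constant) frame rate.
    public static Duration maxDuration(int framesPerSecond) {
        long maxFrames = Integer.MAX_VALUE; // 2^31 - 1 = 2,147,483,647
        return Duration.ofMillis(maxFrames * 1000L / framesPerSecond);
    }

    // Convert a timestamp in milliseconds to a frame offset.
    // Math.toIntExact throws ArithmeticException if the result
    // does not fit in an int, instead of silently wrapping.
    public static int toFrame(long millis, int framesPerSecond) {
        return Math.toIntExact(millis * framesPerSecond / 1000L);
    }

    public static void main(String[] args) {
        System.out.println(maxDuration(1000).toHours()); // prints 596
        System.out.println(maxDuration(100).toDays());   // prints 248
    }
}
```

So at 1000 fps the limit is a little under 25 days of continuous media, and at 100 fps about 248 days.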

Also, can I have several indexes on the same annotations in order to 
work with character offsets for text analysis, but then efficiently 
query for overlapping annotations from other views based on frame offsets?
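
To make the kind of query meant here concrete, the following is a plain Java sketch of an overlap test on frame offsets. It deliberately does not use the UIMA index API: Span is a hypothetical stand-in for an annotation that carries frame offsets, and the linear scan only illustrates the semantics, not an efficient index.

```java
import java.util.ArrayList;
import java.util.List;

public class FrameOverlap {
    // Hypothetical stand-in for an annotation with frame offsets; not a UIMA type.
    public static final class Span {
        public final String view;
        public final int beginFrame; // inclusive
        public final int endFrame;   // exclusive

        public Span(String view, int beginFrame, int endFrame) {
            this.view = view;
            this.beginFrame = beginFrame;
            this.endFrame = endFrame;
        }
    }

    // Two half-open intervals [begin, end) overlap exactly when
    // each one starts before the other one ends.
    public static boolean overlaps(Span a, Span b) {
        return a.beginFrame < b.endFrame && b.beginFrame < a.endFrame;
    }

    // Linear scan over all candidates; a sorted index would avoid
    // visiting spans that cannot overlap the query.
    public static List<Span> overlapping(Span query, List<Span> candidates) {
        List<Span> hits = new ArrayList<>();
        for (Span s : candidates) {
            if (overlaps(query, s)) {
                hits.add(s);
            }
        }
        return hits;
    }
}
```

With half-open intervals, a span ending at frame 100 does not overlap one starting at frame 100.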

Btw, if you're interested in the project we have a writeup (condensed 
from the project proposal) here: 
https://dl.dropboxusercontent.com/u/4169273/UIMA_EUMSSI.pdf and there 
will hopefully soon be some content on http://eumssi.eu/

Thanks,
Jens

