From: Jens Grivolla
To: user@uima.apache.org
Subject: big offsets efficiency, and multiple offsets
Date: Wed, 04 Dec 2013 15:31:21 +0100

Hi,

we're now starting the EUMSSI project, which deals with integrating annotation layers coming from audio, video and text analysis. We're thinking of basing it all on UIMA, with different views holding separate sofas for audio, video, transcribed text, etc.

In order to align the different views we need a common offset specification that allows us to map e.g. character offsets to the corresponding timestamps. To avoid float timestamps (which would mean we can't derive from Annotation) I was thinking of using audio/video frames at e.g. 100 or 1000 frames/second. Annotation has begin and end defined as signed 32-bit ints, which leaves sufficient room for very long documents even at 1000 fps, so I don't think we're going to run into any limits there.

Is there anything that could become problematic when working with offsets that are probably quite a bit larger than what is typically found with character offsets?

Also, can I have several indexes on the same annotations, so that I can work with character offsets for text analysis but then efficiently query for overlapping annotations from other views based on frame offsets?

Btw, if you're interested in the project, we have a writeup (condensed from the project proposal) here: https://dl.dropboxusercontent.com/u/4169273/UIMA_EUMSSI.pdf and there will hopefully soon be some content on http://eumssi.eu/

Thanks,
Jens
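
As a rough check of the frame-offset arithmetic in the message above, the sketch below converts millisecond timestamps to integer frame offsets and computes how many whole hours fit into a signed 32-bit offset. It is plain Java and does not use any UIMA API; the class name `FrameOffsets`, its method names, the truncating rounding, and the half-open interval convention are all assumptions made for illustration.

```java
// Hypothetical helper, not part of UIMA: all names here are illustrative assumptions.
final class FrameOffsets {
    private FrameOffsets() {}

    /** Converts a timestamp in milliseconds to a frame offset (truncating). */
    static int toFrame(long millis, int framesPerSecond) {
        long frame = millis * framesPerSecond / 1000L;
        // Fail loudly instead of silently wrapping if the offset exceeds an int.
        return Math.toIntExact(frame);
    }

    /** Converts a frame offset back to milliseconds (truncating). */
    static long toMillis(int frame, int framesPerSecond) {
        return frame * 1000L / framesPerSecond;
    }

    /** Whole hours addressable with a signed 32-bit offset at the given rate. */
    static long maxHours(int framesPerSecond) {
        return Integer.MAX_VALUE / (framesPerSecond * 3600L);
    }

    /** Overlap test for half-open spans [begin, end) expressed in frames. */
    static boolean overlaps(int beginA, int endA, int beginB, int endB) {
        return beginA < endB && beginB < endA;
    }

    public static void main(String[] args) {
        System.out.println(maxHours(1000)); // 596 hours, a bit under 25 days
        System.out.println(maxHours(100));  // 5965 hours
        System.out.println(toFrame(1500L, 100)); // 150
    }
}
```

Under these assumptions a single recording stays within range for roughly 596 hours at 1000 fps, which is consistent with the expectation that the int range is not a practical limit. Whether large offsets affect index performance in UIMA is a separate question that this sketch does not address.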