From: "Eddie Epstein"
To: uima-user@incubator.apache.org
Date: Mon, 16 Jun 2008 17:33:49 -0400
Subject: Re: Content segmentation

Hi Yaakov,

> I wanted to find out if UIMA has any concept of content segmentation.
> Some of the analysis processing is very memory and CPU intensive and
> if the content happens to be huge (like a book), it will bring the
> server to a crawl.
>
> So, I was wondering if the UIMA framework has any notion of breaking
> up the content into smaller segments.

Content segmentation is a core concept in UIMA, with each CAS typically
considered to contain an "artifact" to be analyzed. Something has to
segment the input corpus into discrete artifacts. In the most common
scenario, a "collection reader" at the front of the UIMA pipeline
segments the input and initializes each CAS. For other scenarios the
"CAS Multiplier", a more general segmentation component, is used to
initialize CASes.

A CAS Multiplier (CM) can be called at any point in a UIMA pipeline;
indeed multiple CM components can be used in the same pipeline.
Consider a scenario where a CM is given an input CAS with a pointer to
a large audio file.
The CM could read the audio file, segment at boundaries appropriate for
subsequent analysis, and create new CASes with just the audio content
for each segment.

Note that the artifact to be analyzed, called the Subject of Analysis
(Sofa), does not have to reside in the CAS itself. UIMA supports the
notion of "remote Sofas" represented in the CAS by a URI. UIMA also
provides stream access methods for remote Sofa content, which in Java
simply map to URI stream reading.

Hoping this actually addresses your question,
Eddie
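
To make the segmentation idea in the reply above a little more concrete,
here is a minimal sketch in plain Java. It is not the UIMA API: the class
name SegmenterSketch, the maxChars parameter, and the rule of cutting at
whitespace are all assumptions made for illustration, and a String stands
in for the artifact (a book-sized text rather than an audio file). The
hasNext()/next() shape only mimics the pattern described above, where one
large input is handed out as a series of smaller segments, each of which
would become its own CAS in a real pipeline.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a segmenting component: it takes one large
// artifact (a String here) and hands out smaller segments one at a time.
// Plain Java for illustration only; none of these names come from UIMA.
final class SegmenterSketch implements Iterator<String> {
    private final String text;
    private final int maxChars;
    private int pos;

    SegmenterSketch(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        this.text = text;
        this.maxChars = maxChars;
        this.pos = skipWhitespace(0); // never start a segment on whitespace
    }

    @Override
    public boolean hasNext() {
        return pos < text.length();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int end = (int) Math.min((long) pos + maxChars, text.length());
        if (end < text.length()) {
            // Prefer the last whitespace in the window so words stay whole;
            // if there is none, fall back to a hard cut at maxChars.
            for (int i = end; i > pos; i--) {
                if (Character.isWhitespace(text.charAt(i))) {
                    end = i;
                    break;
                }
            }
        }
        String segment = text.substring(pos, end).trim();
        pos = skipWhitespace(end);
        return segment;
    }

    // Returns the first index at or after 'from' that is not whitespace.
    private int skipWhitespace(int from) {
        int i = from;
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        return i;
    }

    // Convenience helper: collect every segment into a list.
    static List<String> segmentAll(String text, int maxChars) {
        List<String> segments = new ArrayList<>();
        SegmenterSketch segmenter = new SegmenterSketch(text, maxChars);
        while (segmenter.hasNext()) {
            segments.add(segmenter.next());
        }
        return segments;
    }

    public static void main(String[] args) {
        for (String segment : segmentAll("one two three four five", 9)) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

With a limit of 9 characters the sample text splits into "one two",
"three" and "four five". Handing segments out lazily, rather than
building them all up front, is the point of the sketch: downstream
analysis only ever holds one small piece of the large input at a time.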