From: "Adam Lally"
Sender: lally.adam@gmail.com
To: uima-user@incubator.apache.org
Subject: Re: UIMA document loading strategy
Date: Wed, 27 Jun 2007 09:43:11 -0400
Message-ID: <2787e08a0706270643i1f9e04fdp6929120db37b9236@mail.gmail.com>
In-Reply-To: <1268e4410706250348k2b9844a7m7c98828a2c419666@mail.gmail.com>

On 6/25/07, Arthit Suriyawongkul wrote:
> Hi,
>
> How does UIMA load a document into memory?
> Does it load the whole document at once, or does it read the document
> only partially (sometimes stream-like)?
>
> I'm currently using GATE and sometimes run into problems when my
> document is very large, because GATE tries to load the whole document
> into memory first and convert it to its own representation.
> My application doesn't need knowledge of the whole document (like DOM),
> but only takes data from a small window (e.g. less than 100
> characters) at a time.
>
> cheers,
> Art
>

Hi Art,

UIMA is flexible with respect to this. You can provide a
CollectionReader that populates a CAS with however much text is
appropriate for your application. So a single document could be split
across many CASes in order to decrease the overall memory requirements.
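
As a rough illustration of the "however much text is appropriate" idea, here is a minimal sketch in plain Java with no UIMA dependency. The class ChunkedTextReader and its methods are hypothetical names made up for this example; they are not part of UIMA. The sketch only shows the chunking itself. In a real CollectionReader, each chunk would presumably be placed into a CAS inside getNext(CAS), for example via CAS.setDocumentText(String).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical helper (not a UIMA class): yields a document in bounded-size chunks. */
final class ChunkedTextReader implements Iterator<String> {
    private final Reader source;
    private final char[] buffer;   // at most one chunk is held in memory at a time
    private String pending;        // chunk read ahead by hasNext(), or null
    private boolean exhausted;     // true once the underlying reader hit end of input

    ChunkedTextReader(Reader source, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.source = source;
        this.buffer = new char[chunkSize];
    }

    @Override
    public boolean hasNext() {
        if (pending == null && !exhausted) {
            pending = readChunk();
        }
        return pending != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String chunk = pending;
        pending = null;
        return chunk;
    }

    /** Fills the buffer as far as possible; returns null if no characters are left. */
    private String readChunk() {
        try {
            int filled = 0;
            while (filled < buffer.length) {
                int n = source.read(buffer, filled, buffer.length - filled);
                if (n < 0) {
                    exhausted = true;
                    break;
                }
                filled += n;
            }
            return filled == 0 ? null : new String(buffer, 0, filled);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience for small inputs and tests: collects every chunk into a list. */
    static List<String> readAll(Reader source, int chunkSize) {
        List<String> chunks = new ArrayList<>();
        ChunkedTextReader reader = new ChunkedTextReader(source, chunkSize);
        while (reader.hasNext()) {
            chunks.add(reader.next());
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Each printed chunk stands in for the text of one CAS.
        for (String chunk : readAll(new StringReader("abcdefghij"), 4)) {
            System.out.println(chunk);
        }
    }
}
```

Note that fixed-size cuts can land in the middle of a word, so an application that looks at a small window of characters would probably want to cut at a natural boundary or let consecutive chunks overlap slightly.
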

It's also possible to split a CAS into smaller CASes, do annotation on
each, and then merge the results. The kind of component that does the
split and merge is called a "CAS Multiplier". There's an example of
this in the uimaj-examples project that comes with the download - see
descriptors/cas_multiplier/Segment_Annotate_Merge_AE. This is
described in the "CAS Multiplier Developer's Guide" section of the
documentation.

Another option is to consider using a "remote Sofa" (Sofa = subject of
analysis). In this case the CAS just contains a URL to where the
actual document lives, not the document text itself. See the
"Annotations, Artifacts, and Sofas" section of the documentation.

-Adam