Subject: svn commit: r481286 - /incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/uima/tug.cas_multiplier.xml Date: Fri, 01 Dec 2006 16:46:32 -0000 To: uima-commits@incubator.apache.org From: alally@apache.org Author: alally Date: Fri Dec 1 08:46:30 2006 New Revision: 481286 URL: http://svn.apache.org/viewvc?view=rev&rev=481286 Log: UIMA-68: added documentation for using a CAS Multiplier to Merge CASes http://issues.apache.org/jira/browse/UIMA-68 Modified: 
incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/uima/tug.cas_multiplier.xml Modified: incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/uima/tug.cas_multiplier.xml URL: http://svn.apache.org/viewvc/incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/uima/tug.cas_multiplier.xml?view=diff&rev=481286&r1=481285&r2=481286 ============================================================================== --- incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/uima/tug.cas_multiplier.xml (original) +++ incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/uima/tug.cas_multiplier.xml Fri Dec 1 08:46:30 2006 @@ -26,25 +26,22 @@ CAS Multiplier Developer's Guide - The UIMA analysis components (Annotators and CAS Consumers) described previously - in this manual all take a single CAS as input, optionally make modifications to it, and - output that same CAS. This chapter describes an advanced feature that became available in - the UIMA SDK v2.0: a new type of analysis component called a CAS - Multiplier, which can create new CASes during processing. + The UIMA analysis components (Annotators and CAS Consumers) described previously in this manual all take a + single CAS as input, optionally make modifications to it, and output that same CAS. This chapter describes an + advanced feature that became available in the UIMA SDK v2.0: a new type of analysis component called a + CAS Multiplier, which can create new CASes during processing. - CAS Multipliers are often used to split a large artifact into manageable pieces. This - is a common requirement of audio and video analysis applications, but can also occur in - text analysis on very large documents. A CAS Multiplier would take as input a single CAS - representing the large artifact (perhaps by a remote reference to the actual data — - see ) and produce as output a series of new - CASes each of which contains only a small portion of the original artifact. 
+ CAS Multipliers are often used to split a large artifact into manageable pieces. This is a common requirement + of audio and video analysis applications, but can also occur in text analysis on very large documents. A CAS + Multiplier would take as input a single CAS representing the large artifact (perhaps by a remote reference to the + actual data — see ) and produce as output a series of new CASes each of which + contains only a small portion of the original artifact. - CAS Multipliers are not limited to dividing an artifact into smaller pieces, - however. A CAS Multiplier can also be used to combine smaller segments together to form - larger segments. In general, a CAS Multiplier is used to change - the segmentation of a series of CASes; that is, to change how a stream of data is divided - among discrete CAS objects. + CAS Multipliers are not limited to dividing an artifact into smaller pieces, however. A CAS Multiplier can + also be used to combine smaller segments together to form larger segments. In general, a CAS Multiplier is used to + change the segmentation of a series of CASes; that is, to change how a stream of data is + divided among discrete CAS objects.
Developing the CAS Multiplier Code @@ -53,87 +50,83 @@ CAS Multiplier Interface Overview CAS Multiplier implementations should extend from the - JCasMultiplier_ImplBase or - CasMultiplier_ImplBase classes, depending on which CAS - interface they prefer to use. As with other types of analysis components, the CAS - Multiplier ImplBase classes define optional initialize, - destroy, and reconfigure methods. - There are then three required methods: process, - hasNext, and next. The framework - interacts with these methods as follows: - - The framework calls the CAS Multiplier's - process method, passing it an input CAS. The process method - returns, but may hold on to a reference to the input CAS. - - The framework then calls the CAS Multiplier's - hasNext method. The CAS Multiplier should return - true from this method if it intends to output one or more new - CASes (for instance, segments of this CAS), and false if - not. - - If hasNext returned true, the framework - will call the CAS Multiplier's next method. The CAS - Multiplier creates a new CAS (we will see how in a moment), populates it, and returns - it from the hasNext method. - - Steps 2 and 3 continue until hasNext returns - false. - - From the time when process is called until the - hasNext method returns false, the CAS Multiplier - owns the CAS that was passed to its process - method. The CAS Multiplier can store a reference to this CAS in a local field and can - read from it or write to it during this time. Once hasNext - returns false, the CAS Multiplier gives up ownership of the input CAS and should no - longer retain a reference to it. + JCasMultiplier_ImplBase or CasMultiplier_ImplBase + classes, depending on which CAS interface they prefer to use. As with other types of analysis components, the + CAS Multiplier ImplBase classes define optional initialize, + destroy, and reconfigure methods. There are then three + required methods: process, hasNext, and + next. 
The framework interacts with these methods as follows: + + + + The framework calls the CAS Multiplier's process method, passing it an + input CAS. The process method returns, but may hold on to a reference to the input CAS. + + + + The framework then calls the CAS Multiplier's hasNext method. The CAS + Multiplier should return true from this method if it intends to output one or more + new CASes (for instance, segments of this CAS), and false if not. + + + + If hasNext returned true, the framework will call the CAS Multiplier's + next method. The CAS Multiplier creates a new CAS (we will see how in a moment), + populates it, and returns it from the next method. + + + + Steps 2 and 3 continue until hasNext returns false. + + + + From the time when process is called until the hasNext + method returns false, the CAS Multiplier owns the CAS that was passed to its + process method. The CAS Multiplier can store a reference to this CAS in a local field and + can read from it or write to it during this time. Once hasNext returns false, the CAS + Multiplier gives up ownership of the input CAS and should no longer retain a reference to it.
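The calling sequence described above can be modelled in plain Java. This is only an illustrative sketch, not UIMA code: MiniMultiplier, WordSplitter, and drive are hypothetical names invented here, and a String stands in for a CAS. It shows the framework side of the contract, where process is called once and hasNext/next are then called in a loop until hasNext returns false.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Plain-Java model of the process/hasNext/next calling sequence.
 * All names here are hypothetical stand-ins, not UIMA classes;
 * a String plays the role of a CAS.
 */
class CallSequenceSketch {

  /** Hypothetical stand-in for the three required CAS Multiplier methods. */
  interface MiniMultiplier {
    void process(String input);
    boolean hasNext();
    String next();
  }

  /** Produces one output per whitespace-separated word (illustration only). */
  static class WordSplitter implements MiniMultiplier {
    private String[] words = new String[0];
    private int pos = 0;

    public void process(String input) {
      // State derived from the input may be kept until hasNext() returns false.
      String trimmed = input.trim();
      words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
      pos = 0;
    }

    public boolean hasNext() {
      return pos < words.length;
    }

    public String next() {
      return words[pos++];
    }
  }

  /** Mimics the framework: step 1 (process), then steps 2 and 3 until hasNext is false. */
  static List<String> drive(MiniMultiplier multiplier, String input) {
    List<String> outputs = new ArrayList<>();
    multiplier.process(input);
    while (multiplier.hasNext()) {
      outputs.add(multiplier.next());
    }
    return outputs;
  }

  public static void main(String[] args) {
    System.out.println(drive(new WordSplitter(), "one two three"));
  }
}
```

In the real framework the loop lives inside UIMA and the outputs are pooled CAS instances, so this model says nothing about CAS ownership or release.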
How to Get an Empty CAS Instance - The CAS Multiplier's next method must return a CAS - instance that represents a new representation of the input artifact. Since CAS - instances are managed by the framework, the CAS Multiplier cannot actually create a - new CAS; instead it should request an empty CAS by calling the method: - + The CAS Multiplier's next method must return a CAS instance that represents + a new representation of the input artifact. Since CAS instances are managed by the framework, the CAS + Multiplier cannot actually create a new CAS; instead it should request an empty CAS by calling the method: CAS getEmptyCAS() or -JCas getEmptyJCas() - which are defined on the CasMultiplier_ImplBase and +JCas getEmptyJCas() which are + defined on the CasMultiplier_ImplBase and JCasMultiplier_ImplBase classes, respectively. - Note that if it is more convenient you can request an empty CAS during the - process or hasNext methods, not just - during the next method. - - By default, a CAS Multiplier is only allowed to hold one output CAS instance at a - time. You must return the CAS from the next method before you can - request a second CAS. If you try to call getEmptyCAS a second time you will get an - Exception. You can change this default behavior by overriding the method - getCasInstancesRequired to return the number of CAS - instances that you need. Be aware that CAS instances consume a significant amount of - memory, so setting this to a large value will cause your application to use a lot of RAM. - So, for example, it is not a good practice to attempt to generate a large number of new - CASes in the CAS Multiplier's process method. Instead, - you should spread your processing out across the calls to the - hasNext or next methods. + Note that if it is more convenient you can request an empty CAS during the process or + hasNext methods, not just during the next method. + + By default, a CAS Multiplier is only allowed to hold one output CAS instance at a time. 
You must return the + CAS from the next method before you can request a second CAS. If you try to call + getEmptyCAS a second time you will get an Exception. You can change this default behavior by overriding the + method getCasInstancesRequired to return the number of CAS instances that you need. + Be aware that CAS instances consume a significant amount of memory, so setting this to a large value will cause + your application to use a lot of RAM. So, for example, it is not a good practice to attempt to generate a large + number of new CASes in the CAS Multiplier's process method. Instead, you should + spread your processing out across the calls to the hasNext or + next methods.
Example Code - This section walks through the source code of an example CAS Multiplier that - breaks text documents into smaller pieces. The Java class for the example is - org.apache.uima.examples.casMultiplier.SimpleTextSegmenter - and the source code is included in the UIMA SDK under the - examples/src directory. + This section walks through the source code of an example CAS Multiplier that breaks text documents into + smaller pieces. The Java class for the example is + org.apache.uima.examples.casMultiplier.SimpleTextSegmenter and the source + code is included in the UIMA SDK under the examples/src directory. -
Overall Structure +
+ Overall Structure public class SimpleTextSegmenter extends JCasMultiplier_ImplBase { @@ -157,13 +150,14 @@ The SimpleTextSegmenter class extends JCasMultiplier_ImplBase and implements the optional - initialize method as well as the required - process, hasNext, and - next methods. Each method is described below. + initialize method as well as the required process, + hasNext, and next methods. Each method is described + below.
-
Initialize Method +
+ Initialize Method public void initialize(UimaContext aContext) throws @@ -173,15 +167,15 @@ "segmentSize")).intValue(); } - Like an Annotator, a CAS Multiplier can override the initialize method and - read configuration parameter values from the UimaContext. The - SimpleTextSegmenter defines one parameter, Segment Size, - which determines the approximate size (in characters) of each segment that it will + Like an Annotator, a CAS Multiplier can override the initialize method and read configuration + parameter values from the UimaContext. The SimpleTextSegmenter defines one parameter, Segment + Size, which determines the approximate size (in characters) of each segment that it will produce.
-
Process Method +
+ Process Method public void process(JCas aJCas) @@ -201,93 +195,90 @@ } } - The process method receives a new JCas to be processed(segmented) by this CAS - Multiplier. The SimpleTextSegmenter extracts some information from this JCas - and stores it in fields (the document text is stored in the field mDoc and the source - URI in the field mDocURI). Recall that the CAS Multiplier is considered to - own the JCas from the time when process is called until the time - when hasNext returns false. Therefore it is acceptable to retain references to - objects from the JCas in a CAS Multiplier, whereas this should never be done in an - Annotator. The CAS Multiplier could have chosen to store a reference to the JCas - itself, but that was not necessary for this example. - - The CAS Multiplier also initializes the mPos variable to 0. This variable is a - position into the document text and will be incremented as each new segment is - produced. + The process method receives a new JCas to be processed (segmented) by this CAS Multiplier. The + SimpleTextSegmenter extracts some information from this JCas and stores it in fields (the document text + is stored in the field mDoc and the source URI in the field mDocUri). Recall that the CAS Multiplier is + considered to own the JCas from the time when process is called until the time when hasNext + returns false. Therefore it is acceptable to retain references to objects from the JCas in a CAS + Multiplier, whereas this should never be done in an Annotator. The CAS Multiplier could have chosen to + store a reference to the JCas itself, but that was not necessary for this example. + + The CAS Multiplier also initializes the mPos variable to 0. This variable is a position into the + document text and will be incremented as each new segment is produced.
-
HasNext Method +
+ HasNext Method public boolean hasNext() throws AnalysisEngineProcessException { return mPos < mDoc.length(); } - The job of the hasNext method is to report whether there are any additional - output CASes to produce. For this example, the CAS Multiplier will break the entire - input document into segments, so we know there will always be a next segment until - the very end of the document has been reached. + The job of the hasNext method is to report whether there are any additional output CASes to produce. For + this example, the CAS Multiplier will break the entire input document into segments, so we know there will + always be a next segment until the very end of the document has been reached.
-
Next Method +
+ Next Method - public AbstractCas next() throws AnalysisEngineProcessException { - int breakAt = mPos + mSegmentSize; - if (breakAt > mDoc.length()) - breakAt = mDoc.length(); - - // Search for the next newline character. Note: this example - // segmenter implementation assumes that the document contains many - // newlines. In the worst case, if this segmenter is run on a - // document with no newlines, it will produce only one segment - // containing the entire document text. A better implementation - // might specify a maximum segment size as well as a minimum. - - while (breakAt < mDoc.length() && mDoc.charAt(breakAt-1) != 'n') - breakAt++; - - JCas jcas = getEmptyJCas(); - try { - jcas.setDocumentText(mDoc.substring(mPos, breakAt)); - //if original CAS had SourceDocumentInformation, - //also add SourceDocumentInformation to each segment - if (mDocUri != null) { - SourceDocumentInformation sdi = new SourceDocumentInformation(jcas); - sdi.setUri(mDocUri); - sdi.setOffsetInSource(mPos); - sdi.setDocumentSize(breakAt - mPos); - sdi.addToIndexes(); + public AbstractCas next() throws AnalysisEngineProcessException { + int breakAt = mPos + mSegmentSize; + if (breakAt > mDoc.length()) + breakAt = mDoc.length(); + // search for the next newline character. Note: this example segmenter implementation + // assumes that the document contains many newlines. In the worst case, if this segmenter + // is run on a document with no newlines, it will produce only one segment containing the + // entire document text. A better implementation might specify a maximum segment size as + // well as a minimum. 
+ while (breakAt < mDoc.length() && mDoc.charAt(breakAt - 1) != '\n') + breakAt++; + + JCas jcas = getEmptyJCas(); + try { + jcas.setDocumentText(mDoc.substring(mPos, breakAt)); + // if original CAS had SourceDocumentInformation, also add SourceDocumentInformation + // to each segment + if (mDocUri != null) { + SourceDocumentInformation sdi = new SourceDocumentInformation(jcas); + sdi.setUri(mDocUri); + sdi.setOffsetInSource(mPos); + sdi.setDocumentSize(breakAt - mPos); + sdi.addToIndexes(); + + if (breakAt == mDoc.length()) { + sdi.setLastSegment(true); + } + } + + mPos = breakAt; + return jcas; + } catch (Exception e) { + jcas.release(); + throw new AnalysisEngineProcessException(e); } - - mPos = breakAt; - return jcas; - } - catch(Exception e) { - jcas.release(); - throw new AnalysisEngineProcessException(e); - } -} + } - The next method actually produces the next segment and - returns it. The framework guarantees that it will not call - next unless hasNext has returned true - since the last call to process or next - . + The next method actually produces the next segment and returns it. The + framework guarantees that it will not call next unless + hasNext has returned true since the last call to process or + next. - Note that in order to produce a segment, the CAS Multiplier must get an empty - JCas to populate. This is done by the line: + Note that in order to produce a segment, the CAS Multiplier must get an empty JCas to populate. This is + done by the line: JCas jcas = getEmptyJCas(); - This requests an empty JCas from the framework, which maintains a pool of JCas - instances to draw from. + This requests an empty JCas from the framework, which maintains a pool of JCas instances to draw + from. - Also, note the use of the try...catch block to ensure - that a JCas is released back to the pool if an exception occurs. This is very - important to allow a CAS Multiplier to recover from errors. 
+ Also, note the use of the try...catch block to ensure that a JCas is released back + to the pool if an exception occurs. This is very important to allow a CAS Multiplier to recover from + errors.
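The break-point logic in next can be tried outside UIMA. The sketch below is a plain-Java illustration under stated assumptions: SegmentSketch and segment are hypothetical names, Strings replace the JCas, and the guard on segmentSize is an addition that the example class does not have. The outer loop condition mirrors hasNext and the inner loop mirrors the search for the next newline.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone version of the segmenter's break-point search. */
class SegmentSketch {

  /** Splits doc into segments of at least segmentSize characters, each ending after a newline where possible. */
  static List<String> segment(String doc, int segmentSize) {
    if (segmentSize < 1) {
      // Added guard (assumption): a non-positive size would never advance the position.
      throw new IllegalArgumentException("segmentSize must be >= 1");
    }
    List<String> segments = new ArrayList<>();
    int pos = 0;
    while (pos < doc.length()) {                       // same test as hasNext()
      int breakAt = Math.min(pos + segmentSize, doc.length());
      // Advance until the segment ends just after a newline, or the text runs out.
      while (breakAt < doc.length() && doc.charAt(breakAt - 1) != '\n') {
        breakAt++;
      }
      segments.add(doc.substring(pos, breakAt));
      pos = breakAt;
    }
    return segments;
  }

  public static void main(String[] args) {
    for (String s : segment("ab\ncd\nef", 2)) {
      System.out.println("[" + s.replace("\n", "\\n") + "]");
    }
  }
}
```

As the comment in the example warns, a document without newlines comes back as a single segment, which is why a maximum segment size would be a sensible addition.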
@@ -296,21 +287,19 @@
Creating the CAS Multiplier Descriptor - There is not a separate type of descriptor for a CAS Multiplier. CAS Multiplier are - considered a type of Analysis Engine, and so their descriptors use the same syntax as any - other Analysis Engine Descriptor. - - The descriptor for the SimpleTextSegmenter is located in the - examples/descriptors/cas_multiplier/SimpleTextSegmenter.xml directory - of the UIMA SDK. - - The Analysis Engine Description, in its Operational Properties - section, now contains a new outputsNewCASes property which takes a - Boolean value. If the Analysis Engine is a CAS Multiplier, this property should be set to - true. + There is not a separate type of descriptor for a CAS Multiplier. CAS Multipliers are considered a type of + Analysis Engine, and so their descriptors use the same syntax as any other Analysis Engine Descriptor. + + The descriptor for the SimpleTextSegmenter is located in the + examples/descriptors/cas_multiplier/SimpleTextSegmenter.xml directory of the + UIMA SDK. + + The Analysis Engine Description, in its Operational Properties section, now contains a + new outputsNewCASes property which takes a Boolean value. If the Analysis Engine is a CAS + Multiplier, this property should be set to true. 
- If you use the CDE, be sure to check the Outputs new CASes box in the - Runtime Information section on the Overview page, as shown here: + If you use the CDE, be sure to check the Outputs new CASes box in the Runtime Information + section on the Overview page, as shown here: @@ -325,38 +314,44 @@ If you edit the Analysis Engine Descriptor by hand, you need to add a - <outputsNewCASes> element to your descriptor as shown - here: + <outputsNewCASes> element to your descriptor as shown here: - <operationalProperties> - <modifiesCas>false</modifiesCas> - <multipleDeploymentAllowed>true</multipleDeploymentAllowed> - <outputsNewCASes>true</outputsNewCASes> + + <operationalProperties> + <modifiesCas>false</modifiesCas> + <multipleDeploymentAllowed>true</multipleDeploymentAllowed> + <outputsNewCASes>true</outputsNewCASes> </operationalProperties> - The modifiedCas operational property refers to the input - CAS, not the new output CASes produced. So our example SimpleTextSegmenter has - modifiesCas set to false since it doesn't modify the input CAS. + + The modifiesCas operational property refers to the input CAS, not the new output CASes + produced. So our example SimpleTextSegmenter has modifiesCas set to false since it doesn't modify the + input CAS.
Using a CAS Multiplier in an Aggregate Analysis Engine - You can include a CAS Multiplier as a component in an Aggregate Analysis Engine. For - example, this allows you to construct an Aggregate Analysis Engine that takes each - input CAS, breaks it up into segments, and runs a series of Annotators on each - segment. + You can include a CAS Multiplier as a component in an Aggregate Analysis Engine. For example, this allows + you to construct an Aggregate Analysis Engine that takes each input CAS, breaks it up into segments, and runs a + series of Annotators on each segment.
Adding the CAS Multiplier to the Aggregate - Since CAS Multiplier are considered a type of Analysis Engine, adding them to an - aggregate works the same way as for other Analysis Engines. Using the CDE, you just - click the Add... button in the Component Engines view and browse to - the Analysis Engine Descriptor of your CAS Multiplier. If editing the aggregate - descriptor directly, just import the Analysis Engine - Descriptor of your CAS Multiplier as usual. + Since CAS Multipliers are considered a type of Analysis Engine, adding them to an aggregate works the same + way as for other Analysis Engines. Using the CDE, you just click the Add... button in the + Component Engines view and browse to the Analysis Engine Descriptor of your CAS Multiplier. If editing the + aggregate descriptor directly, just import the Analysis Engine Descriptor of your + CAS Multiplier as usual. + + An example descriptor for an Aggregate Analysis Engine containing a CAS Multiplier is provided in + examples/descriptors/cas_multiplier/SegmenterAndTokenizerAE.xml. This + Aggregate runs the SimpleTextSegmenter example to break a large document into + segments, and then runs each segment through the SimpleTokenAndSentenceAnnotator. + Try running it in the Document Analyzer tool with a large text file as input, to see that it outputs multiple + output CASes, one for each segment produced by the SimpleTextSegmenter.
@@ -388,15 +383,22 @@ that implements UIMA's default flow defines a configuration parameter ActionAfterCasMultiplier that can take the following values: - continue – the CAS continues on to the next element in the - flow - stop – the CAS will no longer continue in the flow, and will be - returned from the aggregate if possible. - drop – the CAS will no longer continue in the flow, and will be dropped - (not returned from the aggregate) if possible. - dropIfNewCasProduced (the default) – if the CAS multiplier - produced a new CAS as a result of processing this CAS, then this CAS will be dropped. If not, then this CAS will - continue. + + continue – the CAS continues on to the next element in the flow + + + stop – the CAS will no longer continue in the flow, and will be returned + from the aggregate if possible. + + + drop – the CAS will no longer continue in the flow, and will be dropped + (not returned from the aggregate) if possible. + + + dropIfNewCasProduced (the default) – if the CAS multiplier produced + a new CAS as a result of processing this CAS, then this CAS will be dropped. If not, then this CAS will + continue. + You can override this parameter in your Aggregate Analysis Engine the same way you would override a @@ -404,6 +406,7 @@ FixedFlowController implementation by importing its descriptor into your aggregate as follows: + <flowController key="FixedFlowController"> <import name="org.apache.uima.flow.FixedFlowController"/> @@ -411,6 +414,8 @@ The parameter could then be overridden as, for example: + + <configurationParameters> <configurationParameter> @@ -434,8 +439,8 @@ </configurationParameterSettings> - This overriding can also be done using the Component Descriptor Editor tool. - An example of an Analysis Engine that overrides this parameter can be found in + This overriding can also be done using the Component Descriptor Editor tool. 
An example of an Analysis + Engine that overrides this parameter can be found in examples/descriptors/cas_multiplier/Segment_Annotate_Merge_AE.xml. For more information about how to specify a flow controller as part of your Aggregate Analysis Engine descriptor, see . @@ -447,33 +452,28 @@
-
Aggregate CAS Multipliers +
+ Aggregate CAS Multipliers - An important consideration when you put a CAS Multiplier inside an Aggregate - Analysis Engine is whether you want the Aggregate to also function as a CAS Multiplier - – that is, whether you want the new output CASes produced within the Aggregate - to be output from the Aggregate. This is controlled by the - <outputsNewCASes> element in the Operational - Properties of your Aggregate Analysis Engine descriptor. The syntax is the same as - what was described in - . - - If you set this property to true, then any new output CASes - produced by a CAS Multiplier inside this Aggregate will be output from the Aggregate. - Thus the Aggregate will function as a CAS Multiplier and can be used in any of the ways in - which a primitive CAS Multiplier can be used. - - If you set the <outputsNewCASes> property to false - , then any new output CASes produced by a CAS Multiplier inside the Aggregate will be - dropped (i.e. the CASes will be released back to the pool) once they have finished - being processed. Such an Aggregate Analysis Engine functions just like a - normal non-CAS-Multiplier Analysis Engine; the fact that CAS - Multiplication is occurring inside it is hidden from users of that Analysis - Engine. - If you want to output some new Output CASes and not others, you need to - implement a custom Flow Controller that makes this decision — see . - + An important consideration when you put a CAS Multiplier inside an Aggregate Analysis Engine is whether + you want the Aggregate to also function as a CAS Multiplier + – that is, whether you want the new output CASes produced within the Aggregate to be output from the + Aggregate. This is controlled by the <outputsNewCASes> element in the + Operational Properties of your Aggregate Analysis Engine descriptor. The syntax is the same as what was + described in . 
+ + If you set this property to true, then any new output CASes produced by a CAS + Multiplier inside this Aggregate will be output from the Aggregate. Thus the Aggregate will function as a CAS + Multiplier and can be used in any of the ways in which a primitive CAS Multiplier can be used. + + If you set the <outputsNewCASes> property to false , then any new output + CASes produced by a CAS Multiplier inside the Aggregate will be dropped (i.e. the CASes will be released back + to the pool) once they have finished being processed. Such an Aggregate Analysis Engine functions just like a + normal non-CAS-Multiplier Analysis Engine; the fact that CAS Multiplication is + occurring inside it is hidden from users of that Analysis Engine. + If you want to output some new Output CASes and not others, you need to implement a custom Flow Controller + that makes this decision — see .
@@ -481,46 +481,42 @@
Using a CAS Multiplier in a Collection Processing Engine - It is currently a limitation that CAS Multiplier cannot be deployed directly in a - Collection Processing Engine. The only way that you can use a CAS Multiplier in a CPE is to - first wrap it in an Aggregate Analysis Engine whose outputsNewCASes - property is set to false, which in effect hides the - existence of the CAS Multiplier from the CPE. - - Note that you can build an Aggregate Analysis Engine that consists of CAS - Multipliers and Annotators, followed by CAS Consumers. This can simulate what a CPE - would do, but without the deployment and error handling options that the CPE - provides. + It is currently a limitation that CAS Multipliers cannot be deployed directly in a Collection Processing + Engine. The only way that you can use a CAS Multiplier in a CPE is to first wrap it in an Aggregate Analysis Engine + whose outputsNewCASes property is set to false, which in effect + hides the existence of the CAS Multiplier from the CPE. + + Note that you can build an Aggregate Analysis Engine that consists of CAS Multipliers and Annotators, + followed by CAS Consumers. This can simulate what a CPE would do, but without the deployment and error handling + options that the CPE provides.
Calling a CAS Multiplier from an Application - The AnalysisEngine interface has the following methods - that allow you to interact with CAS Multiplier: - CasIterator - processAndOutputNewCASes(CAS) + The AnalysisEngine interface has the following methods that allow you to interact + with a CAS Multiplier: + + + CasIterator processAndOutputNewCASes(CAS) - JCasIterator - processAndOutputNewCASes(JCas) + + JCasIterator processAndOutputNewCASes(JCas) - From your application, you call processAndOutputNewCASes - and pass it the input CAS. An iterator is returned that allows you to step through each of - the new output CASes that are produced by the Analysis Engine. - - It is very important to realize that CASes are pooled objects and so your - application must release each CAS (by calling the CAS.release() - method) that it obtains from the CasIterator before it calls - the CasIterator.next method again. Otherwise, the CAS pool will - be exhausted and a deadlock will occur. - - The example code in the class - org.apache.uima.examples.casMultiplier. - CasMultiplierExampleApplication illusrates this. Here is the main - processing loop: + From your application, you call processAndOutputNewCASes and pass it the input + CAS. An iterator is returned that allows you to step through each of the new output CASes that are produced by the + Analysis Engine. + + It is very important to realize that CASes are pooled objects and so your application must release each CAS + (by calling the CAS.release() method) that it obtains from the CasIterator + before it calls the CasIterator.next method again. Otherwise, + the CAS pool will be exhausted and a deadlock will occur. + + The example code in the class org.apache.uima.examples.casMultiplier. + CasMultiplierExampleApplication illustrates this. 
Here is the main processing loop: CasIterator casIterator = ae.processAndOutputNewCASes(initialCas); @@ -536,24 +532,209 @@ outCas.release(); Note that as defined by the CAS Multiplier contract in , the CAS Multiplier owns the - input CAS (initialCAS in the example) until the last new output - CAS has been produced. This means that the application should not try to make changes to - initialCAS until after the - CasIterator.hasNext method has returned false, indicating - that the segmenter has finished. - - Note that the processing time of the Analysis Engine is spread out over the calls to - the CasIterator's hasNext and next - methods. That is, the next output CAS may not actually be produced and annotated until - the application asks for it. So the application should not expect calls to the - CasIterator to necessarily complete quickly. - - Also, calls to the CasIterator may throw Exceptions - indicating an error has occurred during processing. If an Exception is thrown, all - processing of the input CAS will stop, and no more output CASes will be produced. There is - currently no error recovery mechanism that will allow processing to continue after an - exception. + linkend="ugr.tug.cm.cm_interface_overview"/>, the CAS Multiplier owns the input CAS + (initialCAS in the example) until the last new output CAS has been produced. This means + that the application should not try to make changes to initialCAS until after the + CasIterator.hasNext method has returned false, indicating that the segmenter has + finished. + + Note that the processing time of the Analysis Engine is spread out over the calls to the + CasIterator's hasNext and next methods. That is, the next + output CAS may not actually be produced and annotated until the application asks for it. So the application + should not expect calls to the CasIterator to necessarily complete quickly. + + Also, calls to the CasIterator may throw Exceptions indicating an error has + occurred during processing. 
If an Exception is thrown, all processing of the input CAS will stop, and no more + output CASes will be produced. There is currently no error recovery mechanism that will allow processing to + continue after an exception. +
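The pooling rule stated earlier in this section (release each output CAS before asking the iterator for the next one) can be illustrated with a small model. This is only a sketch in plain Java and uses no UIMA classes: CasPoolSketch, PooledCas, tryAcquire and consume are hypothetical names invented for the illustration, and the model reports an exhausted pool by returning null where a blocking pool would wait instead.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Plain-Java model of a fixed-size object pool; hypothetical, not the UIMA CAS pool. */
public class CasPoolSketch {
    /** Hypothetical pooled object standing in for a CAS. */
    static final class PooledCas {
        private final CasPoolSketch owner;
        String text;

        PooledCas(CasPoolSketch owner) {
            this.owner = owner;
        }

        /** Clears the contents and returns this object to its pool. */
        void release() {
            text = null;
            owner.free.push(this);
        }
    }

    private final Deque<PooledCas> free = new ArrayDeque<>();

    CasPoolSketch(int size) {
        for (int i = 0; i < size; i++) {
            free.push(new PooledCas(this));
        }
    }

    /** Returns null when the pool is empty; a blocking pool would wait here instead. */
    PooledCas tryAcquire() {
        return free.poll();
    }

    /** Hands out one pooled object per segment and returns how many were delivered. */
    static int consume(String[] segments, int poolSize, boolean releaseEach) {
        CasPoolSketch pool = new CasPoolSketch(poolSize);
        int delivered = 0;
        for (String segment : segments) {
            PooledCas cas = pool.tryAcquire();
            if (cas == null) {
                break; // pool exhausted: the previous object was never released
            }
            cas.text = segment;
            delivered++;
            if (releaseEach) {
                cas.release(); // release before requesting the next one
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        String[] segments = {"seg1", "seg2", "seg3"};
        System.out.println(consume(segments, 1, true));  // prints 3
        System.out.println(consume(segments, 1, false)); // prints 1
    }
}
```

With a pool of size one, releasing after each segment lets all three segments through, while skipping the release stops delivery after the first.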
+ +
+ Using a CAS Multiplier to Merge CASes + A CAS Multiplier can also be used to combine smaller CASes together to form larger CASes. In this section we + describe how this works and walk through an example. + +
+ Overview of How to Merge CASes + + + + When the framework first calls the CAS Multiplier's process method, + the CAS Multiplier requests an empty CAS (which we'll call the "merged CAS") and copies relevant data + from the input CAS into the merged CAS. The class + org.apache.uima.util.CasCopier provides utilities for copying Feature + Structures between CASes. + + + + When the framework then calls the CAS Multiplier's hasNext method, the + CAS Multiplier returns false to indicate that it has no output at this + time. + + + + When the framework calls process again with a new input CAS, the CAS + Multiplier copies data from that input CAS into the merged CAS, combining it with the data that was + previously copied. + + + + Eventually, when the CAS Multiplier decides that it wants to output the merged CAS, it returns + true from the hasNext method, and then when the framework + subsequently calls the next method, the CAS Multiplier returns the merged + CAS. + + + There is no explicit call to flush out any pending CASes from a CAS Multiplier when collection processing + completes. It is up to the application to provide some mechanism to let a CAS Multiplier recognize the last CAS + in a collection so that it can ensure that its final output CASes are complete. +
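The steps above can be modeled without any UIMA classes. The following is a minimal sketch, assuming a string buffer stands in for the merged CAS and a boolean flag stands in for whatever application-specific signal marks the last CAS; MergeProtocolSketch, its isLast parameter and the run helper are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the merge protocol; no UIMA classes are used. */
public class MergeProtocolSketch {
    private StringBuilder merged;   // stands in for the "merged CAS"
    private boolean readyToOutput;  // true once the last input has been seen

    /** Models process(): copy data from the input into the merged buffer. */
    public void process(String input, boolean isLast) {
        if (merged == null) {
            merged = new StringBuilder(); // models requesting an empty CAS
        }
        merged.append(input);
        if (isLast) {
            readyToOutput = true;
        }
    }

    /** Models hasNext(): false until the merger decides to emit its output. */
    public boolean hasNext() {
        return readyToOutput;
    }

    /** Models next(): hand over the merged result and reset for the next group. */
    public String next() {
        if (!readyToOutput) {
            throw new IllegalStateException("No merged output yet");
        }
        String out = merged.toString();
        merged = null;
        readyToOutput = false;
        return out;
    }

    /** Models the framework loop: process each input, then drain any output. */
    public static List<String> run(String[] inputs, boolean[] lastFlags) {
        MergeProtocolSketch merger = new MergeProtocolSketch();
        List<String> outputs = new ArrayList<>();
        for (int i = 0; i < inputs.length; i++) {
            merger.process(inputs[i], lastFlags[i]);
            while (merger.hasNext()) {
                outputs.add(merger.next());
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> out = run(new String[] {"ab", "cd", "ef", "gh"},
                               new boolean[] {false, true, false, true});
        System.out.println(out); // prints [abcd, efgh]
    }
}
```

Note that if the final input never carries the last-segment signal, the model never emits its buffer, which mirrors the point that there is no explicit flush call.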
+
+ Example CAS Merger + An example CAS Multiplier that merges CASes is provided in the UIMA SDK. The Java class for + this example is org.apache.uima.examples.casMultiplier.SimpleTextMerger and + the source code is located under the examples/src directory. +
+ Process Method + Almost all of the code for this example is in the process method. The first part of + the process method shows how to copy Feature Structures from the input CAS to the + "merged CAS": + + + + public void process(JCas aJCas) throws AnalysisEngineProcessException { + // procure a new CAS if we don't have one already + if (mMergedCas == null) { + mMergedCas = getEmptyJCas(); + } + + // append document text + String docText = aJCas.getDocumentText(); + int prevDocLen = mDocBuf.length(); + mDocBuf.append(docText); + + // copy specified annotation types + CasCopier copier = new CasCopier(mMergedCas.getCas()); + Set copiedIndexedFs = new HashSet(); // needed in case one annotation is in two indexes (could + // happen if specified annotation types overlap) + for (int i = 0; i < mAnnotationTypesToCopy.length; i++) { + Type type = mMergedCas.getTypeSystem().getType(mAnnotationTypesToCopy[i]); + FSIndex index = aJCas.getCas().getAnnotationIndex(type); + Iterator iter = index.iterator(); + while (iter.hasNext()) { + FeatureStructure fs = (FeatureStructure) iter.next(); + if (!copiedIndexedFs.contains(fs)) { + Annotation copyOfFs = (Annotation) copier.copyFs(fs); + // update begin and end + copyOfFs.setBegin(copyOfFs.getBegin() + prevDocLen); + copyOfFs.setEnd(copyOfFs.getEnd() + prevDocLen); + mMergedCas.addFsToIndexes(copyOfFs); + copiedIndexedFs.add(fs); + } + } + } + + + The CasCopier class is used to copy Feature Structures of certain types + (specified by a configuration parameter) to the merged CAS. The CasCopier does deep + copies, meaning that if the copied FeatureStructure references another FeatureStructure, the + referenced FeatureStructure will also be copied. + + This example also merges the document text using a separate StringBuffer. Note + that we cannot append document text to the Sofa data of the merged CAS because Sofa data cannot be modified + once it is set. 
+ + The remainder of the process method determines whether it is time to output a new + CAS. For this example, we are attempting to merge all CASes that are segments of one original artifact. This + is done by checking the + SourceDocumentInformation Feature Structure in the CAS to see if its + lastSegment feature is set to true. That feature (which is set by the + example + SimpleTextSegmenter discussed previously) marks the CAS as being the last segment of an + artifact, so when the CAS Multiplier sees this segment it knows it is time to produce an output CAS. + + + + // get the SourceDocumentInformation FS, which indicates the sourceURI of the document + // and whether the incoming CAS is the last segment + FSIterator it = aJCas.getJFSIndexRepository() + .getAnnotationIndex(SourceDocumentInformation.type).iterator(); + if (!it.hasNext()) { + throw new RuntimeException("Missing SourceDocumentInformation"); + } + SourceDocumentInformation sourceDocInfo = (SourceDocumentInformation) it.next(); + if (sourceDocInfo.getLastSegment()) { + // time to produce an output CAS + // set the document text + mMergedCas.setDocumentText(mDocBuf.toString()); + + // add source document info to destination CAS + SourceDocumentInformation destSDI = new SourceDocumentInformation(mMergedCas); + destSDI.setUri(sourceDocInfo.getUri()); + destSDI.setOffsetInSource(0); + destSDI.setLastSegment(true); + destSDI.addToIndexes(); + + mDocBuf = new StringBuffer(); + mReadyToOutput = true; + } + } + + + When it is time to produce an output CAS, the CAS Multiplier makes final updates to the merged CAS + (setting the document text and adding a SourceDocumentInformation + FeatureStructure), and then sets the mReadyToOutput field to true. This field is + then used in the hasNext and next methods. +
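The offset adjustment in the first part of the process method (adding prevDocLen to each copied annotation's begin and end) may be easier to follow with concrete numbers. The sketch below is plain Java with no UIMA classes: Span is a hypothetical stand-in for an annotation, and appendSegment is an illustrative helper, not part of the example's source.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of shifting segment-relative offsets into merged-document offsets. */
public class OffsetShiftSketch {
    /** Hypothetical stand-in for an annotation: a [begin, end) character span. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Appends one segment to the buffer and returns its spans shifted by the previous length. */
    static List<Span> appendSegment(StringBuilder docBuf, String segmentText, List<Span> segmentSpans) {
        int prevDocLen = docBuf.length(); // length of the merged text before this segment
        docBuf.append(segmentText);
        List<Span> shifted = new ArrayList<>();
        for (Span s : segmentSpans) {
            shifted.add(new Span(s.begin + prevDocLen, s.end + prevDocLen));
        }
        return shifted;
    }

    public static void main(String[] args) {
        StringBuilder docBuf = new StringBuilder();
        List<Span> merged = new ArrayList<>();

        List<Span> first = new ArrayList<>();
        first.add(new Span(0, 5)); // "Alice" within the first segment
        merged.addAll(appendSegment(docBuf, "Alice met ", first));

        List<Span> second = new ArrayList<>();
        second.add(new Span(0, 3)); // "Bob" within the second segment
        merged.addAll(appendSegment(docBuf, "Bob.", second));

        for (Span s : merged) {
            System.out.println(docBuf.substring(s.begin, s.end) + " [" + s.begin + "," + s.end + ")");
        }
        // prints:
        // Alice [0,5)
        // Bob [10,13)
    }
}
```

The second span starts at 0 in its own segment but at 10 in the merged text, because the first segment contributed ten characters.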
+
+ HasNext and Next Methods + These methods are relatively simple: + + + + public boolean hasNext() throws AnalysisEngineProcessException { + return mReadyToOutput; + } + + public AbstractCas next() throws AnalysisEngineProcessException { + if (!mReadyToOutput) { + throw new RuntimeException("No next CAS"); + } + JCas casToReturn = mMergedCas; + mMergedCas = null; + mReadyToOutput = false; + return casToReturn; + } + + When the merged CAS is ready to be output, hasNext will return true, and + next will return the merged CAS, taking care to set the + mMergedCas field to + null so that the next call to + process will start with a fresh CAS. +
+
+
+ Using the SimpleTextMerger in an Aggregate Analysis Engine + An example descriptor for an Aggregate Analysis Engine that uses the + SimpleTextMerger is provided in + examples/descriptors/cas_multiplier/Segment_Annotate_Merge_AE.xml. This + Aggregate first runs the SimpleTextSegmenter example to break a large document into + segments. It then runs each segment through the example tokenizer and name recognizer annotators. Finally + it runs the SimpleTextMerger to reassemble the segments back into one CAS. The + Name annotations are copied to the final merged CAS but the Token + annotations are not. + This example illustrates how you can break large artifacts into pieces for more efficient processing + and then reassemble a single output CAS containing only the results most useful to the application. + Intermediate results such as tokens, which may consume a lot of space, need not be retained over the entire + input artifact. + + The intermediate segments are dropped and are never output from the Aggregate Analysis Engine. This + is done by configuring the Fixed Flow Controller as described in + , above. + + Try running this Analysis Engine in the Document Analyzer tool with a large text file as input, to see that + it outputs just one CAS per input file, and that the final CAS contains only the Name annotations. +