CAS Multiplier Developer's Guide

Example Code - This section walks through the source code of an example CAS Multiplier that - breaks text documents into smaller pieces. The Java class for the example is - org.apache.uima.examples.casMultiplier.SimpleTextSegmenter - and the source code is included in the UIMA SDK under the - examples/src directory. + This section walks through the source code of an example CAS Multiplier that breaks text documents into + smaller pieces. The Java class for the example is + org.apache.uima.examples.casMultiplier.SimpleTextSegmenter and the source + code is included in the UIMA SDK under the examples/src directory. -

Overall Structure +

+ Overall Structure public class SimpleTextSegmenter extends JCasMultiplier_ImplBase { @@ -157,13 +150,14 @@ The SimpleTextSegmenter class extends JCasMultiplier_ImplBase and implements the optional - initialize method as well as the required - process, hasNext, and - next methods. Each method is described below. + initialize method as well as the required process, + hasNext, and next methods. Each method is described + below.

Initialize Method +

+ Initialize Method public void initialize(UimaContext aContext) throws @@ -173,15 +167,15 @@ "segmentSize")).intValue(); } - Like an Annotator, a CAS Multiplier can override the initialize method and - read configuration parameter values from the UimaContext. The - SimpleTextSegmenter defines one parameter, Segment Size, - which determines the approximate size (in characters) of each segment that it will + Like an Annotator, a CAS Multiplier can override the initialize method and read configuration + parameter values from the UimaContext. The SimpleTextSegmenter defines one parameter, Segment + Size, which determines the approximate size (in characters) of each segment that it will produce.

Process Method +

+ Process Method public void process(JCas aJCas) @@ -201,93 +195,90 @@ } } - The process method receives a new JCas to be processed (segmented) by this CAS - Multiplier. The SimpleTextSegmenter extracts some information from this JCas - and stores it in fields (the document text is stored in the field mDoc and the source - URI in the field mDocUri). Recall that the CAS Multiplier is considered to - own the JCas from the time when process is called until the time - when hasNext returns false. Therefore it is acceptable to retain references to - objects from the JCas in a CAS Multiplier, whereas this should never be done in an - Annotator. The CAS Multiplier could have chosen to store a reference to the JCas - itself, but that was not necessary for this example. - - The CAS Multiplier also initializes the mPos variable to 0. This variable is a - position into the document text and will be incremented as each new segment is - produced. + The process method receives a new JCas to be processed (segmented) by this CAS Multiplier. The + SimpleTextSegmenter extracts some information from this JCas and stores it in fields (the document text + is stored in the field mDoc and the source URI in the field mDocUri). Recall that the CAS Multiplier is + considered to own the JCas from the time when process is called until the time when hasNext + returns false. Therefore it is acceptable to retain references to objects from the JCas in a CAS + Multiplier, whereas this should never be done in an Annotator. The CAS Multiplier could have chosen to + store a reference to the JCas itself, but that was not necessary for this example. + + The CAS Multiplier also initializes the mPos variable to 0. This variable is a position into the + document text and will be incremented as each new segment is produced.

HasNext Method +

+ HasNext Method public boolean hasNext() throws AnalysisEngineProcessException { return mPos < mDoc.length(); } - The job of the hasNext method is to report whether there are any additional - output CASes to produce. For this example, the CAS Multiplier will break the entire - input document into segments, so we know there will always be a next segment until - the very end of the document has been reached. + The job of the hasNext method is to report whether there are any additional output CASes to produce. For + this example, the CAS Multiplier will break the entire input document into segments, so we know there will + always be a next segment until the very end of the document has been reached.

Next Method +

+ Next Method - public AbstractCas next() throws AnalysisEngineProcessException { - int breakAt = mPos + mSegmentSize; - if (breakAt > mDoc.length()) - breakAt = mDoc.length(); - - // Search for the next newline character. Note: this example - // segmenter implementation assumes that the document contains many - // newlines. In the worst case, if this segmenter is run on a - // document with no newlines, it will produce only one segment - // containing the entire document text. A better implementation - // might specify a maximum segment size as well as a minimum. - - while (breakAt < mDoc.length() && mDoc.charAt(breakAt-1) != 'n') - breakAt++; - - JCas jcas = getEmptyJCas(); - try { - jcas.setDocumentText(mDoc.substring(mPos, breakAt)); - //if original CAS had SourceDocumentInformation, - //also add SourceDocumentInformation to each segment - if (mDocUri != null) { - SourceDocumentInformation sdi = new SourceDocumentInformation(jcas); - sdi.setUri(mDocUri); - sdi.setOffsetInSource(mPos); - sdi.setDocumentSize(breakAt - mPos); - sdi.addToIndexes(); + public AbstractCas next() throws AnalysisEngineProcessException { + int breakAt = mPos + mSegmentSize; + if (breakAt > mDoc.length()) + breakAt = mDoc.length(); + // search for the next newline character. Note: this example segmenter implementation + // assumes that the document contains many newlines. In the worst case, if this segmenter + // is run on a document with no newlines, it will produce only one segment containing the + // entire document text. A better implementation might specify a maximum segment size as + // well as a minimum. 
+ while (breakAt < mDoc.length() && mDoc.charAt(breakAt - 1) != '\n') + breakAt++; + + JCas jcas = getEmptyJCas(); + try { + jcas.setDocumentText(mDoc.substring(mPos, breakAt)); + // if original CAS had SourceDocumentInformation, also add SourceDocumentInformation + // to each segment + if (mDocUri != null) { + SourceDocumentInformation sdi = new SourceDocumentInformation(jcas); + sdi.setUri(mDocUri); + sdi.setOffsetInSource(mPos); + sdi.setDocumentSize(breakAt - mPos); + sdi.addToIndexes(); + + if (breakAt == mDoc.length()) { + sdi.setLastSegment(true); + } + } + + mPos = breakAt; + return jcas; + } catch (Exception e) { + jcas.release(); + throw new AnalysisEngineProcessException(e); } - - mPos = breakAt; - return jcas; - } - catch(Exception e) { - jcas.release(); - throw new AnalysisEngineProcessException(e); - } -} + } - The next method actually produces the next segment and - returns it. The framework guarantees that it will not call - next unless hasNext has returned true - since the last call to process or next - . + The next method actually produces the next segment and returns it. The + framework guarantees that it will not call next unless + hasNext has returned true since the last call to process or + next. - Note that in order to produce a segment, the CAS Multiplier must get an empty - JCas to populate. This is done by the line: + Note that in order to produce a segment, the CAS Multiplier must get an empty JCas to populate. This is + done by the line: JCas jcas = getEmptyJCas(); - This requests an empty JCas from the framework, which maintains a pool of JCas - instances to draw from. + This requests an empty JCas from the framework, which maintains a pool of JCas instances to draw + from. - Also, note the use of the try...catch block to ensure - that a JCas is released back to the pool if an exception occurs. This is very - important to allow a CAS Multiplier to recover from errors. 
+ Also, note the use of the try...catch block to ensure that a JCas is released back + to the pool if an exception occurs. This is very important to allow a CAS Multiplier to recover from + errors.
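The process / hasNext / next cycle described above can be illustrated without any UIMA types. The following is only a minimal plain-Java sketch of the same newline-seeking logic, not the SDK's implementation: the class name SegmenterSketch and the helper method segment are hypothetical, plain strings stand in for JCas instances, and no pooling is modelled. It assumes a segment size of at least 1.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java analogue of the segmenter's process/hasNext/next cycle (no UIMA types). */
public class SegmenterSketch {
    private String doc = "";
    private int pos = 0;
    private final int segmentSize; // assumed to be >= 1

    public SegmenterSketch(int segmentSize) {
        this.segmentSize = segmentSize;
    }

    /** Corresponds to process: remember the input text and reset the position. */
    public void process(String text) {
        doc = text;
        pos = 0;
    }

    /** Corresponds to hasNext: output remains until the end of the text is reached. */
    public boolean hasNext() {
        return pos < doc.length();
    }

    /** Corresponds to next: cut just after the first newline at or beyond the target size. */
    public String next() {
        int breakAt = Math.min(pos + segmentSize, doc.length());
        while (breakAt < doc.length() && doc.charAt(breakAt - 1) != '\n') {
            breakAt++;
        }
        String segment = doc.substring(pos, breakAt);
        pos = breakAt;
        return segment;
    }

    /** Hypothetical convenience helper: drain all segments for one input text. */
    public static List<String> segment(String text, int segmentSize) {
        SegmenterSketch s = new SegmenterSketch(segmentSize);
        s.process(text);
        List<String> out = new ArrayList<>();
        while (s.hasNext()) {
            out.add(s.next());
        }
        return out;
    }
}
```

Calling SegmenterSketch.segment on a text without any newline yields a single segment holding the whole text, which is the worst case that the comment in the example code warns about.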

@@ -296,21 +287,19 @@

Creating the CAS Multiplier Descriptor - There is not a separate type of descriptor for a CAS Multiplier. CAS Multipliers are - considered a type of Analysis Engine, and so their descriptors use the same syntax as any - other Analysis Engine Descriptor. - - The descriptor for the SimpleTextSegmenter is the file - examples/descriptors/cas_multiplier/SimpleTextSegmenter.xml - in the UIMA SDK. - - The Analysis Engine Description, in its Operational Properties - section, now contains a new outputsNewCASes property which takes a - Boolean value. If the Analysis Engine is a CAS Multiplier, this property should be set to - true. + There is not a separate type of descriptor for a CAS Multiplier. CAS Multipliers are considered a type of + Analysis Engine, and so their descriptors use the same syntax as any other Analysis Engine Descriptor. + + The descriptor for the SimpleTextSegmenter is the file + examples/descriptors/cas_multiplier/SimpleTextSegmenter.xml in the + UIMA SDK. + + The Analysis Engine Description, in its Operational Properties section, now contains a + new outputsNewCASes property which takes a Boolean value. If the Analysis Engine is a CAS + Multiplier, this property should be set to true. 
- If you use the CDE, be sure to check the Outputs new CASes box in the - Runtime Information section on the Overview page, as shown here: + If you use the CDE, be sure to check the Outputs new CASes box in the Runtime Information + section on the Overview page, as shown here: @@ -325,38 +314,44 @@ If you edit the Analysis Engine Descriptor by hand, you need to add a - <outputsNewCASes> element to your descriptor as shown - here: + <outputsNewCASes> element to your descriptor as shown here: - <operationalProperties> - <modifiesCas>false</modifiesCas> - <multipleDeploymentAllowed>true</multipleDeploymentAllowed> - <outputsNewCASes>true</outputsNewCASes> + + <operationalProperties> + <modifiesCas>false</modifiesCas> + <multipleDeploymentAllowed>true</multipleDeploymentAllowed> + <outputsNewCASes>true</outputsNewCASes> </operationalProperties> - The modifiesCas operational property refers to the input - CAS, not the new output CASes produced. So our example SimpleTextSegmenter has - modifiesCas set to false since it doesn't modify the input CAS. + + The modifiesCas operational property refers to the input CAS, not the new output CASes + produced. So our example SimpleTextSegmenter has modifiesCas set to false since it doesn't modify the + input CAS.

Using a CAS Multiplier in an Aggregate Analysis Engine - You can include a CAS Multiplier as a component in an Aggregate Analysis Engine. For - example, this allows you to construct an Aggregate Analysis Engine that takes each - input CAS, breaks it up into segments, and runs a series of Annotators on each - segment. + You can include a CAS Multiplier as a component in an Aggregate Analysis Engine. For example, this allows + you to construct an Aggregate Analysis Engine that takes each input CAS, breaks it up into segments, and runs a + series of Annotators on each segment.

Adding the CAS Multiplier to the Aggregate - Since CAS Multipliers are considered a type of Analysis Engine, adding them to an - aggregate works the same way as for other Analysis Engines. Using the CDE, you just - click the Add... button in the Component Engines view and browse to - the Analysis Engine Descriptor of your CAS Multiplier. If editing the aggregate - descriptor directly, just import the Analysis Engine - Descriptor of your CAS Multiplier as usual. + Since CAS Multipliers are considered a type of Analysis Engine, adding them to an aggregate works the same + way as for other Analysis Engines. Using the CDE, you just click the Add... button in the + Component Engines view and browse to the Analysis Engine Descriptor of your CAS Multiplier. If editing the + aggregate descriptor directly, just import the Analysis Engine Descriptor of your + CAS Multiplier as usual. + + An example descriptor for an Aggregate Analysis Engine containing a CAS Multiplier is provided in + examples/descriptors/cas_multiplier/SegmenterAndTokenizerAE.xml. This + Aggregate runs the SimpleTextSegmenter example to break a large document into + segments, and then runs each segment through the SimpleTokenAndSentenceAnnotator. + Try running it in the Document Analyzer tool with a large text file as input, to see that it outputs multiple + output CASes, one for each segment produced by the SimpleTextSegmenter.

@@ -388,15 +383,22 @@ that implements UIMA's default flow defines a configuration parameter ActionAfterCasMultiplier that can take the following values: - continue – the CAS continues on to the next element in the - flow - stop – the CAS will no longer continue in the flow, and will be - returned from the aggregate if possible. - drop – the CAS will no longer continue in the flow, and will be dropped - (not returned from the aggregate) if possible. - dropIfNewCasProduced (the default) – if the CAS multiplier - produced a new CAS as a result of processing this CAS, then this CAS will be dropped. If not, then this CAS will - continue. + + continue – the CAS continues on to the next element in the flow + + + stop – the CAS will no longer continue in the flow, and will be returned + from the aggregate if possible. + + + drop – the CAS will no longer continue in the flow, and will be dropped + (not returned from the aggregate) if possible. + + + dropIfNewCasProduced (the default) – if the CAS multiplier produced + a new CAS as a result of processing this CAS, then this CAS will be dropped. If not, then this CAS will + continue. + You can override this parameter in your Aggregate Analysis Engine the same way you would override a @@ -404,6 +406,7 @@ FixedFlowController implementation by importing its descriptor into your aggregate as follows: + <flowController key="FixedFlowController"> <import name="org.apache.uima.flow.FixedFlowController"/> @@ -411,6 +414,8 @@ The parameter could then be overridden as, for example: + + <configurationParameters> <configurationParameter> @@ -434,8 +439,8 @@ </configurationParameterSettings> - This overriding can also be done using the Component Descriptor Editor tool. - An example of an Analysis Engine that overrides this parameter can be found in + This overriding can also be done using the Component Descriptor Editor tool. 
An example of an Analysis + Engine that overrides this parameter can be found in examples/descriptors/cas_multiplier/Segment_Annotate_Merge_AE.xml. For more information about how to specify a flow controller as part of your Aggregate Analysis Engine descriptor, see . @@ -447,33 +452,28 @@

Aggregate CAS Multipliers +

+ Aggregate CAS Multipliers - An important consideration when you put a CAS Multiplier inside an Aggregate - Analysis Engine is whether you want the Aggregate to also function as a CAS Multiplier - – that is, whether you want the new output CASes produced within the Aggregate - to be output from the Aggregate. This is controlled by the - <outputsNewCASes> element in the Operational - Properties of your Aggregate Analysis Engine descriptor. The syntax is the same as - what was described in - . - - If you set this property to true, then any new output CASes - produced by a CAS Multiplier inside this Aggregate will be output from the Aggregate. - Thus the Aggregate will function as a CAS Multiplier and can be used in any of the ways in - which a primitive CAS Multiplier can be used. - - If you set the <outputsNewCASes> property to false - , then any new output CASes produced by a CAS Multiplier inside the Aggregate will be - dropped (i.e. the CASes will be released back to the pool) once they have finished - being processed. Such an Aggregate Analysis Engine functions just like a - normal non-CAS-Multiplier Analysis Engine; the fact that CAS - Multiplication is occurring inside it is hidden from users of that Analysis - Engine. - If you want to output some new Output CASes and not others, you need to - implement a custom Flow Controller that makes this decision — see . - + An important consideration when you put a CAS Multiplier inside an Aggregate Analysis Engine is whether + you want the Aggregate to also function as a CAS Multiplier + – that is, whether you want the new output CASes produced within the Aggregate to be output from the + Aggregate. This is controlled by the <outputsNewCASes> element in the + Operational Properties of your Aggregate Analysis Engine descriptor. The syntax is the same as what was + described in . 
+ + If you set this property to true, then any new output CASes produced by a CAS + Multiplier inside this Aggregate will be output from the Aggregate. Thus the Aggregate will function as a CAS + Multiplier and can be used in any of the ways in which a primitive CAS Multiplier can be used. + + If you set the <outputsNewCASes> property to false , then any new output + CASes produced by a CAS Multiplier inside the Aggregate will be dropped (i.e. the CASes will be released back + to the pool) once they have finished being processed. Such an Aggregate Analysis Engine functions just like a + normal non-CAS-Multiplier Analysis Engine; the fact that CAS Multiplication is + occurring inside it is hidden from users of that Analysis Engine. + If you want to output some new Output CASes and not others, you need to implement a custom Flow Controller + that makes this decision — see .

@@ -481,46 +481,42 @@

Using a CAS Multiplier in a Collection Processing Engine - It is currently a limitation that a CAS Multiplier cannot be deployed directly in a - Collection Processing Engine. The only way that you can use a CAS Multiplier in a CPE is to - first wrap it in an Aggregate Analysis Engine whose outputsNewCASes - property is set to false, which in effect hides the - existence of the CAS Multiplier from the CPE. - - Note that you can build an Aggregate Analysis Engine that consists of CAS - Multipliers and Annotators, followed by CAS Consumers. This can simulate what a CPE - would do, but without the deployment and error handling options that the CPE - provides. + It is currently a limitation that a CAS Multiplier cannot be deployed directly in a Collection Processing + Engine. The only way that you can use a CAS Multiplier in a CPE is to first wrap it in an Aggregate Analysis Engine + whose outputsNewCASes property is set to false, which in effect + hides the existence of the CAS Multiplier from the CPE. + + Note that you can build an Aggregate Analysis Engine that consists of CAS Multipliers and Annotators, + followed by CAS Consumers. This can simulate what a CPE would do, but without the deployment and error handling + options that the CPE provides.

Calling a CAS Multiplier from an Application - The AnalysisEngine interface has the following methods - that allow you to interact with a CAS Multiplier: - CasIterator - processAndOutputNewCASes(CAS) + The AnalysisEngine interface has the following methods that allow you to interact + with a CAS Multiplier: + + + CasIterator processAndOutputNewCASes(CAS) - JCasIterator - processAndOutputNewCASes(JCas) + + JCasIterator processAndOutputNewCASes(JCas) - From your application, you call processAndOutputNewCASes - and pass it the input CAS. An iterator is returned that allows you to step through each of - the new output CASes that are produced by the Analysis Engine. - - It is very important to realize that CASes are pooled objects and so your - application must release each CAS (by calling the CAS.release() - method) that it obtains from the CasIterator before it calls - the CasIterator.next method again. Otherwise, the CAS pool will - be exhausted and a deadlock will occur. - - The example code in the class - org.apache.uima.examples.casMultiplier. - CasMultiplierExampleApplication illustrates this. Here is the main - processing loop: + From your application, you call processAndOutputNewCASes and pass it the input + CAS. An iterator is returned that allows you to step through each of the new output CASes that are produced by the + Analysis Engine. + + It is very important to realize that CASes are pooled objects and so your application must release each CAS + (by calling the CAS.release() method) that it obtains from the CasIterator + before it calls the CasIterator.next method again. Otherwise, + the CAS pool will be exhausted and a deadlock will occur. + + The example code in the class org.apache.uima.examples.casMultiplier. + CasMultiplierExampleApplication illustrates this. 
Here is the main processing loop: CasIterator casIterator = ae.processAndOutputNewCASes(initialCas); @@ -536,24 +532,209 @@ outCas.release(); Note that as defined by the CAS Multiplier contract in , the CAS Multiplier owns the - input CAS (initialCas in the example) until the last new output - CAS has been produced. This means that the application should not try to make changes to - initialCas until after the - CasIterator.hasNext method has returned false, indicating - that the segmenter has finished. - - Note that the processing time of the Analysis Engine is spread out over the calls to - the CasIterator's hasNext and next - methods. That is, the next output CAS may not actually be produced and annotated until - the application asks for it. So the application should not expect calls to the - CasIterator to necessarily complete quickly. - - Also, calls to the CasIterator may throw Exceptions - indicating an error has occurred during processing. If an Exception is thrown, all - processing of the input CAS will stop, and no more output CASes will be produced. There is - currently no error recovery mechanism that will allow processing to continue after an - exception. + linkend="ugr.tug.cm.cm_interface_overview"/>, the CAS Multiplier owns the input CAS + (initialCas in the example) until the last new output CAS has been produced. This means + that the application should not try to make changes to initialCas until after the + CasIterator.hasNext method has returned false, indicating that the segmenter has + finished. + + Note that the processing time of the Analysis Engine is spread out over the calls to the + CasIterator's hasNext and next methods. That is, the next + output CAS may not actually be produced and annotated until the application asks for it. So the application + should not expect calls to the CasIterator to necessarily complete quickly. + + Also, calls to the CasIterator may throw Exceptions indicating an error has + occurred during processing. 
If an Exception is thrown, all processing of the input CAS will stop, and no more + output CASes will be produced. There is currently no error recovery mechanism that will allow processing to + continue after an exception. +
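The release-before-next rule for pooled objects can be shown in isolation. The sketch below is a plain-Java analogue under stated assumptions, not UIMA code: Buffer, BufferIterator and consume are hypothetical names, the pool is a simple in-memory stack, and an exhausted pool throws an IllegalStateException here instead of blocking, which is a simplification of the deadlock described above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Plain-Java analogue of iterating over pooled output objects (no UIMA types). */
public class PooledIterationSketch {

    /** Hypothetical stand-in for a pooled CAS. */
    public static final class Buffer {
        private final Deque<Buffer> pool;
        public String text;

        Buffer(Deque<Buffer> pool) {
            this.pool = pool;
        }

        /** Clears the content and hands the object back to its pool. */
        public void release() {
            text = null;
            pool.push(this);
        }
    }

    /** Yields one Buffer per input string, drawing each from a fixed-size pool. */
    public static final class BufferIterator implements Iterator<Buffer> {
        private final Deque<Buffer> pool = new ArrayDeque<>();
        private final Iterator<String> source;

        public BufferIterator(List<String> inputs, int poolSize) {
            this.source = inputs.iterator();
            for (int i = 0; i < poolSize; i++) {
                pool.push(new Buffer(pool));
            }
        }

        @Override
        public boolean hasNext() {
            return source.hasNext();
        }

        @Override
        public Buffer next() {
            // Simplification: fail fast where a real pool would block the caller.
            if (pool.isEmpty()) {
                throw new IllegalStateException("pool exhausted");
            }
            Buffer b = pool.pop();
            b.text = source.next();
            return b;
        }
    }

    /** Sums the text lengths, releasing each Buffer before asking for the next one. */
    public static int consume(List<String> inputs, int poolSize) {
        BufferIterator it = new BufferIterator(inputs, poolSize);
        int total = 0;
        while (it.hasNext()) {
            Buffer b = it.next();
            try {
                total += b.text.length();
            } finally {
                b.release(); // always return the object, even if processing fails
            }
        }
        return total;
    }
}
```

With a pool of size 1, consume still works for any number of inputs because each Buffer is released before the next call to next; skipping the release makes the second call to next fail.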

+ +

+ Using a CAS Multiplier to Merge CASes + A CAS Multiplier can also be used to combine smaller CASes together to form larger CASes. In this section we + describe how this works and walk through an example. + +

+ Overview of How to Merge CASes + + + + When the framework first calls the CAS Multiplier's process method, + the CAS Multiplier requests an empty CAS (which we'll call the "merged CAS") and copies relevant data + from the input CAS into the merged CAS. The class + org.apache.uima.util.CasCopier provides utilities for copying Feature + Structures between CASes. + + + + When the framework then calls the CAS Multiplier's hasNext method, the + CAS Multiplier returns false to indicate that it has no output at this + time. + + + + When the framework calls process again with a new input CAS, the CAS + Multiplier copies data from that input CAS into the merged CAS, combining it with the data that was + previously copied. + + + + Eventually, when the CAS Multiplier decides that it wants to output the merged CAS, it returns + true from the hasNext method, and then when the framework + subsequently calls the next method, the CAS Multiplier returns the merged + CAS. + + + There is no explicit call to flush out any pending CASes from a CAS Multiplier when collection processing + completes. It is up to the application to provide some mechanism to let a CAS Multiplier recognize the last CAS + in a collection so that it can ensure that its final output CASes are complete. +

+ Example CAS Merger + An example CAS Multiplier that merges CASes is provided in the UIMA SDK. The Java class for + this example is org.apache.uima.examples.casMultiplier.SimpleTextMerger and + the source code is located under the examples/src directory. +

+ Process Method + Almost all of the code for this example is in the process method. The first part of + the process method shows how to copy Feature Structures from the input CAS to the + "merged CAS": + + + + public void process(JCas aJCas) throws AnalysisEngineProcessException { + // procure a new CAS if we don't have one already + if (mMergedCas == null) { + mMergedCas = getEmptyJCas(); + } + + // append document text + String docText = aJCas.getDocumentText(); + int prevDocLen = mDocBuf.length(); + mDocBuf.append(docText); + + // copy specified annotation types + CasCopier copier = new CasCopier(mMergedCas.getCas()); + Set copiedIndexedFs = new HashSet(); // needed in case one annotation is in two indexes (could + // happen if specified annotation types overlap) + for (int i = 0; i < mAnnotationTypesToCopy.length; i++) { + Type type = mMergedCas.getTypeSystem().getType(mAnnotationTypesToCopy[i]); + FSIndex index = aJCas.getCas().getAnnotationIndex(type); + Iterator iter = index.iterator(); + while (iter.hasNext()) { + FeatureStructure fs = (FeatureStructure) iter.next(); + if (!copiedIndexedFs.contains(fs)) { + Annotation copyOfFs = (Annotation) copier.copyFs(fs); + // update begin and end + copyOfFs.setBegin(copyOfFs.getBegin() + prevDocLen); + copyOfFs.setEnd(copyOfFs.getEnd() + prevDocLen); + mMergedCas.addFsToIndexes(copyOfFs); + copiedIndexedFs.add(fs); + } + } + } + + + The CasCopier class is used to copy Feature Structures of certain types + (specified by a configuration parameter) to the merged CAS. The CasCopier does deep + copies, meaning that if the copied FeatureStructure references another FeatureStructure, the + referenced FeatureStructure will also be copied. + + This example also merges the document text using a separate StringBuffer. Note + that we cannot append document text to the Sofa data of the merged CAS because Sofa data cannot be modified + once it is set. 
+ + The remainder of the process method determines whether it is time to output a new + CAS. For this example, we are attempting to merge all CASes that are segments of one original artifact. This + is done by checking the + SourceDocumentInformation Feature Structure in the CAS to see if its + lastSegment feature is set to true. That feature (which is set by the + example + SimpleTextSegmenter discussed previously) marks the CAS as being the last segment of an + artifact, so when the CAS Multiplier sees this segment it knows it is time to produce an output CAS. + + + + // get the SourceDocumentInformation FS, which indicates the sourceURI of the document + // and whether the incoming CAS is the last segment + FSIterator it = aJCas.getJFSIndexRepository() + .getAnnotationIndex(SourceDocumentInformation.type).iterator(); + if (!it.hasNext()) { + throw new RuntimeException("Missing SourceDocumentInformation"); + } + SourceDocumentInformation sourceDocInfo = (SourceDocumentInformation) it.next(); + if (sourceDocInfo.getLastSegment()) { + // time to produce an output CAS + // set the document text + mMergedCas.setDocumentText(mDocBuf.toString()); + + // add source document info to destination CAS + SourceDocumentInformation destSDI = new SourceDocumentInformation(mMergedCas); + destSDI.setUri(sourceDocInfo.getUri()); + destSDI.setOffsetInSource(0); + destSDI.setLastSegment(true); + destSDI.addToIndexes(); + + mDocBuf = new StringBuffer(); + mReadyToOutput = true; + } + } + + + When it is time to produce an output CAS, the CAS Multiplier makes final updates to the merged CAS + (setting the document text and adding a SourceDocumentInformation + FeatureStructure), and then sets the mReadyToOutput field to true. This field is + then used in the hasNext and next methods. +

+ HasNext and Next Methods + These methods are relatively simple: + + + + public boolean hasNext() throws AnalysisEngineProcessException { + return mReadyToOutput; + } + + public AbstractCas next() throws AnalysisEngineProcessException { + if (!mReadyToOutput) { + throw new RuntimeException("No next CAS"); + } + JCas casToReturn = mMergedCas; + mMergedCas = null; + mReadyToOutput = false; + return casToReturn; + } + + When the merged CAS is ready to be output, hasNext will return true, and + next will return the merged CAS, taking care to set the + mMergedCas field to + null so that the next call to + process will start with a fresh CAS. +
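The merge pattern walked through above (append text, shift annotation offsets by the length of the text already merged, and emit once the last segment arrives) can be sketched in plain Java. This is an illustration under assumptions, not the SimpleTextMerger source: MergerSketch, Span and Merged are hypothetical names, a Span stands in for an annotation, and the last-segment flag is passed as a parameter rather than read from a SourceDocumentInformation Feature Structure.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java analogue of the merger's process/hasNext/next cycle (no UIMA types). */
public class MergerSketch {

    /** Hypothetical stand-in for an annotation: a labelled character span. */
    public static final class Span {
        public final String label;
        public final int begin;
        public final int end;

        public Span(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Hypothetical stand-in for the merged CAS: full text plus shifted spans. */
    public static final class Merged {
        public final String text;
        public final List<Span> spans;

        Merged(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    private StringBuilder docBuf = new StringBuilder();
    private List<Span> mergedSpans = new ArrayList<>();
    private Merged ready = null;

    /** Appends one segment; offsets are shifted by the length of the text merged so far. */
    public void process(String segmentText, List<Span> segmentSpans, boolean lastSegment) {
        int prevDocLen = docBuf.length();
        docBuf.append(segmentText);
        for (Span s : segmentSpans) {
            mergedSpans.add(new Span(s.label, s.begin + prevDocLen, s.end + prevDocLen));
        }
        if (lastSegment) {
            // The text is fixed only now, mirroring the set-once document text of the merged CAS.
            ready = new Merged(docBuf.toString(), mergedSpans);
            docBuf = new StringBuilder();
            mergedSpans = new ArrayList<>();
        }
    }

    /** True only after the last segment of an artifact has been processed. */
    public boolean hasNext() {
        return ready != null;
    }

    /** Hands out the merged result and resets so the next artifact starts fresh. */
    public Merged next() {
        if (ready == null) {
            throw new IllegalStateException("No next output");
        }
        Merged out = ready;
        ready = null;
        return out;
    }
}
```

Feeding two segments, "ab\n" and "cd", each with a span covering its first two characters, yields one merged text "ab\ncd" in which the second span runs from offset 3 to 5.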

+ Using the SimpleTextMerger in an Aggregate Analysis Engine + An example descriptor for an Aggregate Analysis Engine that uses the + SimpleTextMerger is provided in + examples/descriptors/cas_multiplier/Segment_Annotate_Merge_AE.xml. This + Aggregate first runs the SimpleTextSegmenter example to break a large document into + segments. It then runs each segment through the example tokenizer and name recognizer annotators. Finally + it runs the SimpleTextMerger to reassemble the segments back into one CAS. The + Name annotations are copied to the final merged CAS but the Token + annotations are not. + This example illustrates how you can break large artifacts into pieces for more efficient processing + and then reassemble a single output CAS containing only the results most useful to the application. + Intermediate results such as tokens, which may consume a lot of space, need not be retained over the entire + input artifact. + + The intermediate segments are dropped and are never output from the Aggregate Analysis Engine. This + is done by configuring the Fixed Flow Controller as described in + , above. + + Try running this Analysis Engine in the Document Analyzer tool with a large text file as input, to see that + it outputs just one CAS per input file, and that the final CAS contains only the Name annotations. +