apex-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From sas...@apache.org
Subject [2/3] incubator-apex-site git commit: Updating apex-3.3 docs with latest mkdocs processor which fixes search results bug
Date Thu, 10 Mar 2016 01:53:48 GMT
http://git-wip-us.apache.org/repos/asf/incubator-apex-site/blob/1ae98d4a/docs/apex-3.3/mkdocs/search_index.json
----------------------------------------------------------------------
diff --git a/docs/apex-3.3/mkdocs/search_index.json b/docs/apex-3.3/mkdocs/search_index.json
index 85c9b89..34dd42f 100644
--- a/docs/apex-3.3/mkdocs/search_index.json
+++ b/docs/apex-3.3/mkdocs/search_index.json
@@ -72,10 +72,20 @@
         }, 
         {
             "location": "/application_development/#hadoop-cluster", 
-            "text": "In this section we discuss various Hadoop cluster setups.  Single Node Cluster  In a single node Hadoop cluster all services are deployed on a\nsingle server (a developer can use his/her development machine as a\nsingle node cluster). The platform does not distinguish between a single\nor multi-node setup and behaves exactly the same in both cases.  In this mode, the resource manager, name node, data node, and node\nmanager occupy one process each. This is an example of running a\nstreaming application as a multi-process\u00a0application on the same server.\nWith prevalence of fast, multi-core systems, this mode is effective for\ndebugging, fine tuning, and generic analysis before submitting the job\nto a larger Hadoop cluster. In this mode, execution uses the Hadoop\nservices and hence is likely to identify issues that are related to the\nHadoop environment (such issues will not be uncovered in local mode).\nThe throughput will obviously not be as high as on a 
 multi-node Hadoop\ncluster. Additionally, since each container (i.e. Java process) requires\na significant amount of memory, you will be able to run a much smaller\nnumber of containers than on a multi-node cluster.  Multi-Node Cluster  In a multi-node Hadoop cluster all the services of Hadoop are\ntypically distributed across multiple nodes in a production or\nproduction-level test environment. Upon launch the application is\nsubmitted to the Hadoop cluster and executes as a  multi-processapplication on\u00a0multiple nodes.  Before you start deploying, testing and troubleshooting your\napplication on a cluster, you should ensure that Hadoop (version 2.2.0\nor later)\u00a0is properly installed and\nyou have basic skills for working with it.", 
+            "text": "In this section we discuss various Hadoop cluster setups.", 
             "title": "Hadoop Cluster"
         }, 
         {
+            "location": "/application_development/#single-node-cluster", 
+            "text": "In a single node Hadoop cluster all services are deployed on a\nsingle server (a developer can use his/her development machine as a\nsingle node cluster). The platform does not distinguish between a single\nor multi-node setup and behaves exactly the same in both cases.  In this mode, the resource manager, name node, data node, and node\nmanager occupy one process each. This is an example of running a\nstreaming application as a multi-process\u00a0application on the same server.\nWith prevalence of fast, multi-core systems, this mode is effective for\ndebugging, fine tuning, and generic analysis before submitting the job\nto a larger Hadoop cluster. In this mode, execution uses the Hadoop\nservices and hence is likely to identify issues that are related to the\nHadoop environment (such issues will not be uncovered in local mode).\nThe throughput will obviously not be as high as on a multi-node Hadoop\ncluster. Additionally, since each container (i.e. Java proces
 s) requires\na significant amount of memory, you will be able to run a much smaller\nnumber of containers than on a multi-node cluster.", 
+            "title": "Single Node Cluster"
+        }, 
+        {
+            "location": "/application_development/#multi-node-cluster", 
+            "text": "In a multi-node Hadoop cluster all the services of Hadoop are\ntypically distributed across multiple nodes in a production or\nproduction-level test environment. Upon launch the application is\nsubmitted to the Hadoop cluster and executes as a  multi-processapplication on\u00a0multiple nodes.  Before you start deploying, testing and troubleshooting your\napplication on a cluster, you should ensure that Hadoop (version 2.2.0\nor later)\u00a0is properly installed and\nyou have basic skills for working with it.", 
+            "title": "Multi-Node Cluster"
+        }, 
+        {
             "location": "/application_development/#apache-apex-platform-overview", 
             "text": "", 
             "title": "Apache Apex Platform Overview"
@@ -92,40 +102,190 @@
         }, 
         {
             "location": "/application_development/#hadoop-components", 
-            "text": "In this section we cover some aspects of Hadoop that your\nstreaming application interacts with. This section is not meant to\neducate the reader on Hadoop, but just get the reader acquainted with\nthe terms. We strongly advise readers to learn Hadoop from other\nsources.  A streaming application runs as a native Hadoop 2.2 application.\nHadoop 2.2 does not differentiate between a map-reduce job and other\napplications, and hence as far as Hadoop is concerned, the streaming\napplication is just another job. This means that your application\nleverages all the bells and whistles Hadoop provides and is fully\nsupported within Hadoop technology stack. The platform is responsible\nfor properly integrating itself with the relevant components of Hadoop\nthat exist today and those that may emerge in the future  All investments that leverage multi-tenancy (for example quotas\nand queues), security (for example kerberos), data flow integration (for\nexample copying data i
 n-out of HDFS), monitoring, metrics collections,\netc. will require no changes when streaming applications run on\nHadoop.  YARN  YARN is\nthe core library of Hadoop 2.2 that is tasked with resource management\nand works as a distributed application framework. In this section we\nwill walk through Yarn's components. In Hadoop 2.2, the old jobTracker\nhas been replaced by a combination of ResourceManager (RM) and\nApplicationMaster (AM).  Resource Manager (RM)  ResourceManager (RM)\nmanages all the distributed resources. It allocates and arbitrates all\nthe slots and the resources (cpu, memory, network) of these slots. It\nworks with per-node NodeManagers (NMs) and per-application\nApplicationMasters (AMs). Currently memory usage is monitored by RM; in\nupcoming releases it will have CPU as well as network management. RM is\nshared by map-reduce and streaming applications. Running streaming\napplications requires no changes in the RM.  Application Master (AM)  The AM is the watchdog 
 or monitoring process for your application\nand has the responsibility of negotiating resources with RM and\ninteracting with NodeManagers to get the allocated containers started.\nThe AM is the starting point of your application and is considered user\ncode (not system Hadoop code). The AM itself runs in one container. All\nresource management within the application are managed by the AM. This\nis a critical feature for Hadoop 2.2 where tasks done by jobTracker in\nHadoop 1.0 have been distributed allowing Hadoop 2.2 to scale much\nbeyond Hadoop 1.0. STRAM is a native YARN ApplicationManager.  Node Managers (NM)  There is one  NodeManager (NM)\nper node in the cluster. All the containers (i.e. processes) on that\nnode are monitored by the NM. It takes instructions from RM and manages\nresources of that node as per RM instructions. NMs interactions are same\nfor map-reduce and for streaming applications. Running streaming\napplications requires no changes in the NM.  RPC Protocol  C
 ommunication among RM, AM, and NM is done via the Hadoop RPC\nprotocol. Streaming applications use the same protocol to send their\ndata. No changes are needed in RPC support provided by Hadoop to enable\ncommunication done by components of your application.  HDFS  Hadoop includes a highly fault tolerant, high throughput\ndistributed file system ( HDFS ).\nIt runs on commodity hardware, and your streaming application will, by\ndefault, use it. There is no difference between files created by a\nstreaming application and those created by map-reduce.", 
+            "text": "In this section we cover some aspects of Hadoop that your\nstreaming application interacts with. This section is not meant to\neducate the reader on Hadoop, but just get the reader acquainted with\nthe terms. We strongly advise readers to learn Hadoop from other\nsources.  A streaming application runs as a native Hadoop 2.2 application.\nHadoop 2.2 does not differentiate between a map-reduce job and other\napplications, and hence as far as Hadoop is concerned, the streaming\napplication is just another job. This means that your application\nleverages all the bells and whistles Hadoop provides and is fully\nsupported within Hadoop technology stack. The platform is responsible\nfor properly integrating itself with the relevant components of Hadoop\nthat exist today and those that may emerge in the future  All investments that leverage multi-tenancy (for example quotas\nand queues), security (for example kerberos), data flow integration (for\nexample copying data i
 n-out of HDFS), monitoring, metrics collections,\netc. will require no changes when streaming applications run on\nHadoop.", 
             "title": "Hadoop Components"
         }, 
         {
+            "location": "/application_development/#yarn", 
+            "text": "YARN is\nthe core library of Hadoop 2.2 that is tasked with resource management\nand works as a distributed application framework. In this section we\nwill walk through Yarn's components. In Hadoop 2.2, the old jobTracker\nhas been replaced by a combination of ResourceManager (RM) and\nApplicationMaster (AM).", 
+            "title": "YARN"
+        }, 
+        {
+            "location": "/application_development/#resource-manager-rm", 
+            "text": "ResourceManager (RM)\nmanages all the distributed resources. It allocates and arbitrates all\nthe slots and the resources (cpu, memory, network) of these slots. It\nworks with per-node NodeManagers (NMs) and per-application\nApplicationMasters (AMs). Currently memory usage is monitored by RM; in\nupcoming releases it will have CPU as well as network management. RM is\nshared by map-reduce and streaming applications. Running streaming\napplications requires no changes in the RM.", 
+            "title": "Resource Manager (RM)"
+        }, 
+        {
+            "location": "/application_development/#application-master-am", 
+            "text": "The AM is the watchdog or monitoring process for your application\nand has the responsibility of negotiating resources with RM and\ninteracting with NodeManagers to get the allocated containers started.\nThe AM is the starting point of your application and is considered user\ncode (not system Hadoop code). The AM itself runs in one container. All\nresource management within the application are managed by the AM. This\nis a critical feature for Hadoop 2.2 where tasks done by jobTracker in\nHadoop 1.0 have been distributed allowing Hadoop 2.2 to scale much\nbeyond Hadoop 1.0. STRAM is a native YARN ApplicationManager.", 
+            "title": "Application Master (AM)"
+        }, 
+        {
+            "location": "/application_development/#node-managers-nm", 
+            "text": "There is one  NodeManager (NM)\nper node in the cluster. All the containers (i.e. processes) on that\nnode are monitored by the NM. It takes instructions from RM and manages\nresources of that node as per RM instructions. NMs interactions are same\nfor map-reduce and for streaming applications. Running streaming\napplications requires no changes in the NM.", 
+            "title": "Node Managers (NM)"
+        }, 
+        {
+            "location": "/application_development/#rpc-protocol", 
+            "text": "Communication among RM, AM, and NM is done via the Hadoop RPC\nprotocol. Streaming applications use the same protocol to send their\ndata. No changes are needed in RPC support provided by Hadoop to enable\ncommunication done by components of your application.", 
+            "title": "RPC Protocol"
+        }, 
+        {
+            "location": "/application_development/#hdfs", 
+            "text": "Hadoop includes a highly fault tolerant, high throughput\ndistributed file system ( HDFS ).\nIt runs on commodity hardware, and your streaming application will, by\ndefault, use it. There is no difference between files created by a\nstreaming application and those created by map-reduce.", 
+            "title": "HDFS"
+        }, 
+        {
             "location": "/application_development/#developing-an-application", 
             "text": "In this chapter we describe the methodology to develop an\napplication using the Realtime Streaming Platform. The platform was\ndesigned to make it easy to build and launch sophisticated streaming\napplications with the developer having to deal only with the\napplication/business logic. The platform deals with details of where to\nrun what operators on which servers and how to correctly route streams\nof data among them.", 
             "title": "Developing An Application"
         }, 
         {
             "location": "/application_development/#development-process", 
-            "text": "While the platform does not mandate a specific methodology or set\nof development tools, we have recommendations to maximize productivity\nfor the different phases of application development.  Design   Identify common, reusable operators. Use a library\n    if possible.  Identify scalability and performance requirements before\n    designing the DAG.  Leverage attributes that the platform supports for scalability\n    and performance.  Use operators that are benchmarked and tested so that later\n    surprises are minimized. If you have glue code, create appropriate\n    unit tests for it.  Use THREAD_LOCAL locality for high throughput streams. If all\n    the operators on that stream cannot fit in one container,\n    try\u00a0NODE_LOCAL\u00a0locality. Both THREAD_LOCAL and\n    NODE_LOCAL streams avoid the Network Interface Card (NIC)\n    completly. The former uses intra-process communication to also avoid\n    serialization-deserialization overhead.  The overa
 ll throughput and latencies are are not necessarily\n    correlated to the number of operators in a simple way -- the\n    relationship is more nuanced. A lot depends on how much work\n    individual operators are doing, how many are able to operate in\n    parallel, and how much data is flowing through the arcs of the DAG.\n    It is, at times, better to break a computation down into its\n    constituent simple parts and then stitch them together via streams\n    to better utilize the compute resources of the cluster. Decide on a\n    per application basis the fine line between complexity of each\n    operator vs too many streams. Doing multiple computations in one\n    operator does save network I/O, while operators that are too complex\n    are hard to maintain.  Do not use operators that depend on the order of two streams\n    as far as possible. In such cases behavior is not idempotent.  Persist key information to HDFS if possible; it may be useful\n    for debugging later.  De
 cide on an appropriate fault tolerance mechanism. If some\n    data loss is acceptable, use the at-most-once mechanism as it has\n    fastest recovery.   Creating New Project  Please refer to the  Apex Application Packages \u00a0for\nthe basic steps for creating a new project.  Writing the application code  Preferably use an IDE (Eclipse, Netbeans etc.) that allows you to\nmanage dependencies and assists with the Java coding. Specific benefits\ninclude ease of managing operator library jar files, individual operator\nclasses, ports and properties. It will also highlight and assist to\nrectify issues such as type mismatches when adding streams while\ntyping.  Testing  Write test cases with JUnit or similar test framework so that code\nis tested as it is written. For such testing, the DAG can run in local\nmode within the IDE. Doing this may involve writing mock input or output\noperators for the integration points with external systems. For example,\ninstead of reading from a live da
 ta stream, the application in test mode\ncan read from and write to files. This can be done with a single\napplication DAG by instrumenting a test mode using settings in the\nconfiguration that is passed to the application factory\ninterface.  Good test coverage will not only eliminate basic validation errors\nsuch as missing port connections or property constraint violations, but\nalso validate the correct processing of the data. The same tests can be\nre-run whenever the application or its dependencies change (operator\nlibraries, version of the platform etc.)  Running an application  The platform provides a commandline tool called dtcli\u00a0for managing applications (launching,\nkilling, viewing, etc.). This tool was already discussed above briefly\nin the section entitled Running the Test Application. It will introspect\nthe jar file specified with the launch command for applications (classes\nthat implement ApplicationFactory) or property files that define\napplications. It wi
 ll also deploy the dependency jar files from the\napplication package to the cluster.  Dtcli can run the application in local mode (i.e. outside a\ncluster). It is recommended to first run the application in local mode\nin the development environment before launching on the Hadoop cluster.\nThis way some of the external system integration and correct\nfunctionality of the application can be verified in an easier to debug\nenvironment before testing distributed mode.  For more details on CLI please refer to the  dtCli Guide .", 
+            "text": "While the platform does not mandate a specific methodology or set\nof development tools, we have recommendations to maximize productivity\nfor the different phases of application development.", 
             "title": "Development Process"
         }, 
         {
+            "location": "/application_development/#design", 
+            "text": "Identify common, reusable operators. Use a library\n    if possible.  Identify scalability and performance requirements before\n    designing the DAG.  Leverage attributes that the platform supports for scalability\n    and performance.  Use operators that are benchmarked and tested so that later\n    surprises are minimized. If you have glue code, create appropriate\n    unit tests for it.  Use THREAD_LOCAL locality for high throughput streams. If all\n    the operators on that stream cannot fit in one container,\n    try\u00a0NODE_LOCAL\u00a0locality. Both THREAD_LOCAL and\n    NODE_LOCAL streams avoid the Network Interface Card (NIC)\n    completly. The former uses intra-process communication to also avoid\n    serialization-deserialization overhead.  The overall throughput and latencies are are not necessarily\n    correlated to the number of operators in a simple way -- the\n    relationship is more nuanced. A lot depends on how much work\n    individual op
 erators are doing, how many are able to operate in\n    parallel, and how much data is flowing through the arcs of the DAG.\n    It is, at times, better to break a computation down into its\n    constituent simple parts and then stitch them together via streams\n    to better utilize the compute resources of the cluster. Decide on a\n    per application basis the fine line between complexity of each\n    operator vs too many streams. Doing multiple computations in one\n    operator does save network I/O, while operators that are too complex\n    are hard to maintain.  Do not use operators that depend on the order of two streams\n    as far as possible. In such cases behavior is not idempotent.  Persist key information to HDFS if possible; it may be useful\n    for debugging later.  Decide on an appropriate fault tolerance mechanism. If some\n    data loss is acceptable, use the at-most-once mechanism as it has\n    fastest recovery.", 
+            "title": "Design"
+        }, 
+        {
+            "location": "/application_development/#creating-new-project", 
+            "text": "Please refer to the  Apex Application Packages \u00a0for\nthe basic steps for creating a new project.", 
+            "title": "Creating New Project"
+        }, 
+        {
+            "location": "/application_development/#writing-the-application-code", 
+            "text": "Preferably use an IDE (Eclipse, Netbeans etc.) that allows you to\nmanage dependencies and assists with the Java coding. Specific benefits\ninclude ease of managing operator library jar files, individual operator\nclasses, ports and properties. It will also highlight and assist to\nrectify issues such as type mismatches when adding streams while\ntyping.", 
+            "title": "Writing the application code"
+        }, 
+        {
+            "location": "/application_development/#testing", 
+            "text": "Write test cases with JUnit or similar test framework so that code\nis tested as it is written. For such testing, the DAG can run in local\nmode within the IDE. Doing this may involve writing mock input or output\noperators for the integration points with external systems. For example,\ninstead of reading from a live data stream, the application in test mode\ncan read from and write to files. This can be done with a single\napplication DAG by instrumenting a test mode using settings in the\nconfiguration that is passed to the application factory\ninterface.  Good test coverage will not only eliminate basic validation errors\nsuch as missing port connections or property constraint violations, but\nalso validate the correct processing of the data. The same tests can be\nre-run whenever the application or its dependencies change (operator\nlibraries, version of the platform etc.)", 
+            "title": "Testing"
+        }, 
+        {
+            "location": "/application_development/#running-an-application", 
+            "text": "The platform provides a commandline tool called dtcli\u00a0for managing applications (launching,\nkilling, viewing, etc.). This tool was already discussed above briefly\nin the section entitled Running the Test Application. It will introspect\nthe jar file specified with the launch command for applications (classes\nthat implement ApplicationFactory) or property files that define\napplications. It will also deploy the dependency jar files from the\napplication package to the cluster.  Dtcli can run the application in local mode (i.e. outside a\ncluster). It is recommended to first run the application in local mode\nin the development environment before launching on the Hadoop cluster.\nThis way some of the external system integration and correct\nfunctionality of the application can be verified in an easier to debug\nenvironment before testing distributed mode.  For more details on CLI please refer to the  dtCli Guide .", 
+            "title": "Running an application"
+        }, 
+        {
             "location": "/application_development/#application-api", 
-            "text": "This section introduces the API to write a streaming application.\nThe work involves connecting operators via streams to form the logical\nDAG. The steps are    Instantiate an application (DAG)    (Optional) Set Attributes   Assign application name  Set any other attributes as per application requirements     Create/re-use and instantiate operators   Assign operator name that is unique within the  application  Declare schema upfront for each operator (and thereby its ports)  (Optional) Set properties\u00a0 and attributes on the dag as per specification  Connect ports of operators via streams  Each stream connects one output port of an operator to one or  more input ports of other operators.  (Optional) Set attributes on the streams       Test the application.    There are two methods to create an application, namely Java, and\nProperties file. Java API is for applications being developed by humans,\nand properties file (Hadoop like) is more suited for DAGs gener
 ated by\ntools.  Java API  The Java API is the most common way to create a streaming\napplication. It is meant for application developers who prefer to\nleverage the features of Java, and the ease of use and enhanced\nproductivity provided by IDEs like NetBeans or Eclipse. Using Java to\nspecify the application provides extra validation abilities of Java\ncompiler, such as compile time checks for type safety at the time of\nwriting the code. Later in this chapter you can read more about\nvalidation support in the platform.  The developer specifies the streaming application by implementing\nthe ApplicationFactory interface, which is how platform tools (CLI etc.)\nrecognize and instantiate applications. Here we show how to create a\nYahoo! Finance application that streams the last trade price of a ticker\nand computes the high and low price in every 1 min window. Run above\n test application\u00a0to execute the\nDAG in local mode within the IDE.  Let us revisit how the Yahoo! Finance 
 test application constructs the DAG:  public class Application implements StreamingApplication\n{\n\n  ...\n\n  @Override\n  public void populateDAG(DAG dag, Configuration conf)\n  {\n    dag.getAttributes().attr(DAG.STRAM_WINDOW_SIZE_MILLIS).set(streamingWindowSizeMilliSeconds);\n\n    StockTickInput tick = getStockTickInputOperator( StockTickInput , dag);\n    SumKeyVal String, Long  dailyVolume = getDailyVolumeOperator( DailyVolume , dag);\n    ConsolidatorKeyVal String,Double,Long,String,?,?  quoteOperator = getQuoteOperator( Quote , dag);\n\n    RangeKeyVal String, Double  highlow = getHighLowOperator( HighLow , dag, appWindowCountMinute);\n    SumKeyVal String, Long  minuteVolume = getMinuteVolumeOperator( MinuteVolume , dag, appWindowCountMinute);\n    ConsolidatorKeyVal String,HighLow,Long,?,?,?  chartOperator = getChartOperator( Chart , dag);\n\n    SimpleMovingAverage String, Double  priceSMA = getPriceSimpleMovingAverageOperator( PriceSMA , dag, appWindowCountSMA);\n\n   
  dag.addStream( price , tick.price, quoteOperator.in1, highlow.data, priceSMA.data);\n    dag.addStream( vol , tick.volume, dailyVolume.data, minuteVolume.data);\n    dag.addStream( time , tick.time, quoteOperator.in3);\n    dag.addStream( daily_vol , dailyVolume.sum, quoteOperator.in2);\n\n    dag.addStream( quote_data , quoteOperator.out, getConsole( quoteConsole , dag,  QUOTE ));\n\n    dag.addStream( high_low , highlow.range, chartOperator.in1);\n    dag.addStream( vol_1min , minuteVolume.sum, chartOperator.in2);\n    dag.addStream( chart_data , chartOperator.out, getConsole( chartConsole , dag,  CHART ));\n\n    dag.addStream( sma_price , priceSMA.doubleSMA, getConsole( priceSMAConsole , dag,  Price SMA ));\n\n    return dag;\n  }\n}  Property File API  The platform also supports specification of a DAG via a property\nfile. The aim here to make it easy for tools to create and run an\napplication. This method of specification does not have the Java\ncompiler support of compile t
 ime check, but since these applications\nwould be created by software, they should be correct by construction.\nThe syntax is derived from Hadoop properties and should be easy for\nfolks who are used to creating software that integrated with\nHadoop.  Create an application (DAG): myApplication.properties  # input operator that reads from a file\ndt.operator.inputOp.classname=com.acme.SampleInputOperator\ndt.operator.inputOp.fileName=somefile.txt\n\n# output operator that writes to the console\ndt.operator.outputOp.classname=com.acme.ConsoleOutputOperator\n\n# stream connecting both operators\ndt.stream.inputStream.source=inputOp.outputPort\ndt.stream.inputStream.sinks=outputOp.inputPort  Above snippet is intended to convey the basic idea of specifying\nthe DAG without using Java. Operators would come from a predefined\nlibrary and referenced in the specification by class name and port names\n(obtained from the library providers documentation or runtime\nintrospection by tools). For 
 those interested in details, see later\nsections and refer to the  Operation and\nInstallation Guide\u00a0mentioned above.  Attributes  Attributes impact the runtime behavior of the application. They do\nnot impact the functionality. An example of an attribute is application\nname. Setting it changes the application name. Another example is\nstreaming window size. Setting it changes the streaming window size from\nthe default value to the specified value. Users cannot add new\nattributes, they can only choose from the ones that come packaged and\npre-supported by the platform. Details of attributes are covered in the\n Operation and Installation\nGuide.", 
+            "text": "This section introduces the API to write a streaming application.\nThe work involves connecting operators via streams to form the logical\nDAG. The steps are    Instantiate an application (DAG)    (Optional) Set Attributes   Assign application name  Set any other attributes as per application requirements     Create/re-use and instantiate operators   Assign operator name that is unique within the  application  Declare schema upfront for each operator (and thereby its ports)  (Optional) Set properties\u00a0 and attributes on the dag as per specification  Connect ports of operators via streams  Each stream connects one output port of an operator to one or  more input ports of other operators.  (Optional) Set attributes on the streams       Test the application.    There are two methods to create an application, namely Java, and\nProperties file. Java API is for applications being developed by humans,\nand properties file (Hadoop like) is more suited for DAGs gener
 ated by\ntools.", 
             "title": "Application API"
         }, 
         {
+            "location": "/application_development/#java-api", 
+            "text": "The Java API is the most common way to create a streaming\napplication. It is meant for application developers who prefer to\nleverage the features of Java, and the ease of use and enhanced\nproductivity provided by IDEs like NetBeans or Eclipse. Using Java to\nspecify the application provides extra validation abilities of Java\ncompiler, such as compile time checks for type safety at the time of\nwriting the code. Later in this chapter you can read more about\nvalidation support in the platform.  The developer specifies the streaming application by implementing\nthe ApplicationFactory interface, which is how platform tools (CLI etc.)\nrecognize and instantiate applications. Here we show how to create a\nYahoo! Finance application that streams the last trade price of a ticker\nand computes the high and low price in every 1 min window. Run above\n test application\u00a0to execute the\nDAG in local mode within the IDE.  Let us revisit how the Yahoo! Finance test a
 pplication constructs the DAG:  public class Application implements StreamingApplication\n{\n\n  ...\n\n  @Override\n  public void populateDAG(DAG dag, Configuration conf)\n  {\n    dag.getAttributes().attr(DAG.STRAM_WINDOW_SIZE_MILLIS).set(streamingWindowSizeMilliSeconds);\n\n    StockTickInput tick = getStockTickInputOperator( StockTickInput , dag);\n    SumKeyVal String, Long  dailyVolume = getDailyVolumeOperator( DailyVolume , dag);\n    ConsolidatorKeyVal String,Double,Long,String,?,?  quoteOperator = getQuoteOperator( Quote , dag);\n\n    RangeKeyVal String, Double  highlow = getHighLowOperator( HighLow , dag, appWindowCountMinute);\n    SumKeyVal String, Long  minuteVolume = getMinuteVolumeOperator( MinuteVolume , dag, appWindowCountMinute);\n    ConsolidatorKeyVal String,HighLow,Long,?,?,?  chartOperator = getChartOperator( Chart , dag);\n\n    SimpleMovingAverage String, Double  priceSMA = getPriceSimpleMovingAverageOperator( PriceSMA , dag, appWindowCountSMA);\n\n    dag.a
 ddStream( price , tick.price, quoteOperator.in1, highlow.data, priceSMA.data);\n    dag.addStream( vol , tick.volume, dailyVolume.data, minuteVolume.data);\n    dag.addStream( time , tick.time, quoteOperator.in3);\n    dag.addStream( daily_vol , dailyVolume.sum, quoteOperator.in2);\n\n    dag.addStream( quote_data , quoteOperator.out, getConsole( quoteConsole , dag,  QUOTE ));\n\n    dag.addStream( high_low , highlow.range, chartOperator.in1);\n    dag.addStream( vol_1min , minuteVolume.sum, chartOperator.in2);\n    dag.addStream( chart_data , chartOperator.out, getConsole( chartConsole , dag,  CHART ));\n\n    dag.addStream( sma_price , priceSMA.doubleSMA, getConsole( priceSMAConsole , dag,  Price SMA ));\n\n    return dag;\n  }\n}", 
+            "title": "Java API"
+        }, 
+        {
+            "location": "/application_development/#property-file-api", 
+            "text": "The platform also supports specification of a DAG via a property\nfile. The aim here to make it easy for tools to create and run an\napplication. This method of specification does not have the Java\ncompiler support of compile time check, but since these applications\nwould be created by software, they should be correct by construction.\nThe syntax is derived from Hadoop properties and should be easy for\nfolks who are used to creating software that integrated with\nHadoop.  Create an application (DAG): myApplication.properties  # input operator that reads from a file\ndt.operator.inputOp.classname=com.acme.SampleInputOperator\ndt.operator.inputOp.fileName=somefile.txt\n\n# output operator that writes to the console\ndt.operator.outputOp.classname=com.acme.ConsoleOutputOperator\n\n# stream connecting both operators\ndt.stream.inputStream.source=inputOp.outputPort\ndt.stream.inputStream.sinks=outputOp.inputPort  Above snippet is intended to convey the basic idea 
 of specifying\nthe DAG without using Java. Operators would come from a predefined\nlibrary and referenced in the specification by class name and port names\n(obtained from the library providers documentation or runtime\nintrospection by tools). For those interested in details, see later\nsections and refer to the  Operation and\nInstallation Guide\u00a0mentioned above.", 
+            "title": "Property File API"
+        }, 
+        {
+            "location": "/application_development/#attributes", 
+            "text": "Attributes impact the runtime behavior of the application. They do\nnot impact the functionality. An example of an attribute is application\nname. Setting it changes the application name. Another example is\nstreaming window size. Setting it changes the streaming window size from\nthe default value to the specified value. Users cannot add new\nattributes, they can only choose from the ones that come packaged and\npre-supported by the platform. Details of attributes are covered in the\n Operation and Installation\nGuide.", 
+            "title": "Attributes"
+        }, 
+        {
             "location": "/application_development/#operators", 
-            "text": "Operators\u00a0are basic compute units.\nOperators process each incoming tuple and emit zero or more tuples on\noutput ports as per the business logic. The data flow, connectivity,\nfault tolerance (node outage), etc. is taken care of by the platform. As\nan operator developer, all that is needed is to figure out what to do\nwith the incoming tuple and when (and which output port) to send out a\nparticular output tuple. Correctly designed operators will most likely\nget reused. Operator design needs care and foresight. For details, refer\nto the   Operator Developer Guide . As an application developer you need to connect operators\nin a way that it implements your business logic. You may also require\noperator customization for functionality and use attributes for\nperformance/scalability etc.  All operators process tuples asynchronously in a distributed\ncluster. An operator cannot assume or predict the exact time a tuple\nthat it emitted will get consumed by a
  downstream operator. An operator\nalso cannot predict the exact time when a tuple arrives from an upstream\noperator. The only guarantee is that the upstream operators are\nprocessing the current or a future window, i.e. the windowId of upstream\noperator is equals or exceeds its own windowId. Conversely the windowId\nof a downstream operator is less than or equals its own windowId. The\nend of a window operation, i.e. the API call to endWindow on an operator\nrequires that all upstream operators have finished processing this\nwindow. This means that completion of processing a window propagates in\na blocking fashion through an operator. Later sections provides more\ndetails on streams and data flow of tuples.  Each operator has a unique name within the DAG as provided by the\nuser. This is the name of the operator in the logical plan. The name of\nthe operator in the physical plan is an integer assigned to it by STRAM.\nThese integers are use the sequence from 1 to N, where N is t
 otal number\nof physically unique operators in the DAG. \u00a0Following the same rule,\neach partitioned instance of a logical operator has its own integer as\nan id. This id along with the Hadoop container name uniquely identifies\nthe operator in the execution plan of the DAG. The logical names and the\nphysical names are required for web service support. Operators can be\naccessed via both names. These same names are used while interacting\nwith  dtcli\u00a0to access an operator.\nIdeally these names should be self-descriptive. For example in Figure 1,\nthe node named \u201cDaily volume\u201d has a physical identifier of 2.  Operator Interface  Operator interface in a DAG consists of ports,\u00a0properties,\u00a0and attributes.\nOperators interact with other components of the DAG via ports. Functional behavior of the operators\ncan be customized via parameters. Run time performance and physical\ninstantiation is controlled by attributes. Ports and parameters are\nfields (variable
 s) of the Operator class/object, while attributes are\nmeta information that is attached to the operator object via an\nAttributeMap. An operator must have at least one port. Properties are\noptional. Attributes are provided by the platform and always have a\ndefault value that enables normal functioning of operators.  Ports  Ports are connection points by which an operator receives and\nemits tuples. These should be transient objects instantiated in the\noperator object, that implement particular interfaces. Ports should be\ntransient as they contain no state. They have a pre-defined schema and\ncan only be connected to other ports with the same schema. An input port\nneeds to implement the interface  Operator.InputPort\u00a0and\ninterface Sink. A default\nimplementation of these is provided by the abstract class DefaultInputPort. An output port needs to\nimplement the interface  Operator.OutputPort. A default implementation\nof this is provided by the concrete class DefaultOutputP
 ort. These two are a quick way to\nimplement the above interfaces, but operator developers have the option\nof providing their own implementations.  Here are examples of an input and an output port from the operator\nSum.  @InputPortFieldAnnotation(name =  data )\npublic final transient DefaultInputPort V  data = new DefaultInputPort V () {\n  @Override\n  public void process(V tuple)\n  {\n    ...\n  }\n}\n@OutputPortFieldAnnotation(optional=true)\npublic final transient DefaultOutputPort V  sum = new DefaultOutputPort V (){ \u2026 };  The process call is in the Sink interface. An emit on an output\nport is done via emit(tuple) call. For the above example it would be\nsum.emit(t), where the type of t is the generic parameter V.  There is no limit on how many ports an operator can have. However\nany operator must have at least one port. An operator with only one port\nis called an Input Adapter if it has no input port and an Output Adapter\nif it has no output port. These are specia
 l operators needed to get/read\ndata from outside system/source into the application, or push/write data\ninto an outside system/sink. These could be in Hadoop or outside of\nHadoop. These two operators are in essence gateways for the streaming\napplication to communicate with systems outside the application.  Port connectivity can be validated during compile time by adding\nPortFieldAnnotations shown above. By default all ports have to be\nconnected, to allow a port to go unconnected, you need to add\n\u201coptional=true\u201d to the annotation.  Attributes can be specified for ports that affect the runtime\nbehavior. An example of an attribute is parallel partition that specifes\na parallel computation flow per partition. It is described in detail in\nthe Parallel Partitions section. Another example is queue capacity that specifies the buffer size for the\nport. Details of attributes are covered in  Operation and Installation Guide.  Properties  Properties are the abstractions by 
 which functional behavior of an\noperator can be customized. They should be non-transient objects\ninstantiated in the operator object. They need to be non-transient since\nthey are part of the operator state and re-construction of the operator\nobject from its checkpointed state must restore the operator to the\ndesired state. Properties are optional, i.e. an operator may or may not\nhave properties; they are part of user code and their values are not\ninterpreted by the platform in any way.  All non-serializable objects should be declared transient.\nExamples include sockets, session information, etc. These objects should\nbe initialized during setup call, which is called every time the\noperator is initialized.  Attributes  Attributes are values assigned to the operators that impact\nrun-time. This includes things like the number of partitions, at most\nonce or at least once or exactly once recovery modes, etc. Attributes do\nnot impact functionality of the operator. Users can ch
 ange certain\nattributes in runtime. Users cannot add attributes to operators; they\nare pre-defined by the platform. They are interpreted by the platform\nand thus cannot be defined in user created code (like properties).\nDetails of attributes are covered in   Configuration Guide .  Operator State  The state of an operator is defined as the data that it transfers\nfrom one window to a future window. Since the computing model of the\nplatform is to treat windows like micro-batches, the operator state can\nbe checkpointed every Nth window, or every T units of time, where T is significantly greater\nthan the streaming window. \u00a0When an operator is checkpointed, the entire\nobject is written to HDFS. \u00a0The larger the amount of state in an\noperator, the longer it takes to recover from a failure. A stateless\noperator can recover much quicker than a stateful one. The needed\nwindows are preserved by the upstream buffer server and are used to\nrecompute the lost windows, and als
 o rebuild the buffer server in the\ncurrent container.  The distinction between Stateless and Stateful is based solely on\nthe need to transfer data in the operator from one window to the next.\nThe state of an operator is independent of the number of ports.  Stateless  A Stateless operator is defined as one where no data is needed to\nbe kept at the end of every window. This means that all the computations\nof a window can be derived from all the tuples the operator receives\nwithin that window. This guarantees that the output of any window can be\nreconstructed by simply replaying the tuples that arrived in that\nwindow. Stateless operators are more efficient in terms of fault\ntolerance, and cost to achieve SLA.  Stateful  A Stateful operator is defined as one where data is needed to be\nstored at the end of a window for computations occurring in later\nwindow; a common example is the computation of a sum of values in the\ninput tuples.  Operator API  The Operator API consists of
  methods that operator developers may\nneed to override. In this section we will discuss the Operator APIs from\nthe point of view of an application developer. Knowledge of how an\noperator works internally is critical for writing an application. Those\ninterested in the details should refer to  Malhar Operator Developer Guide.  The APIs are available in three modes, namely Single Streaming\nWindow, Sliding Application Window, and Aggregate Application Window.\nThese are not mutually exclusive, i.e. an operator can use single\nstreaming window as well as sliding application window. A physical\ninstance of an operator is always processing tuples from a single\nwindow. The processing of tuples is guaranteed to be sequential, no\nmatter which input port the tuples arrive on.  In the later part of this section we will evaluate three common\nuses of streaming windows by applications. They have different\ncharacteristics and implications on optimization and recovery mechanisms\n(i.e. algo
 rithm used to recover a node after outage) as discussed later\nin the section.  Streaming Window  Streaming window is atomic micro-batch computation period. The API\nmethods relating to a streaming window are as follows  public void process( tuple_type  tuple) // Called on the input port on which the tuple arrives\npublic void beginWindow(long windowId) // Called at the start of the window as soon as the first begin_window tuple arrives\npublic void endWindow() // Called at the end of the window after end_window tuples arrive on all input ports\npublic void setup(OperatorContext context) // Called once during initialization of the operator\npublic void teardown() // Called once when the operator is being shutdown  A tuple can be emitted in any of the three streaming run-time\ncalls, namely beginWindow, process, and endWindow but not in setup or\nteardown.  Aggregate Application Window  An operator with an aggregate window is stateful within the\napplication window timeframe and poss
 ibly stateless at the end of that\napplication window. An size of an aggregate application window is an\noperator attribute and is defined as a multiple of the streaming window\nsize. The platform recognizes this attribute and optimizes the operator.\nThe beginWindow, and endWindow calls are not invoked for those streaming\nwindows that do not align with the application window. For example in\ncase of streaming window of 0.5 second and application window of 5\nminute, an application window spans 600 streaming windows (5*60*2 =\n600). At the start of the sequence of these 600 atomic streaming\nwindows, a beginWindow gets invoked, and at the end of these 600\nstreaming windows an endWindow gets invoked. All the intermediate\nstreaming windows do not invoke beginWindow or endWindow. Bookkeeping,\nnode recovery, stats, UI, etc. continue to work off streaming windows.\nFor example if operators are being checkpointed say on an average every\n30th window, then the above application window 
 would have about 20\ncheckpoints.  Sliding Application Window  A sliding window is computations that requires previous N\nstreaming windows. After each streaming window the Nth past window is\ndropped and the new window is added to the computation. An operator with\nsliding window is a stateful operator at end of any window. The sliding\nwindow period is an attribute and is a multiple of streaming window. The\nplatform recognizes this attribute and leverages it during bookkeeping.\nA sliding aggregate window with tolerance to data loss does not have a\nvery high bookkeeping cost. The cost of all three recovery mechanisms,\n at most once\u00a0(data loss tolerant),\nat least once\u00a0(data loss\nintolerant), and exactly once\u00a0(data\nloss intolerant and no extra computations) is same as recovery\nmechanisms based on streaming window. STRAM is not able to leverage this\noperator for any extra optimization.  Single vs Multi-Input Operator  A single-input operator by definition has a
  single upstream\noperator, since there can only be one writing port for a stream. \u00a0If an\noperator has a single upstream operator, then the beginWindow on the\nupstream also blocks the beginWindow of the single-input operator. For\nan operator to start processing any window at least one upstream\noperator has to start processing that window. A multi-input operator\nreads from more than one upstream ports. Such an operator would start\nprocessing as soon as the first begin_window event arrives. However the\nwindow would not close (i.e. invoke endWindow) till all ports receive\nend_window events for that windowId. Thus the end of a window is a\nblocking event. As we saw earlier, a multi-input operator is also the\npoint in the DAG where windows of all upstream operators are\nsynchronized. The windows (atomic micro-batches) from a faster (or just\nahead in processing) upstream operators are queued up till the slower\nupstream operator catches up. STRAM monitors such bottlenecks a
 nd takes\ncorrective actions. The platform ensures minimal delay, i.e processing\nstarts as long as at least one upstream operator has started\nprocessing.  Recovery Mechanisms  Application developers can set any of the recovery mechanisms\nbelow to deal with node outage. In general, the cost of recovery depends\non the state of the operator, while data integrity is dependant on the\napplication. The mechanisms are per window as the platform treats\nwindows as atomic compute units. Three recovery mechanisms are\nsupported, namely   At-least-once: All atomic batches are processed at least once.\n    No data loss occurs.  At-most-once: All atomic batches are processed at most once.\n    Data loss is possible; this is the most efficient setting.  Exactly-once: All atomic batches are processed exactly once.\n    No data loss occurs; this is the least efficient setting since\n    additional work is needed to ensure proper semantics.   At-least-once is the default. During a recovery event
 , the\noperator connects to the upstream buffer server and asks for windows to\nbe replayed. At-least-once and exactly-once mechanisms start from its\ncheckpointed state. At-most-once starts from the next begin-window\nevent.  Recovery mechanisms can be specified per Operator while writing\nthe application as shown below.  Operator o = dag.addOperator(\u201coperator\u201d, \u2026);\ndag.setAttribute(o,  OperatorContext.PROCESSING_MODE,  ProcessingMode.AT_MOST_ONCE);  Also note that once an operator is attributed to AT_MOST_ONCE,\nall the operators downstream to it have to be AT_MOST_ONCE. The client\nwill give appropriate warnings or errors if that\u2019s not the case.  Details are explained in the chapter on Fault Tolerance below.", 
+            "text": "Operators\u00a0are basic compute units.\nOperators process each incoming tuple and emit zero or more tuples on\noutput ports as per the business logic. The data flow, connectivity,\nfault tolerance (node outage), etc. is taken care of by the platform. As\nan operator developer, all that is needed is to figure out what to do\nwith the incoming tuple and when (and which output port) to send out a\nparticular output tuple. Correctly designed operators will most likely\nget reused. Operator design needs care and foresight. For details, refer\nto the   Operator Developer Guide . As an application developer you need to connect operators\nin a way that it implements your business logic. You may also require\noperator customization for functionality and use attributes for\nperformance/scalability etc.  All operators process tuples asynchronously in a distributed\ncluster. An operator cannot assume or predict the exact time a tuple\nthat it emitted will get consumed by a
  downstream operator. An operator\nalso cannot predict the exact time when a tuple arrives from an upstream\noperator. The only guarantee is that the upstream operators are\nprocessing the current or a future window, i.e. the windowId of upstream\noperator is equals or exceeds its own windowId. Conversely the windowId\nof a downstream operator is less than or equals its own windowId. The\nend of a window operation, i.e. the API call to endWindow on an operator\nrequires that all upstream operators have finished processing this\nwindow. This means that completion of processing a window propagates in\na blocking fashion through an operator. Later sections provides more\ndetails on streams and data flow of tuples.  Each operator has a unique name within the DAG as provided by the\nuser. This is the name of the operator in the logical plan. The name of\nthe operator in the physical plan is an integer assigned to it by STRAM.\nThese integers are use the sequence from 1 to N, where N is t
 otal number\nof physically unique operators in the DAG. \u00a0Following the same rule,\neach partitioned instance of a logical operator has its own integer as\nan id. This id along with the Hadoop container name uniquely identifies\nthe operator in the execution plan of the DAG. The logical names and the\nphysical names are required for web service support. Operators can be\naccessed via both names. These same names are used while interacting\nwith  dtcli\u00a0to access an operator.\nIdeally these names should be self-descriptive. For example in Figure 1,\nthe node named \u201cDaily volume\u201d has a physical identifier of 2.", 
             "title": "Operators"
         }, 
         {
+            "location": "/application_development/#operator-interface", 
+            "text": "Operator interface in a DAG consists of ports,\u00a0properties,\u00a0and attributes.\nOperators interact with other components of the DAG via ports. Functional behavior of the operators\ncan be customized via parameters. Run time performance and physical\ninstantiation is controlled by attributes. Ports and parameters are\nfields (variables) of the Operator class/object, while attributes are\nmeta information that is attached to the operator object via an\nAttributeMap. An operator must have at least one port. Properties are\noptional. Attributes are provided by the platform and always have a\ndefault value that enables normal functioning of operators.", 
+            "title": "Operator Interface"
+        }, 
+        {
+            "location": "/application_development/#ports", 
+            "text": "Ports are connection points by which an operator receives and\nemits tuples. These should be transient objects instantiated in the\noperator object, that implement particular interfaces. Ports should be\ntransient as they contain no state. They have a pre-defined schema and\ncan only be connected to other ports with the same schema. An input port\nneeds to implement the interface  Operator.InputPort\u00a0and\ninterface Sink. A default\nimplementation of these is provided by the abstract class DefaultInputPort. An output port needs to\nimplement the interface  Operator.OutputPort. A default implementation\nof this is provided by the concrete class DefaultOutputPort. These two are a quick way to\nimplement the above interfaces, but operator developers have the option\nof providing their own implementations.  Here are examples of an input and an output port from the operator\nSum.  @InputPortFieldAnnotation(name =  data )\npublic final transient DefaultInputPort V 
  data = new DefaultInputPort V () {\n  @Override\n  public void process(V tuple)\n  {\n    ...\n  }\n}\n@OutputPortFieldAnnotation(optional=true)\npublic final transient DefaultOutputPort V  sum = new DefaultOutputPort V (){ \u2026 };  The process call is in the Sink interface. An emit on an output\nport is done via emit(tuple) call. For the above example it would be\nsum.emit(t), where the type of t is the generic parameter V.  There is no limit on how many ports an operator can have. However\nany operator must have at least one port. An operator with only one port\nis called an Input Adapter if it has no input port and an Output Adapter\nif it has no output port. These are special operators needed to get/read\ndata from outside system/source into the application, or push/write data\ninto an outside system/sink. These could be in Hadoop or outside of\nHadoop. These two operators are in essence gateways for the streaming\napplication to communicate with systems outside the applicati
 on.  Port connectivity can be validated during compile time by adding\nPortFieldAnnotations shown above. By default all ports have to be\nconnected, to allow a port to go unconnected, you need to add\n\u201coptional=true\u201d to the annotation.  Attributes can be specified for ports that affect the runtime\nbehavior. An example of an attribute is parallel partition that specifes\na parallel computation flow per partition. It is described in detail in\nthe Parallel Partitions section. Another example is queue capacity that specifies the buffer size for the\nport. Details of attributes are covered in  Operation and Installation Guide.", 
+            "title": "Ports"
+        }, 
+        {
+            "location": "/application_development/#properties", 
+            "text": "Properties are the abstractions by which functional behavior of an\noperator can be customized. They should be non-transient objects\ninstantiated in the operator object. They need to be non-transient since\nthey are part of the operator state and re-construction of the operator\nobject from its checkpointed state must restore the operator to the\ndesired state. Properties are optional, i.e. an operator may or may not\nhave properties; they are part of user code and their values are not\ninterpreted by the platform in any way.  All non-serializable objects should be declared transient.\nExamples include sockets, session information, etc. These objects should\nbe initialized during setup call, which is called every time the\noperator is initialized.", 
+            "title": "Properties"
+        }, 
+        {
+            "location": "/application_development/#attributes_1", 
+            "text": "Attributes are values assigned to the operators that impact\nrun-time. This includes things like the number of partitions, at most\nonce or at least once or exactly once recovery modes, etc. Attributes do\nnot impact functionality of the operator. Users can change certain\nattributes in runtime. Users cannot add attributes to operators; they\nare pre-defined by the platform. They are interpreted by the platform\nand thus cannot be defined in user created code (like properties).\nDetails of attributes are covered in   Configuration Guide .", 
+            "title": "Attributes"
+        }, 
+        {
+            "location": "/application_development/#operator-state", 
+            "text": "The state of an operator is defined as the data that it transfers\nfrom one window to a future window. Since the computing model of the\nplatform is to treat windows like micro-batches, the operator state can\nbe checkpointed every Nth window, or every T units of time, where T is significantly greater\nthan the streaming window. \u00a0When an operator is checkpointed, the entire\nobject is written to HDFS. \u00a0The larger the amount of state in an\noperator, the longer it takes to recover from a failure. A stateless\noperator can recover much quicker than a stateful one. The needed\nwindows are preserved by the upstream buffer server and are used to\nrecompute the lost windows, and also rebuild the buffer server in the\ncurrent container.  The distinction between Stateless and Stateful is based solely on\nthe need to transfer data in the operator from one window to the next.\nThe state of an operator is independent of the number of ports.", 
+            "title": "Operator State"
+        }, 
+        {
+            "location": "/application_development/#stateless", 
+            "text": "A Stateless operator is defined as one where no data is needed to\nbe kept at the end of every window. This means that all the computations\nof a window can be derived from all the tuples the operator receives\nwithin that window. This guarantees that the output of any window can be\nreconstructed by simply replaying the tuples that arrived in that\nwindow. Stateless operators are more efficient in terms of fault\ntolerance, and cost to achieve SLA.", 
+            "title": "Stateless"
+        }, 
+        {
+            "location": "/application_development/#stateful", 
+            "text": "A Stateful operator is defined as one where data is needed to be\nstored at the end of a window for computations occurring in later\nwindow; a common example is the computation of a sum of values in the\ninput tuples.", 
+            "title": "Stateful"
+        }, 
+        {
+            "location": "/application_development/#operator-api", 
+            "text": "The Operator API consists of methods that operator developers may\nneed to override. In this section we will discuss the Operator APIs from\nthe point of view of an application developer. Knowledge of how an\noperator works internally is critical for writing an application. Those\ninterested in the details should refer to  Malhar Operator Developer Guide.  The APIs are available in three modes, namely Single Streaming\nWindow, Sliding Application Window, and Aggregate Application Window.\nThese are not mutually exclusive, i.e. an operator can use single\nstreaming window as well as sliding application window. A physical\ninstance of an operator is always processing tuples from a single\nwindow. The processing of tuples is guaranteed to be sequential, no\nmatter which input port the tuples arrive on.  In the later part of this section we will evaluate three common\nuses of streaming windows by applications. They have different\ncharacteristics and implications on
  optimization and recovery mechanisms\n(i.e. algorithm used to recover a node after outage) as discussed later\nin the section.", 
+            "title": "Operator API"
+        }, 
+        {
+            "location": "/application_development/#streaming-window", 
+            "text": "Streaming window is atomic micro-batch computation period. The API\nmethods relating to a streaming window are as follows  public void process( tuple_type  tuple) // Called on the input port on which the tuple arrives\npublic void beginWindow(long windowId) // Called at the start of the window as soon as the first begin_window tuple arrives\npublic void endWindow() // Called at the end of the window after end_window tuples arrive on all input ports\npublic void setup(OperatorContext context) // Called once during initialization of the operator\npublic void teardown() // Called once when the operator is being shutdown  A tuple can be emitted in any of the three streaming run-time\ncalls, namely beginWindow, process, and endWindow but not in setup or\nteardown.", 
+            "title": "Streaming Window"
+        }, 
+        {
+            "location": "/application_development/#aggregate-application-window", 
+            "text": "An operator with an aggregate window is stateful within the\napplication window timeframe and possibly stateless at the end of that\napplication window. An size of an aggregate application window is an\noperator attribute and is defined as a multiple of the streaming window\nsize. The platform recognizes this attribute and optimizes the operator.\nThe beginWindow, and endWindow calls are not invoked for those streaming\nwindows that do not align with the application window. For example in\ncase of streaming window of 0.5 second and application window of 5\nminute, an application window spans 600 streaming windows (5*60*2 =\n600). At the start of the sequence of these 600 atomic streaming\nwindows, a beginWindow gets invoked, and at the end of these 600\nstreaming windows an endWindow gets invoked. All the intermediate\nstreaming windows do not invoke beginWindow or endWindow. Bookkeeping,\nnode recovery, stats, UI, etc. continue to work off streaming windows.\nF
 or example if operators are being checkpointed say on an average every\n30th window, then the above application window would have about 20\ncheckpoints.", 
+            "title": "Aggregate Application Window"
+        }, 
+        {
+            "location": "/application_development/#sliding-application-window", 
+            "text": "A sliding window is computations that requires previous N\nstreaming windows. After each streaming window the Nth past window is\ndropped and the new window is added to the computation. An operator with\nsliding window is a stateful operator at end of any window. The sliding\nwindow period is an attribute and is a multiple of streaming window. The\nplatform recognizes this attribute and leverages it during bookkeeping.\nA sliding aggregate window with tolerance to data loss does not have a\nvery high bookkeeping cost. The cost of all three recovery mechanisms,\n at most once\u00a0(data loss tolerant),\nat least once\u00a0(data loss\nintolerant), and exactly once\u00a0(data\nloss intolerant and no extra computations) is same as recovery\nmechanisms based on streaming window. STRAM is not able to leverage this\noperator for any extra optimization.", 
+            "title": "Sliding Application Window"
+        }, 
+        {
+            "location": "/application_development/#single-vs-multi-input-operator", 
+            "text": "A single-input operator by definition has a single upstream\noperator, since there can only be one writing port for a stream. \u00a0If an\noperator has a single upstream operator, then the beginWindow on the\nupstream also blocks the beginWindow of the single-input operator. For\nan operator to start processing any window at least one upstream\noperator has to start processing that window. A multi-input operator\nreads from more than one upstream ports. Such an operator would start\nprocessing as soon as the first begin_window event arrives. However the\nwindow would not close (i.e. invoke endWindow) till all ports receive\nend_window events for that windowId. Thus the end of a window is a\nblocking event. As we saw earlier, a multi-input operator is also the\npoint in the DAG where windows of all upstream operators are\nsynchronized. The windows (atomic micro-batches) from a faster (or just\nahead in processing) upstream operators are queued up till the slower\
 nupstream operator catches up. STRAM monitors such bottlenecks and takes\ncorrective actions. The platform ensures minimal delay, i.e processing\nstarts as long as at least one upstream operator has started\nprocessing.", 
+            "title": "Single vs Multi-Input Operator"
+        }, 
+        {
+            "location": "/application_development/#recovery-mechanisms", 
+            "text": "Application developers can set any of the recovery mechanisms\nbelow to deal with node outage. In general, the cost of recovery depends\non the state of the operator, while data integrity is dependant on the\napplication. The mechanisms are per window as the platform treats\nwindows as atomic compute units. Three recovery mechanisms are\nsupported, namely   At-least-once: All atomic batches are processed at least once.\n    No data loss occurs.  At-most-once: All atomic batches are processed at most once.\n    Data loss is possible; this is the most efficient setting.  Exactly-once: All atomic batches are processed exactly once.\n    No data loss occurs; this is the least efficient setting since\n    additional work is needed to ensure proper semantics.   At-least-once is the default. During a recovery event, the\noperator connects to the upstream buffer server and asks for windows to\nbe replayed. At-least-once and exactly-once mechanisms start from its\ncheckp
 ointed state. At-most-once starts from the next begin-window\nevent.  Recovery mechanisms can be specified per Operator while writing\nthe application as shown below.  Operator o = dag.addOperator(\u201coperator\u201d, \u2026);\ndag.setAttribute(o,  OperatorContext.PROCESSING_MODE,  ProcessingMode.AT_MOST_ONCE);  Also note that once an operator is attributed to AT_MOST_ONCE,\nall the operators downstream to it have to be AT_MOST_ONCE. The client\nwill give appropriate warnings or errors if that\u2019s not the case.  Details are explained in the chapter on Fault Tolerance below.", 
+            "title": "Recovery Mechanisms"
+        }, 
+        {
             "location": "/application_development/#streams", 
             "text": "A stream\u00a0is a connector\n(edge) abstraction, and is a fundamental building block of the platform.\nA stream consists of tuples that flow from one port (called the\noutput\u00a0port) to one or more ports\non other operators (called  input\u00a0ports) another -- so note a potentially\nconfusing aspect of this terminology: tuples enter a stream through its\noutput port and leave via one or more input ports. A stream has the\nfollowing characteristics   Tuples are always delivered in the same order in which they\n    were emitted.  Consists of a sequence of windows one after another. Each\n    window being a collection of in-order tuples.  A stream that connects two containers passes through a\n    buffer server.  All streams can be persisted (by default in HDFS).  Exactly one output port writes to the stream.  Can be read by one or more input ports.  Connects operators within an application, not outside\n    an application.  Has an unique name within an applic
 ation.  Has attributes which act as hints to STRAM.   Streams have four modes, namely in-line, in-node, in-rack,\n    and other. Modes may be overruled (for example due to lack\n    of containers). They are defined as follows:   THREAD_LOCAL: In the same thread, uses thread\n    stack (intra-thread). This mode can only be used for a downstream\n    operator which has only one input port connected; also called\n    in-line.  CONTAINER_LOCAL: In the same container (intra-process); also\n    called in-container.  NODE_LOCAL: In the same Hadoop node (inter processes, skips\n    NIC); also called in-node.  RACK_LOCAL: On nodes in the same rack; also called\n    in-rack.  unspecified: No guarantee. Could be anywhere within the\n    cluster     An example of a stream declaration is given below  DAG dag = new DAG();\n \u2026\ndag.addStream( views , viewAggregate.sum, cost.data).setLocality(CONTAINER_LOCAL); // A container local  stream\ndag.addStream(\u201cclicks\u201d, clickAggregate.sum, 
 rev.data); // An example of unspecified locality  The platform guarantees in-order delivery of tuples in a stream.\nSTRAM views each stream as collection of ordered windows. Since no tuple\ncan exist outside a window, a replay of a stream consists of replay of a\nset of windows. When multiple input ports read the same stream, the\nexecution plan of a stream ensures that each input port is logically not\nblocked by the reading of another input port. The schema of a stream is\nsame as the schema of the tuple.  In a stream all tuples emitted by an operator in a window belong\nto that window. A replay of this window would consists of an in-order\nreplay of all the tuples. Thus the tuple order within a stream is\nguaranteed. However since an operator may receive multiple streams (for\nexample an operator with two input ports), the order of arrival of two\ntuples belonging to different streams is not guaranteed. In general in\nan asynchronous distributed architecture this is expected. Thu
 s the\noperator (specially one with multiple input ports) should not depend on\nthe tuple order from two streams. One way to cope with this\nindeterminate order, if necessary, is to wait to get all the tuples of a\nwindow and emit results in endWindow call. All operator templates\nprovided as part of Malhar operator library follow these principles.  A logical stream gets partitioned into physical streams each\nconnecting the partition to the upstream operator. If two different\nattributes are needed on the same stream, it should be split using\nStreamDuplicator\u00a0operator.  Modes of the streams are critical for performance. An in-line\nstream is the most optimal as it simply delivers the tuple as-is without\nserialization-deserialization. Streams should be marked\ncontainer_local, specially in case where there is a large tuple volume\nbetween two operators which then on drops significantly. Since the\nsetLocality call merely provides a hint, STRAM may ignore it. An In-node\nstrea
 m is not as efficient as an in-line one, but it is clearly better\nthan going off-node since it still avoids the potential bottleneck of\nthe network card.  THREAD_LOCAL and CONTAINER_LOCAL streams do not use a buffer\nserver as this stream is in a single process. The other two do.", 
             "title": "Streams"
         }, 
         {
             "location": "/application_development/#validating-an-application", 
-            "text": "The platform provides various ways of validating the application\nspecification and data input. An understanding of these checks is very\nimportant for an application developer since it affects productivity.\nValidation of an application is done in three phases, namely   Compile Time: Caught during application development, and is\n    most cost effective. These checks are mainly done on declarative\n    objects and leverages the Java compiler. An example is checking that\n    the schemas specified on all ports of a stream are\n    mutually compatible.  Initialization Time: When the application is being\n    initialized, before submitting to Hadoop. These checks are related\n    to configuration/context of an application, and are done by the\n    logical DAG builder implementation. An example is the checking that\n    all non-optional ports are connected to other ports.  Run Time: Validations done when the application is running.\n    This is the costliest of all
  checks. These are checks that can only\n    be done at runtime as they involve data. For example divide by 0\n    check as part of business logic.   Compile Time  Compile time validations apply when an application is specified in\nJava code and include all checks that can be done by Java compiler in\nthe development environment (including IDEs like NetBeans or Eclipse).\nExamples include   Schema Validation: The tuples on ports are POJO (plain old\n    java objects) and compiler checks to ensure that all the ports on a\n    stream have the same schema.  Stream Check: Single Output Port and at least one Input port\n    per stream. A stream can only have one output port writer. This is\n    part of the addStream api. This\n    check ensures that developers only connect one output port to\n    a stream. The same signature also ensures that there is at least one\n    input port for a stream  Naming: Compile time checks ensures that applications\n    components operators, streams are na
 med   Initialization/Instantiation Time  Initialization time validations include various checks that are\ndone post compile, and before the application starts running in a\ncluster (or local mode). These are mainly configuration/contextual in\nnature. These checks are as critical to proper functionality of the\napplication as the compile time validations.  Examples include    JavaBeans Validation :\n    Examples include   @Max(): Value must be less than or equal to the number  @Min(): Value must be greater than or equal to the\n    number  @NotNull: The value of the field or property must not be\n    null  @Pattern(regexp = \u201c....\u201d): Value must match the regular\n    expression  Input port connectivity: By default, every non-optional input\n    port must be connected. A port can be declared optional by using an\n    annotation: \u00a0 \u00a0 @InputPortFieldAnnotation(name = \"...\", optional\n    = true)  Output Port Connectivity: Similar. The annotation here is: \u00a0 \u0
 0a0\n    @OutputPortFieldAnnotation(name = \"...\", optional = true)     Unique names in application scope: Operators, streams, must have\n    unique names.   Cycles in the dag: DAG cannot have a cycle.  Unique names in operator scope: Ports, properties, annotations\n    must have unique names.  One stream per port: A port can connect to only one stream.\n    This check applies to input as well as output ports even though an\n    output port can technically write to two streams. If you must have\n    two streams originating from a single output port, use \u00a0a\u00a0streamDuplicator operator.  Application Window Period: Has to be an integral multiple the\n    streaming window period.   Run Time  Run time checks are those that are done when the application is\nrunning. The real-time streaming platform provides rich run time error\nhandling mechanisms. The checks are exclusively done by the application\nbusiness logic, but the platform allows applications to count and audit\nthese. S
 ome of these features are in the process of development (backend\nand UI) and this section will be updated as they are developed. Upon\ncompletion examples will be added to demos to illustrate these.  Error ports are output ports with error annotations. Since they\nare normal ports, they can be monitored and tuples counted, persisted\nand counts shown in the UI.", 
+            "text": "The platform provides various ways of validating the application\nspecification and data input. An understanding of these checks is very\nimportant for an application developer since it affects productivity.\nValidation of an application is done in three phases, namely   Compile Time: Caught during application development, and is\n    most cost effective. These checks are mainly done on declarative\n    objects and leverages the Java compiler. An example is checking that\n    the schemas specified on all ports of a stream are\n    mutually compatible.  Initialization Time: When the application is being\n    initialized, before submitting to Hadoop. These checks are related\n    to configuration/context of an application, and are done by the\n    logical DAG builder implementation. An example is the checking that\n    all non-optional ports are connected to other ports.  Run Time: Validations done when the application is running.\n    This is the costliest of all
  checks. These are checks that can only\n    be done at runtime as they involve data. For example divide by 0\n    check as part of business logic.", 
             "title": "Validating an Application"
         }, 
         {
+            "location": "/application_development/#compile-time", 
+            "text": "Compile time validations apply when an application is specified in\nJava code and include all checks that can be done by Java compiler in\nthe development environment (including IDEs like NetBeans or Eclipse).\nExamples include   Schema Validation: The tuples on ports are POJO (plain old\n    java objects) and compiler checks to ensure that all the ports on a\n    stream have the same schema.  Stream Check: Single Output Port and at least one Input port\n    per stream. A stream can only have one output port writer. This is\n    part of the addStream api. This\n    check ensures that developers only connect one output port to\n    a stream. The same signature also ensures that there is at least one\n    input port for a stream  Naming: Compile time checks ensures that applications\n    components operators, streams are named", 
+            "title": "Compile Time"
+        }, 
+        {
+            "location": "/application_development/#initializationinstantiation-time", 
+            "text": "Initialization time validations include various checks that are\ndone post compile, and before the application starts running in a\ncluster (or local mode). These are mainly configuration/contextual in\nnature. These checks are as critical to proper functionality of the\napplication as the compile time validations.  Examples include    JavaBeans Validation :\n    Examples include   @Max(): Value must be less than or equal to the number  @Min(): Value must be greater than or equal to the\n    number  @NotNull: The value of the field or property must not be\n    null  @Pattern(regexp = \u201c....\u201d): Value must match the regular\n    expression  Input port connectivity: By default, every non-optional input\n    port must be connected. A port can be declared optional by using an\n    annotation: \u00a0 \u00a0 @InputPortFieldAnnotation(name = \"...\", optional\n    = true)  Output Port Connectivity: Similar. The annotation here is: \u00a0 \u00a0\n    @OutputPort
 FieldAnnotation(name = \"...\", optional = true)     Unique names in application scope: Operators, streams, must have\n    unique names.   Cycles in the dag: DAG cannot have a cycle.  Unique names in operator scope: Ports, properties, annotations\n    must have unique names.  One stream per port: A port can connect to only one stream.\n    This check applies to input as well as output ports even though an\n    output port can technically write to two streams. If you must have\n    two streams originating from a single output port, use \u00a0a\u00a0streamDuplicator operator.  Application Window Period: Has to be an integral multiple the\n    streaming window period.", 
+            "title": "Initialization/Instantiation Time"
+        }, 
+        {
+            "location": "/application_development/#run-time", 
+            "text": "Run time checks are those that are done when the application is\nrunning. The real-time streaming platform provides rich run time error\nhandling mechanisms. The checks are exclusively done by the application\nbusiness logic, but the platform allows applications to count and audit\nthese. Some of these features are in the process of development (backend\nand UI) and this section will be updated as they are developed. Upon\ncompletion examples will be added to demos to illustrate these.  Error ports are output ports with error annotations. Since they\nare normal ports, they can be monitored and tuples counted, persisted\nand counts shown in the UI.", 
+            "title": "Run Time"
+        }, 
+        {
             "location": "/application_development/#multi-tenancy-and-security", 
             "text": "Hadoop is a multi-tenant distributed operating system. Security is\nan intrinsic element of multi-tenancy as without it a cluster cannot be\nreasonably be shared among enterprise applications. Streaming\napplications follow all multi-tenancy security models used in Hadoop as\nthey are native Hadoop applications.", 
             "title": "Multi-Tenancy and Security"
@@ -147,10 +307,40 @@
         }, 
         {
             "location": "/application_development/#partitioning", 
-            "text": "If all tuples sent through the stream(s) that are connected to the\ninput port(s) of an operator in the DAG are received by a single\nphysical instance of that operator, that operator can become a\nperformance bottleneck. This leads to scalability issues when\nthroughput, memory, or CPU needs exceed the processing capacity of that\nsingle instance.  To address the problem, the platform offers the capability to\npartition the inflow of data so that it is divided across multiple\nphysical instances of a logical operator in the DAG. There are two\nfunctional ways to partition   Load balance: Incoming load is simply partitioned\n    into stream(s) that go to separate instances of physical operators\n    and scalability is achieved via adding more physical operators. Each\n    tuple is sent to physical operator (partition) based on a\n    round-robin or other similar algorithm. This scheme scales linearly.\n    A lot of key based computations can load balance in the 
 platform due\n    to the ability to insert  Unifiers. For many computations, the\n    endWindow and Unifier setup is similar to the combiner and reducer\n    mechanism in a Map-Reduce computation.  Sticky Key: The key assertion is that distribution of tuples\n    are sticky, i.e the data with\n    same key will always be processed by the same physical operator, no\n    matter how many times it is sent through the stream. This stickiness\n    will continue even if the number of partitions grows dynamically and\n    can eventually be leveraged for advanced features like\n    bucket testing. How this is accomplished and what is required to\n    develop compliant operators will be explained below.   We plan to add more partitioning mechanisms proactively to the\nplatform over time as needed by emerging usage patterns. The aim is to\nallow enterprises to be able to focus on their business logic, and\nsignificantly reduce the cost of operability. As an enabling technology\nfor managing hi
 gh loads, this platform provides enterprises with a\nsignificant innovative edge. Scalability and Partitioning is a\nfoundational building block for this platform.  Sticky Partition vs Round Robin  As noted above, partitioning via sticky key is data aware but\nround-robin partitioning is not. An example for non-sticky load\nbalancing would be round robin distribution over multiple instances,\nwhere for example a tuple stream of  A, A,\nA with 3 physical operator\ninstances would result in processing of a single A by each of the instances, In contrast, sticky\npartitioning means that exactly one instance of the operators will\nprocess all of the  Atuples if they\nfall into the same bucket, while B\nmay be processed by another operator. Data aware mapping of\ntuples to partitions (similar to distributed hash table) is accomplished\nvia Stream Codecs. In later sections we would show how these two\napproaches can be used in combination.  Stream Codec  The platform does not make assumpti
 ons about the tuple\ntype, it could be any Java object. The operator developer knows what\ntuple type an input port expects and is capable of processing. Each\ninput port has a stream codec \u00a0associated thatdefines how data is serialized when transmitted over a socket\nstream; it also defines another\nfunction that computes the partition hash key for the tuple. The engine\nuses that key to determine which physical instance(s) \u00a0(for a\npartitioned operator) receive that \u00a0tuple. For this to work, consistent hashing is required.\nThe default codec uses the Java Object#hashCode function, which is\nsufficient for basic types such as Integer, String etc. It will also\nwork with custom tuple classes as long as they implement hashCode\nappropriately. Reliance on hashCode may not work when generic containers\nare used that do not hash the actual data, such as standard collection\nclasses (HashMap etc.), in which case a custom stream codec must be\nassigned to the input port.  S
 tatic Partitioning  DAG designers can specify at design time how they would like\ncertain operators to be partitioned. STRAM then instantiates the DAG\nwith the physical plan which adheres to the partitioning scheme defined\nby the design. This plan is the initial partition of the application. In\nother words, Static Partitioning is used to tell STRAM to compute the\nphysical DAG from a logical DAG once, without taking into consideration\nruntime states or loads of various operators.  Dynamic Partitioning  In streaming applications the load changes during the day, thus\ncreating situations where the number of partitioned operator instances\nneeds to adjust dynamically. The load can be measured in terms of\nprocessing within the DAG based on throughput, or latency, or\nconsiderations in external system components (time based etc.) that the\nplatform may not be aware of. Whatever the trigger, the resource\nrequirement for the current processing needs to be adjusted at run-time.\nThe p
 latform may detect that operator instances are over or under\nutilized and may need to dynamically adjust the number of instances on\nthe fly. More instances of a logical operator may be required (partition\nsplit) or underutilized operator instances may need decommissioning\n(partition merge). We refer to either of the changes as dynamic\npartitioning. The default partitioning scheme supports split and merge\nof partitions, but without state transfer. The contract of the\nPartitioner\u00a0interface allows the operator\ndeveloper to implement split/merge and the associated state transfer, if\nnecessary.  Since partitioning is a key scalability measure, our goal is to\nmake it as simple as possible without removing the flexibility needed\nfor sophisticated applications. Basic partitioning can be enabled at\ncompile time through the DAG specification. A slightly involved\npartitioning involves writing custom codecs to calculate data aware\npartitioning scheme. More complex partitionin
 g cases may require users\nto provide a custom implementation of Partitioner, which gives the\ndeveloper full control over state transfer between multiple instances of\nthe partitioned operator.  Default Partitioning  The platform provides a default partitioning implementation that\ncan be enabled without implementing Partitioner\u00a0(or writing any other extra Java\ncode), which is designed to support simple sticky partitioning out of\nthe box for operators with logic agnostic to the partitioning scheme\nthat can be enabled by means of DAG construction alone.  Typically an operator that can work with the default partitioning\nscheme would have a single input port. If there are multiple input\nports, only one port will be partitioned (the port first connected in\nthe DAG). The number of partitions will be calculat

<TRUNCATED>


Mime
View raw message