gearpump-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kam Kasravi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GEARPUMP-86) Application manifest definition and usage
Date Thu, 21 Apr 2016 19:20:25 GMT

    [ https://issues.apache.org/jira/browse/GEARPUMP-86?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252491#comment-15252491
] 

Kam Kasravi commented on GEARPUMP-86:
-------------------------------------

Imported from [#1455|https://github.com/gearpump/gearpump/issues/1455]

> Application manifest definition and usage
> -----------------------------------------
>
>                 Key: GEARPUMP-86
>                 URL: https://issues.apache.org/jira/browse/GEARPUMP-86
>             Project: Apache Gearpump
>          Issue Type: Bug
>          Components: examples, restapi
>    Affects Versions: 0.8.0
>            Reporter: Kam Kasravi
>             Fix For: 0.8.1
>
>
> h1. GearPump Application and Library Manifest
> h2. Requirements
> This design is to solve a number of issues related to application jar submission and
DAG UI creation and manipulation. The goals are to make it easier or possible to:
>  
> {quote}
> # Submit applications and have them run with zero configuration. Everything required
for them to run including environment specific configuration properties is in the jar. Submission
may be browser or command based.
> # Decrease the likelihood of DAG failure after submission due to misconfiguration, etc.
> # Manipulate or create a DAG easily in a browser without requiring manual input of Tasks.
> # Query a task repository for specific tasks or for a listing of task types that can
be imported / exposed within a DAG Editor.
> # Enable an extensibility mechanism where libraries of tasks may be defined and categorized
and made available to tools like DAG builders
> # Provide a way to reduce jar sizes and possibly share jar dependencies across tasks.
> # Allow categorization of tasks as appropriate for types of flow: source, sink, bidirectional,
multicast, etc. This information may be used by DAG editors to embellish task icons or filter
task selection, etc.
> # Allow a task to be optionally typed with related message types for input and output.
This may allows programmatic generation of flows or even programmatic generation of message
types. 
> # Allow task grouping which would enable sub-graph imports and labeling. A DAG editor
should allow the user to define and package a subgraph. Considering the dynamic DAG usage,
user wants to attach a subgraph to the running DAG. But, how is this subgraph being developed,
defined and packaged? E.g. a data scientist can develop a new model (subgraph) to read from
kafka source and performing scoring. Apparently, when packaging, we need to avoid packaging
the kafka source processor into the package. Otherwise, this model will not run as it will
use training kafka instead of product kakfa.
> # Allow a task to be distributed and run with a minimal set of dependencies.
> h2. Use Cases
> Currently GearPump applications can be submitted as a jar via the browser. To the browser
this jar is opaque. Validation of the application must be done server side, and error details
may be needlessly complex, misleading or uninformative to the end user. Additionally, surfacing
DAG information prior to submitting the application for possible editing is hopelessly convoluted
and would involve: 
> * submitting the jar
> * querying the server to retrieve application details
> * killing the application
> Given the jar may be quite large - having the user wait possibly minutes for upload just
to expose the DAG is impractical and limits functionality required to move forward with critical
features like DAG creation and editing. Rather we require a mechanism that surfaces application
information like its DAG within seconds after selecting the jar on the user's machine. We
also want an ability to minimize the size of GearPump applications even when these applications
may have a massive set of dependencies. Upcoming use cases may require dependencies that are
impractical to resolve using jar inclusion or fat jars. Additionally, the current application
jar does not provide information that could allow individual task distribution with specific
jar allocations. In other words a 100M application jar defining 5 Tasks would require this
jar to be distributed to each GearPump worker even if one of the tasks did nothing but summed
2 fields and sent the result downstream.
> h2. Design
> h3. Manifest details
> h4. Manifest structure
> Central to this design is a manifest specification that describes a GearPump jar. This
specification is a JSON file that can be bundled with all GearPump jars. The creation of this
manifest will either be generated programmatically or could be built manually. Programmatic
generation will be a simple tool or sbt plugin that builds this file as part of application
packaging. Similar to node.js's _[package.json|https://docs.npmjs.com/files/package.json]_
file, this file will provide meta-data describing a GearPump application or library including:
> > - manifest version
> > - manifest type
> > - application main
> > - application name
> > - application version
> > - application configuration
> > - processors
> > - graph
> > - dependencies
> > - repos
> > - user name
> > - keywords
> h4. Manifest name
> Within the jar file, a top level entry of the name gearpump.json will indicate that this
jar is a GearPump related jar. This file will hold contents noted above and will be validated
per its JSON schema included in this document.
> h3. Entries and definitions
> The JSON schema describing the above entries will provide relevant typing that may be
structural or categorical. For example the manifest type entry above would be (APPLICATION|LIBRARY).
It's anticipated that additional information not specified here will be required. For example
a user's organization or perhaps a user's role capability. There could be other security attributes
as well. Updates to the JSON schema that include new information will result in a bump in
the manifest version to reflect the schema change. The manifest version will adhere to the
[semantic versioning specification|http://semver.org/]. It's also anticipated that this manifest
will enable DAG creation and tooling related to types of applications that a DAG creation
tool may want to expose for example possible types of tasks within general categories (source,
sink, bidirectional, multitask etc).
> h3. Dynamic creation of applications
> One impact of providing a manifest is the application main entry noted above in the manifest
structure is now optional. Once a manifest is generated as part of the build we could safely
remove all application mains - like those that are included under the gearpump examples directory.
We **will** require the jar submission REST endpoint to do the equivalent of:
>  - parse the json file 
>  - create a userConfig instance based on the application configuration values
>  - create Processor\[Task\] instances using the tasks information
>  - connect Processor\[Task\] instances based on the graph information and create a Graph
instance
>  - create a StreamingApplication instance using the application name, application config
and Graph instance
>  - submit the StreamingApplication instance to a ClientContext. 
> In other words we require the REST endpoint receiving the jar file to **programmatically**
do something similar to what the WordCount application does below based on the application
manifest.
> {code}
> object WordCount extends AkkaApp with ArgumentsParser {
>   private val LOG: Logger = LogUtil.getLogger(getClass)
>   val RUN_FOR_EVER = -1
>   override val options: Array[(String, CLIOption[Any])] = Array(
>     "split" -> CLIOption[Int]("<how many split tasks>", required = false, defaultValue
= Some(1)),
>     "sum" -> CLIOption[Int]("<how many sum tasks>", required = false, defaultValue
= Some(1))
>   )
>   def application(config: ParseResult) : StreamApplication = {
>     val splitNum = config.getInt("split")
>     val sumNum = config.getInt("sum")
>     val split = Processor[Split](splitNum)
>     val sum = Processor[Sum](sumNum)
>     val partitioner = new HashPartitioner
>     val app = StreamApplication("wordCount", Graph(split ~ partitioner ~> sum), UserConfig.empty)
>     app
>   }
>   override def main(akkaConf: Config, args: Array[String]): Unit = {
>     val config = parse(args)
>     val context = ClientContext(akkaConf)
>     val appId = context.submit(application(config))
>     context.close()
>   }
> }
> {code}
> Note that something like this already exists in GearPump associated with the REST endpoint
**/api/v1.0/submitapp**. This endpoint accepts a POST with a JSON structure that is reified
into a scala case class shown below
> {code}
> case class SubmitApplicationRequest (
>     appName: String,
>     processors: Map[ProcessorId, ProcessorDescription],
>     dag: Graph[Int, String],
>     userconfig: UserConfig = UserConfig.empty)
> {code}
> This is used to submit the application to the master. The manifest is merely a way of
storing the SubmitApplicationRequest with the jar as a JSON file. 
> h3. UI DAG Editor functionality
> h4. Use case: Creating a new DAG
> DAG editors that are browser based will likely address a number of use cases. One key
use case is the ability to select a local application jar and create or edit a DAG based on
its contents.
> h5. Parsing jar files
> An application jar may be parsed immediately within the browser prior to upload by using
[JsZip|https://stuk.github.io/jszip/]. JsZip can be used within any grade A browser or from
the command line, allowing tooling to be created in either area. JsZip is also quite fast,
parsing a 61M jar in a chrome browser on Mac OS X took 2seconds for JsZip to return all entries.
JsZip has a nice set of features:
> * JsZip can read local jars that are selected using the native file dialog
> * JsZip can read remote jar's, allowing importing of Tasks from external repo's other
the origin server.
> * JsZip can create jar files, opening up the possibility of saving the results of building
an application locally to the users machine. An example of parsing a jar file that had been
dropped within a HTML drop zone can be found [here|http://onehungrymind.com/zip-parsing-jszip-angular/].

> h6. Security risks
> Parsing a jar that has been selected by a user using a browser's native dialog is safe
for several reasons:
> # The jar is already resident on the users computer. 
> # Parsing a jar to read the jar entries using JsZip typically uses modern browser's ArrayBuffer
or Uint8Array. Both are intended to deal with binary data and there is no increased risk of
buffer overflows.
> # Parsing a jar file from a remote url rather than locally should only come from sanctioned
repo's. 
> h3. Manifest Definition
> h4. JSON schema (incomplete)
> see [generator|http://jsonschema.net/#/]
> {code}
> {
>   "$schema": "http://json-schema.org/draft-04/schema#",
>   "id": "http://jsonschema.net",
>   "type": "object",
>   "properties": {
>     "manifestVersion": {
>       "id": "http://jsonschema.net/manifestVersion",
>       "type": "string"
>     },
>     "manifestType": {
>       "enum": ["APPLICATION", "LIBRARY"]
>     },
>     "applicationMain": {
>       "id": "http://jsonschema.net/applicationMain",
>       "type": "string"
>     },
>     "applicationName": {
>       "id": "http://jsonschema.net/applicationName",
>       "type": "string"
>     },
>     "applicationConf": {
>       "id": "http://jsonschema.net/processors/11/1/applicationConf",
>       "type": "object",
>        "properties": {
>             "_config": {
>             "id": "http://jsonschema.net/applicationConf/_config",
>             "type": "object",
>             "properties": {}
>          }
>      },
>     "processors": {
>       "id": "http://jsonschema.net/processors",
>       "type": "array",
>       "items": {
>         "id": "http://jsonschema.net/processors/11",
>         "type": "array",
>         "items": {
>           "id": "http://jsonschema.net/processors/11/1",
>           "type": "object",
>           "properties": {
>             "id": {
>               "id": "http://jsonschema.net/processors/11/1/id",
>               "type": "integer"
>             },
>             "taskClass": {
>               "id": "http://jsonschema.net/processors/11/1/taskClass",
>               "type": "string"
>             },
>             "parallelism": {
>               "id": "http://jsonschema.net/processors/11/1/parallelism",
>               "type": "integer"
>             },
>             "description": {
>               "id": "http://jsonschema.net/processors/11/1/description",
>               "type": "string"
>             },
>             "taskConf": {
>               "id": "http://jsonschema.net/processors/11/1/taskConf",
>               "type": "object",
>               "properties": {
>                 "_config": {
>                   "id": "http://jsonschema.net/processors/11/1/taskConf/_config",
>                   "type": "object",
>                   "properties": {}
>                 }
>               }
>             },
>             "life": {
>               "id": "http://jsonschema.net/processors/11/1/life",
>               "type": "object",
>               "properties": {
>                 "birth": {
>                   "id": "http://jsonschema.net/processors/11/1/life/birth",
>                   "type": "string"
>                 },
>                 "death": {
>                   "id": "http://jsonschema.net/processors/11/1/life/death",
>                   "type": "string"
>                 }
>               }
>             },
>             "executors": {
>               "id": "http://jsonschema.net/processors/11/1/executors",
>               "type": "array",
>               "items": {
>                 "id": "http://jsonschema.net/processors/11/1/executors/0",
>                 "type": "integer"
>               }
>             },
>             "taskCount": {
>               "id": "http://jsonschema.net/processors/11/1/taskCount",
>               "type": "array",
>               "items": {
>                 "id": "http://jsonschema.net/processors/11/1/taskCount/0",
>                 "type": "array",
>                 "items": {
>                   "id": "http://jsonschema.net/processors/11/1/taskCount/0/1",
>                   "type": "object",
>                   "properties": {
>                     "count": {
>                       "id": "http://jsonschema.net/processors/11/1/taskCount/0/1/count",
>                       "type": "integer"
>                     }
>                   }
>                 }
>               }
>             }
>           }
>         }
>       }
>     },
>     "dag": {
>       "id": "http://jsonschema.net/dag",
>       "type": "object",
>       "properties": {
>         "vertexList": {
>           "id": "http://jsonschema.net/dag/vertexList",
>           "type": "array",
>           "items": {
>             "id": "http://jsonschema.net/dag/vertexList/11",
>             "type": "integer"
>           }
>         },
>         "edgeList": {
>           "id": "http://jsonschema.net/dag/edgeList",
>           "type": "array",
>           "items": {
>             "id": "http://jsonschema.net/dag/edgeList/14",
>             "type": "array",
>             "items": {
>               "id": "http://jsonschema.net/dag/edgeList/14/2",
>               "type": "integer"
>             }
>           }
>         }
>       }
>     },
>     "user": {
>       "id": "http://jsonschema.net/user",
>       "type": "string"
>     }
>   },
>   "required": [
>     "manifestVersion",
>     "manifestType",
>     "applicationName",
>     "processors",
>     "dag",
>     "user"
>   ]
> }
> {code}
> h3. TODO
> * describe each schema entry
> * provide sample output
> * scrub schema and validate
> * detail how the DAG editor needs to generate or update the gearpump.json file.
> * describe how dependencies are included and how this can reduce jar size
> h3. ISSUES
> * how to attach to an existing DAG using this manifest - maybe define an optional applicationId
> * the scope here is quite large and may need to be broken into a number of design tasks
or reference them
> h4. Relevance
> >    - #1450 
> >    - #1437
> h4. Related
> * application specific types like TAP/ATK and types of tasks within an application category
> * akka-streams RunnableGraph which has no concept of serde and cannot be persisted or
distributed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message