This is an automated email from the ASF dual-hosted git repository.

altay pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/beam-site.git

commit e5a77380d8358288616c476bfee6557d3463bc60
Author: Ahmet Altay
AuthorDate: Fri Oct 6 08:50:29 2017 -0700

    Regenerate website
---
 content/documentation/sdks/java/index.html    |   1 +
 content/documentation/sdks/nexmark/index.html | 790 ++++++++++++++++++++++++++
 2 files changed, 791 insertions(+)

diff --git a/content/documentation/sdks/java/index.html b/content/documentation/sdks/java/index.html
index 6550038..cb39f49 100644
--- a/content/documentation/sdks/java/index.html
+++ b/content/documentation/sdks/java/index.html
@@ -168,6 +168,7 @@
  • join-library provides inner join, outer left join, and outer right join functions.
  • sorter is an efficient and scalable sorter for large iterables.
  • Nexmark is a benchmark suite that runs in batch and streaming modes.
diff --git a/content/documentation/sdks/nexmark/index.html b/content/documentation/sdks/nexmark/index.html
new file mode 100644
index 0000000..559cdc1
--- /dev/null
+++ b/content/documentation/sdks/nexmark/index.html
@@ -0,0 +1,790 @@
Nexmark benchmark suite
What it is

Nexmark is a suite of pipelines inspired by the ‘continuous data stream’ queries in the Nexmark research paper.

These are multiple queries over a three-entity model representing an online auction system:
  • Person represents a person submitting an item for auction and/or making a bid on an auction.
  • Auction represents an item under auction.
  • Bid represents a bid for an item under auction.
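To make the model concrete, here is a hypothetical sketch of the three entities as plain Java classes; the field names below are assumptions for illustration, not the exact schema of the Nexmark Java classes:

```java
// Hypothetical sketch of the three-entity auction model; the real Nexmark
// classes live in the Java SDK and carry richer schemas.
public class AuctionModelSketch {
  static class Person {            // submits auctions and/or bids
    final long id;
    final String name;
    final String state;            // e.g. the US state Query3 filters on
    Person(long id, String name, String state) {
      this.id = id;
      this.name = name;
      this.state = state;
    }
  }

  static class Auction {           // an item under auction
    final long id;
    final long sellerId;           // references a Person
    final String category;         // grouped on by Query4
    Auction(long id, long sellerId, String category) {
      this.id = id;
      this.sellerId = sellerId;
      this.category = category;
    }
  }

  static class Bid {               // a bid for an item under auction
    final long auctionId;          // references an Auction
    final long bidderId;           // references a Person
    final long priceCents;
    Bid(long auctionId, long bidderId, long priceCents) {
      this.auctionId = auctionId;
      this.bidderId = bidderId;
      this.priceCents = priceCents;
    }
  }
}
```

In this sketch, a query such as Query3 would join Auction.sellerId against Person.id and filter on Person.state.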

The queries

The queries exercise many aspects of the Beam model:
  • Query1: What are the bid values in Euros? Illustrates a simple map.
  • Query2: What are the auctions with particular auction numbers? Illustrates a simple filter.
  • Query3: Who is selling in particular US states? Illustrates an incremental join (using per-key state and timers) and filter.
  • Query4: What is the average selling price for each auction category? Illustrates complex join (using custom window functions) and aggregation.
  • Query5: Which auctions have seen the most bids in the last period? Illustrates sliding windows and combiners.
  • Query6: What is the average selling price per seller for their last 10 closed auctions? Shares the same ‘winning bids’ core as Query4, and illustrates a specialized combiner.
  • Query7: What are the highest bids per period? Deliberately implemented using a side input to illustrate fanout.
  • Query8: Who has entered the system and created an auction in the last period? Illustrates a simple join.
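To give a flavor of the simplest of these, Query1's per-element conversion can be sketched as a plain function; the 89/100 rate below is a made-up placeholder, not the rate the suite actually uses:

```java
// Query1-style map: convert each bid price to euros, element by element.
// The 89/100 conversion rate is an illustrative placeholder.
public class Query1Sketch {
  static long toEuroCents(long priceCents) {
    return priceCents * 89 / 100;
  }

  public static void main(String[] args) {
    long[] bidPrices = {100, 2000, 55};
    for (long price : bidPrices) {
      System.out.println(price + " -> " + toEuroCents(price)); // e.g. 100 -> 89
    }
  }
}
```

In the real suite this runs as a Beam transform over the generated stream of bids; the point is just that Query1 needs nothing beyond a stateless element-wise map.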
We have augmented the original queries with five more:
  • Query0: Pass-through. Allows us to measure the monitoring overhead.
  • Query9: Winning-bids. A common sub-query shared by Query4 and Query6.
  • Query10: Log all events to GCS files. Illustrates windows with large side effects on firing.
  • Query11: How many bids did a user make in each session they were active? Illustrates session windows.
  • Query12: How many bids does a user make within a fixed processing-time limit? Illustrates working in processing time in the Global window, as compared with event time in non-Global windows for all the other queries.
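As an illustration of the session windowing that Query11 relies on, the grouping rule can be sketched as follows (a simplified model with a hypothetical gap parameter, not Beam's implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Session-window sketch: a user's events are grouped into sessions, and a
// new session starts whenever the gap since the previous event exceeds the
// session gap. Query11 then counts the bids per session.
public class SessionWindowSketch {
  static List<List<Long>> sessions(long[] sortedEventMs, long gapMs) {
    List<List<Long>> out = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    for (long t : sortedEventMs) {
      // Close the current session if this event arrives after the gap.
      if (!current.isEmpty() && t - current.get(current.size() - 1) > gapMs) {
        out.add(current);
        current = new ArrayList<>();
      }
      current.add(t);
    }
    if (!current.isEmpty()) {
      out.add(current);
    }
    return out;
  }
}
```

With a 5s gap, events at 0s, 1s and 9s would form two sessions: {0s, 1s} and {9s}.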

Benchmark workload configuration

Here are some of the knobs of the benchmark workload (see NexmarkConfiguration.java).

These configuration items can be passed to the launch command line.

Events generation (defaults)
  • 100 000 events generated
  • 100 generator threads
  • Event rate shaped as a sine curve
  • Initial event rate of 10 000
  • Event rate step of 10 000
  • 100 concurrent auctions
  • 1000 concurrent persons bidding / creating auctions

Windows (defaults)

  • size 10s
  • sliding period 5s
  • watermark hold for 0s
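With these defaults, every event timestamp falls into size/period = 2 overlapping windows. The assignment rule can be sketched as follows (illustrative arithmetic, not Beam's own windowing code):

```java
// For sliding windows, an event belongs to every window [start, start + size)
// that contains its timestamp. This computes those window start times,
// assuming size is a multiple of period (as in the defaults above).
public class SlidingWindowSketch {
  static long[] windowStarts(long eventMs, long sizeMs, long periodMs) {
    int count = (int) (sizeMs / periodMs);           // windows per event
    long[] starts = new long[count];
    long lastStart = eventMs - (eventMs % periodMs); // most recent window start
    for (int i = 0; i < count; i++) {
      starts[i] = lastStart - i * periodMs;
    }
    return starts;
  }
}
```

An event at t = 12s, for example, falls into the two windows starting at 10s and 5s, i.e. [10s, 20s) and [5s, 15s).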
Events Proportions (defaults)

  • Hot Auctions = ½
  • Hot Bidders = ¼
  • Hot Sellers = ¼

Technical

  • Artificial CPU load
  • Artificial IO load

Nexmark output

Here is an example output of the Nexmark benchmark run in streaming mode with the SMOKE suite on the (local) direct runner:
Performance:
  Conf       Runtime(sec)         Events(/sec)         Results
  0000                5,5              18138,9          100000
  0001                4,2              23657,4           92000
  0002                2,2              45683,0             351
  0003                3,9              25348,5             444
  0004                1,6               6207,3              40
  0005                5,0              20173,5              12
  0006                0,9              11376,6             401
  0007              121,4                823,5               1
  0008                2,5              40273,9            6000
  0009                0,9              10695,2             298
  0010                4,0              25025,0               1
  0011                4,4              22655,2            1919
  0012                3,5              28208,7            1919
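Reading the table: Events(/sec) appears to be simply the number of generated events divided by the runtime, and Results the count of output elements the query emitted. A small sanity check of that relationship:

```java
// Sanity check for the Events(/sec) column: throughput = events / runtime.
// For conf 0000 above, 100000 events over ~5.5s is ~18182 events/sec, close
// to the reported 18138,9 (the table rounds the runtime to one decimal).
public class ThroughputCheck {
  static double eventsPerSec(long numEvents, double runtimeSec) {
    return numEvents / runtimeSec;
  }

  public static void main(String[] args) {
    System.out.println(eventsPerSec(100000L, 5.5));
  }
}
```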

Benchmark launch configuration

We can specify the Beam runner to use with Maven profiles. The available profiles are:
direct-runner
spark-runner
flink-runner
apex-runner

The runner must also be specified, as in any other Beam pipeline, using:

--runner

Test data is deterministically synthesized on demand. The test data may be synthesized in the same pipeline as the query itself, or may be published to Pub/Sub.

The query results may be:
  • Published to Pub/Sub.
  • Written to text files as plain text.
  • Written to text files using an Avro encoding.
  • Sent to BigQuery.
  • Discarded.

Common configuration parameters

Decide whether to run in batch or streaming mode:

--streaming=true

Number of event generators:

--numEventGenerators=4

Run query N:

--query=N

Available Suites

The suite to run can be chosen using this configuration parameter:

--suite=SUITE

Available suites are:

  • DEFAULT: Test the default configuration with query 0.
  • SMOKE: Run the 12 default configurations.
  • STRESS: Like SMOKE but with 1m events.
  • FULL_THROTTLE: Like SMOKE but with 100m events.

Apex runner specific configuration

--manageResources=false --monitorJobs=false

Google Cloud Dataflow runner specific configuration

--manageResources=false --monitorJobs=true \
--enforceEncodability=false --enforceImmutability=false \
--project=<your project> \
--zone=<your zone> \
--workerMachineType=n1-highmem-8 \
--stagingLocation=<a gs path for staging> \
--runner=DataflowRunner \
--tempLocation=gs://talend-imejia/nexmark/temp/ \
--stagingLocation=gs://talend-imejia/nexmark/temp/staging/ \
--filesToStage=target/beam-sdks-java-nexmark-2.1.0-SNAPSHOT.jar

Direct runner specific configuration

--manageResources=false --monitorJobs=true \
--enforceEncodability=false --enforceImmutability=false
Flink runner specific configuration

--manageResources=false --monitorJobs=true \
--flinkMaster=local --parallelism=#numcores

Spark runner specific configuration

--manageResources=false --monitorJobs=true \
--sparkMaster=local \
-Dspark.ui.enabled=false -DSPARK_LOCAL_IP=localhost -Dsun.io.serialization.extendedDebugInfo=true

Current status

These tables contain the status of the query runs on the different runners. Google Cloud Dataflow and Apache Gearpump statuses are yet to come.

Batch / Synthetic / Local

  Query | Direct | Spark     | Flink | Apex
  ------+--------+-----------+-------+----------
  0     | ok     | ok        | ok    | ok
  1     | ok     | ok        | ok    | ok
  2     | ok     | ok        | ok    | ok
  3     | ok     | BEAM-1035 | ok    | BEAM-1037
  4     | ok     | ok        | ok    | ok
  5     | ok     | ok        | ok    | ok
  6     | ok     | ok        | ok    | ok
  7     | ok     | ok        | ok    | ok
  8     | ok     | ok        | ok    | ok
  9     | ok     | ok        | ok    | ok
  10    | ok     | ok        | ok    | ok
  11    | ok     | ok        | ok    | ok
  12    | ok     | ok        | ok    | ok

Streaming / Synthetic / Local

  Query | Direct | Spark (BEAM-2847) | Flink | Apex
  ------+--------+-------------------+-------+----------
  0     | ok     | ok                | ok    | ok
  1     | ok     | ok                | ok    | ok
  2     | ok     | ok                | ok    | ok
  3     | ok     | BEAM-1035         | ok    | BEAM-1037
  4     | ok     | ok                | ok    | ok
  5     | ok     | ok                | ok    | ok
  6     | ok     | ok                | ok    | ok
  7     | ok     | BEAM-2112         | ok    | ok
  8     | ok     | ok                | ok    | ok
  9     | ok     | ok                | ok    | ok
  10    | ok     | ok                | ok    | ok
  11    | ok     | ok                | ok    | ok
  12    | ok     | ok                | ok    | ok

Batch / Synthetic / Cluster

Yet to come.

Streaming / Synthetic / Cluster

Yet to come.

Running Nexmark

Running SMOKE suite on the DirectRunner (local)

Batch Mode:
mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main -Pdirect-runner -Dexec.args="--runner=DirectRunner --suite=SMOKE --streaming=false --manageResources=false --monitorJobs=true --enforceEncodability=true --enforceImmutability=true"

Streaming Mode:

mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main -Pdirect-runner -Dexec.args="--runner=DirectRunner --suite=SMOKE --streaming=true --manageResources=false --monitorJobs=true --enforceEncodability=true --enforceImmutability=true"

Running SMOKE suite on the SparkRunner (local)

Batch Mode:

mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main -Pspark-runner "-Dexec.args=--runner=SparkRunner --suite=SMOKE --streamTimeout=60 --streaming=false --manageResources=false --monitorJobs=true"

Streaming Mode:

mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main -Pspark-runner "-Dexec.args=--runner=SparkRunner --suite=SMOKE --streamTimeout=60 --streaming=true --manageResources=false --monitorJobs=true"

Running SMOKE suite on the FlinkRunner (local)

Batch Mode:

mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main -Pflink-runner "-Dexec.args=--runner=FlinkRunner --suite=SMOKE --streamTimeout=60 --streaming=false --manageResources=false --monitorJobs=true --flinkMaster=local"

Streaming Mode:

mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main -Pflink-runner "-Dexec.args=--runner=FlinkRunner --suite=SMOKE --streamTimeout=60 --streaming=true --manageResources=false --monitorJobs=true --flinkMaster=local"

Running SMOKE suite on the ApexRunner (local)

Batch Mode:

mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main -Papex-runner "-Dexec.args=--runner=ApexRunner --suite=SMOKE --streamTimeout=60 --streaming=false --manageResources=false --monitorJobs=false"

Streaming Mode:

mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main -Papex-runner "-Dexec.args=--runner=ApexRunner --suite=SMOKE --streamTimeout=60 --streaming=true --manageResources=false --monitorJobs=false"

Running SMOKE suite on Google Cloud Dataflow

Building package:

mvn clean package -Pdataflow-runner

Submit to Google Dataflow service:
First, launch the pipeline that publishes events to Pub/Sub:

java -cp sdks/java/nexmark/target/beam-sdks-java-nexmark-bundled-2.1.0-SNAPSHOT.jar \
  org.apache.beam.sdk.nexmark.Main \
  --runner=DataflowRunner \
  --project=<your project> \
  --zone=<your zone> \
  --workerMachineType=n1-highmem-8 \
  --stagingLocation=<a gs path for staging> \
  --streaming=true \
  --sourceType=PUBSUB \
  --pubSubMode=PUBLISH_ONLY \
  --pubsubTopic=<an existing Pubsub topic> \
  --resourceNameMode=VERBATIM \
  --manageResources=false \
  --monitorJobs=false \
  --numEventGenerators=64 \
  --numWorkers=16 \
  --maxNumWorkers=16 \
  --suite=SMOKE \
  --firstEventRate=100000 \
  --nextEventRate=100000 \
  --ratePeriodSec=3600 \
  --isRateLimited=true \
  --avgPersonByteSize=500 \
  --avgAuctionByteSize=500 \
  --avgBidByteSize=500 \
  --probDelayedEvent=0.000001 \
  --occasionalDelaySec=3600 \
  --numEvents=0 \
  --useWallclockEventTime=true \
  --usePubsubPublishTime=true \
  --experiments=enable_custom_pubsub_sink
Then, launch the query pipeline that subscribes to the same topic:

java -cp sdks/java/nexmark/target/beam-sdks-java-nexmark-bundled-2.1.0-SNAPSHOT.jar \
  org.apache.beam.sdk.nexmark.Main \
  --runner=DataflowRunner \
  --project=<your project> \
  --zone=<your zone> \
  --workerMachineType=n1-highmem-8 \
  --stagingLocation=<a gs path for staging> \
  --streaming=true \
  --sourceType=PUBSUB \
  --pubSubMode=SUBSCRIBE_ONLY \
  --pubsubSubscription=<an existing Pubsub subscription to above topic> \
  --resourceNameMode=VERBATIM \
  --manageResources=false \
  --monitorJobs=false \
  --numWorkers=64 \
  --maxNumWorkers=64 \
  --suite=SMOKE \
  --usePubsubPublishTime=true \
  --outputPath=<a gs path under which log files will be written> \
  --windowSizeSec=600 \
  --occasionalDelaySec=3600 \
  --maxLogEvents=10000 \
  --experiments=enable_custom_pubsub_source

Running query 0 on a Spark cluster with Apache Hadoop YARN

Building package:

mvn clean package -Pspark-runner

Submit to the cluster:

spark-submit --master yarn-client --class org.apache.beam.sdk.nexmark.Main --driver-memory 512m --executor-memory 512m --executor-cores 1 beam-sdks-java-nexmark-bundled-2.1.0-SNAPSHOT.jar --runner=SparkRunner --query=0 --streamTimeout=60 --streaming=false --manageResources=false --monitorJobs=true
--
To stop receiving notification emails like this one, please contact "commits@beam.apache.org".