Date: Thu, 11 Dec 2014 21:18:22 -0800
Subject: Re: Enhance Flink's monitoring capabilities
From: Henry Saputra
To: "dev@flink.incubator.apache.org"

Thanks Robert, looks like we could use this JIRA to do the work.

- Henry

On Thu, Dec 11, 2014 at 9:25 AM, Robert Metzger wrote:
> I think this (very old) issue describes the feature fairly closely:
> https://issues.apache.org/jira/browse/FLINK-456
>
> On Thu, Dec 11, 2014 at 8:32 AM, Henry Saputra wrote:
>
>> Just curious, is there any JIRA filed for this, or was it just in
>> preliminary proposal talk?
>>
>> - Henry
>>
>> On Sun, Dec 7, 2014 at 3:36 PM, Stephan Ewen wrote:
>> > That actually sounds like a great idea. I discussed a bit with Robert
>> > offline on Friday, and it seems that Metrics has most of what we talked
>> > about.
>> >
>> > I also like the way they make it extensible, so people can capture their
>> > own metrics.
>> >
>> > On Sun, Dec 7, 2014 at 6:02 AM, Henry Saputra wrote:
>> >
>> >> Hi Robert,
>> >>
>> >> From what I have seen so far, it is probably better and easier for Flink
>> >> to leverage the Metrics library [1] for metrics collection rather than
>> >> building it organically.
>> >>
>> >> Several ASF projects like Spark [2] and Tajo [3] have used it with great
>> >> success.
>> >>
>> >> The main reasons are maintainability and the breadth of metric types
>> >> that could and should be collected.
>> >>
>> >> - Henry
>> >>
>> >> [1] https://dropwizard.github.io/metrics/3.1.0/getting-started/
>> >> [2] https://spark.apache.org/docs/1.0.1/monitoring.html
>> >> [3] https://issues.apache.org/jira/browse/TAJO-333
>> >>
>> >> On Sat, Dec 6, 2014 at 11:13 AM, Robert Metzger wrote:
>> >> > Hey Nils,
>> >> >
>> >> > I have played around a bit with a little prototype. You can find the
>> >> > code here: https://github.com/rmetzger/incubator-flink/tree/flink456
>> >> > (it's another branch in my repo).
>> >> > You can see the changes that I applied on top of Till's Akka branch here:
>> >> > https://github.com/rmetzger/incubator-flink/compare/tillrohrmann:akka_scala...rmetzger:flink456?expand=1
>> >> >
>> >> > The code collects statistics about each TaskManager in the system.
>> >> > These stats are assembled into a "MetricsReport", which is sent with
>> >> > the periodic heartbeat to the JobManager. The JobManager stores the
>> >> > latest MetricsReport for each TaskManager (in the Instance object for
>> >> > each TM).
>> >> > When the user accesses the TaskManager overview, the latest
>> >> > MetricsReport is sent as a JSONObject to the browser.
>> >> >
>> >> > To test my changes, check out the code and build it:
>> >> > mvn clean package -DskipTests -Dcheckstyle.skip=true
>> >> > go into
>> >> > cd flink-dist/target/flink-0.8-incubating-SNAPSHOT-bin/flink-0.8-incubating-SNAPSHOT/
>> >> > and start the web interface
>> >> > ./bin/start-local.sh
>> >> >
>> >> > Go to localhost:8081; in the "TaskManager" view, you can see some
>> >> > metrics.
>> >> > Here is a screenshot: http://img42.com/eNPve
>> >> >
>> >> > I named my branch after this issue, as it probably best describes what
>> >> > we're working on here: FLINK-456
>> >> >
>> >> > As I said in the beginning, it's really just a prototype. Let me know
>> >> > if you have any further questions.
>> >> > For the "per TaskManager" reports, we should probably integrate some
>> >> > more statistics. Also, the presentation of the numbers is very basic
>> >> > right now. I think there are many good libraries for visualizing these
>> >> > kinds of stats.
>> >> > Also, the numbers currently represent only a "snapshot"; however, some
>> >> > of the numbers can be accumulated (read/write bytes of the I/O manager).
>> >> > Another missing feature is storing a little history of numbers to
>> >> > visualize metrics over time.
>> >> >
>> >> > I'm trying to find time to look into "per job" metrics as well. They
>> >> > will require a bit more infrastructure to distinguish them on the
>> >> > JobManager side and to get them on the TaskManagers.
>> >> >
>> >> > Best,
>> >> > Robert
>> >> >
>> >> > On Tue, Dec 2, 2014 at 2:53 PM, aalexandrov <alexander.s.alexandrov@gmail.com> wrote:
>> >> >
>> >> >> Hello Nils,
>> >> >>
>> >> >> I am going to work on a similar issue related to tracking some basic
>> >> >> statistics of the intermediate results produced by dataflows during
>> >> >> execution.
>> >> >>
>> >> >> I just created a Jira issue here:
>> >> >>
>> >> >> https://issues.apache.org/jira/browse/FLINK-1297
>> >> >>
>> >> >> If you already have some work done on extending the monitoring
>> >> >> capabilities in a branch, it might be good to sync up the development
>> >> >> in order to avoid duplicated work (e.g. using the same communication
>> >> >> channel used to send the data from the task managers to the job
>> >> >> manager).
>> >> >>
>> >> >> --
>> >> >> View this message in context:
>> >> >> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Enhance-Flink-s-monitoring-capabilities-tp2573p2713.html
>> >> >> Sent from the Apache Flink (Incubator) Mailing List archive at Nabble.com.
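
For anyone reading this thread later, here is a rough sketch of the heartbeat flow Robert describes in his Dec 6 mail: each TaskManager assembles a snapshot of its stats, ships it with the heartbeat, and the JobManager keeps only the latest report per TaskManager and renders it as JSON on request. This is not the code from the flink456 branch. Every class, method, and field name below (HeartbeatMetricsSketch, TaskManagerSketch, JobManagerSketch, the MetricsReport fields) is an assumption made up for illustration, the heartbeat is a plain method call rather than an Akka message, and only JDK classes are used.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatMetricsSketch {

    /** Hypothetical immutable snapshot of one TaskManager's statistics. */
    static final class MetricsReport {
        final long timestampMillis;
        final long usedHeapBytes;
        final long maxHeapBytes;
        final int availableProcessors;

        MetricsReport(long timestampMillis, long usedHeapBytes,
                      long maxHeapBytes, int availableProcessors) {
            this.timestampMillis = timestampMillis;
            this.usedHeapBytes = usedHeapBytes;
            this.maxHeapBytes = maxHeapBytes;
            this.availableProcessors = availableProcessors;
        }

        /** Renders the snapshot as a flat JSON object for the web front end. */
        String toJson() {
            return "{\"timestamp\":" + timestampMillis
                + ",\"usedHeapBytes\":" + usedHeapBytes
                + ",\"maxHeapBytes\":" + maxHeapBytes
                + ",\"availableProcessors\":" + availableProcessors + "}";
        }
    }

    /** Stand-in for a TaskManager: collects stats and piggybacks them on a heartbeat. */
    static final class TaskManagerSketch {
        final String id;

        TaskManagerSketch(String id) {
            this.id = id;
        }

        /** Takes a point-in-time snapshot using only JDK runtime information. */
        MetricsReport collect() {
            Runtime rt = Runtime.getRuntime();
            long used = rt.totalMemory() - rt.freeMemory();
            return new MetricsReport(System.currentTimeMillis(), used,
                                     rt.maxMemory(), rt.availableProcessors());
        }

        /** In the real system this would be a remote message; here it is a direct call. */
        void sendHeartbeat(JobManagerSketch jobManager) {
            jobManager.receiveHeartbeat(id, collect());
        }
    }

    /** Stand-in for the JobManager: remembers only the latest report per TaskManager. */
    static final class JobManagerSketch {
        private final Map<String, MetricsReport> latestReports = new ConcurrentHashMap<>();

        void receiveHeartbeat(String taskManagerId, MetricsReport report) {
            latestReports.put(taskManagerId, report); // overwrites the previous snapshot
        }

        /** Returns the latest report as JSON, or an empty object if none has arrived yet. */
        String overviewJson(String taskManagerId) {
            MetricsReport report = latestReports.get(taskManagerId);
            return report == null ? "{}" : report.toJson();
        }
    }

    public static void main(String[] args) {
        JobManagerSketch jobManager = new JobManagerSketch();
        TaskManagerSketch taskManager = new TaskManagerSketch("tm-1");
        taskManager.sendHeartbeat(jobManager);
        System.out.println(jobManager.overviewJson("tm-1"));
    }
}
```

Because the JobManager overwrites the stored report on every heartbeat, this sketch has the same limitation Robert mentions: it only ever shows a snapshot, with no accumulation and no history over time.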