From: Saliya Ekanayake
Date: Thu, 7 Jul 2016 23:57:31 -0400
Subject: Flink is Unstable when TM > 1
To: user@flink.apache.org
Reply-To: user@flink.apache.org
Mailing-List: contact user-help@flink.apache.org; run by ezmlm
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a11395e0c545195053717cd4d
archived-at: Fri, 08 Jul 2016 03:57:38 -0000

--001a11395e0c545195053717cd4d
Content-Type: text/plain; charset=UTF-8

Hi,

I've been trying to run the provided KMeans example on a 16-node cluster. I was testing with 2 Task Managers (TMs) per node because each node has 2 sockets (CPUs). Each socket contains 12 cores, so I've set the number of slots per TM to 12. The total parallelism is 384 (12 slots x 2 TMs x 16 nodes).

However, the Flink TMs keep failing from time to time, causing KMeans to fail. The only explanation I could find in the logs is that the TMs unregister from the Job Manager. I've also increased the Akka timeout to 1000s.

Any suggestions on this?

The data sizes I tried were 10k points, 250k points, and 1 million points. The number of centers ranged from 100 to 1000. None of these runs completed.

Thank you,
Saliya

--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

--001a11395e0c545195053717cd4d
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a11395e0c545195053717cd4d--