Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@flink.apache.org
Delivered-To: mailing list user@flink.apache.org
MIME-Version: 1.0
Date: Mon, 4 Jul 2016 12:34:09 +0200
Subject: Re: Different results on local and on cluster
From: Flavio Pompermaier <pompermaier@okkam.it>
To: user
Content-Type: multipart/alternative; boundary=001a114101d864eb1e0536cce047
archived-at: Mon, 04 Jul 2016 10:34:15 -0000

--001a114101d864eb1e0536cce047
Content-Type: text/plain; charset=UTF-8

Because I don't see any good reason for that... Maybe all the Kryo
serialization errors that I get from time to time could also be symptomatic
of some other error in how Flink manages its internal buffers, but this is
just another personal guess of mine.

On 4 Jul 2016 12:29 p.m., "Ufuk Celebi" <uce@apache.org> wrote:

> It's not possible to tell. You would have to look into the logs of the
> job manager to check what happened. The task manager that was not killed
> could have reconnected to the job manager, if it was restarted quickly
> after the failure. Why do you think that the task manager would
> influence the job result though?
>
> On Mon, Jul 4, 2016 at 12:23 PM, Flavio Pompermaier
> <pompermaier@okkam.it> wrote:
> > No, I haven't.
> > I fear that an unkilled taskmanager could have been the cause of this
> > problem.
> > The last day I ran the job, I discovered that on some nodes there were
> > zombie taskmanagers that weren't terminated during the stop-cluster.
> > What do you think? What happens in this situation? Are old taskmanagers
> > still able to interfere with the new jobmanager?
> > In the web dashboard I didn't see them, so I thought they weren't
> > problematic at all and I just killed them.
> >
> > On 4 Jul 2016 12:07 p.m., "Ufuk Celebi" <uce@apache.org> wrote:
> >
> > I guess Aljoscha was asking whether you also have broadcast
> > input or something like it?
> >
> > On Fri, Jul 1, 2016 at 7:05 PM, Flavio Pompermaier <pompermaier@okkam.it>
> > wrote:
> >> What do you mean exactly?
> >>
> >> On 1 Jul 2016 18:58, "Aljoscha Krettek" <aljoscha@apache.org> wrote:
> >>>
> >>> Hi,
> >>> do you have any data in the coGroup/groupBy operators that you use,
> >>> besides the input data?
> >>>
> >>> Cheers,
> >>> Aljoscha
> >>>
> >>> On Fri, 1 Jul 2016 at 14:17 Flavio Pompermaier <pompermaier@okkam.it>
> >>> wrote:
> >>>>
> >>>> Hi to all,
> >>>> I have a Flink job that computes data correctly when launched locally
> >>>> from my IDE, but not when launched on the cluster.
> >>>>
> >>>> Is there any suggestion/example on how to identify the problematic
> >>>> operators in a case like this?
> >>>> I think the root cause is that some operator (e.g.
> >>>> coGroup/groupBy, etc.), which I assume to have all the data for a key,
> >>>> maybe does not (because the data is partitioned among nodes).
> >>>>
> >>>> Any help is appreciated,
> >>>> Flavio

--001a114101d864eb1e0536cce047
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

--001a114101d864eb1e0536cce047--
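
Aljoscha's question in the thread (whether the coGroup/groupBy operators hold any data besides their input) points at one way a job can behave differently locally and on a cluster. The sketch below is a plain-Python simulation, not Flink API: `partition_by_key`, `NumberingGroupReducer` and `run` are hypothetical names, and `key % parallelism` merely stands in for hash partitioning. It assumes that keyed grouping sends all records of a key to the same parallel instance, and that each parallel instance gets its own copy of any member field. Whether this is what happened in Flavio's job is not known from the thread.

```python
from collections import defaultdict


def partition_by_key(records, parallelism):
    """Send every (key, value) record to one of `parallelism` partitions.

    `key % parallelism` stands in for hash partitioning: all records that
    share a key always land in the same partition.
    """
    partitions = [[] for _ in range(parallelism)]
    for key, value in records:
        partitions[key % parallelism].append((key, value))
    return partitions


class NumberingGroupReducer:
    """Hypothetical group function holding data besides its input."""

    def __init__(self):
        # Per-instance state: every parallel instance starts from its own 0.
        self.next_id = 0

    def reduce_group(self, key, values):
        group_id = self.next_id
        self.next_id += 1
        return (group_id, key, sum(values))


def run(records, parallelism):
    """Group by key and reduce each group, one reducer instance per partition."""
    results = []
    for partition in partition_by_key(records, parallelism):
        reducer = NumberingGroupReducer()
        groups = defaultdict(list)
        for key, value in partition:
            groups[key].append(value)
        for key in sorted(groups):
            results.append(reducer.reduce_group(key, groups[key]))
    return results


records = [(key, value) for key in range(6) for value in (1, 2, 3)]
local = run(records, parallelism=1)    # a run with a single instance
cluster = run(records, parallelism=3)  # a run with three parallel instances

# Per-key sums agree, so each group did see all the data for its key.
print(sorted((k, s) for _, k, s in local) == sorted((k, s) for _, k, s in cluster))
# Ids taken from the member field: unique with one instance, repeated with three.
print(len({gid for gid, _, _ in local}), len({gid for gid, _, _ in cluster}))
```

In this simulation the per-key sums are identical for both runs, while the ids derived from the member field are unique only when a single instance processes every group. That is why checking for extra state inside group functions is a reasonable first step when local and cluster results diverge.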