From: Si-li Liu
Date: Mon, 7 Nov 2016 17:02:21 +0800
Subject: Flink failed when can not connect to BlobServer
To: user@flink.apache.org

Hi all,

I am using the Flink DataSet API for a batch job that reads some logs, then groups and sorts them. Our cluster has almost 2000 servers, and we are used to running traditional MR jobs. I tried Flink for an experimental job, but I ran into the error below and cannot continue. Can anyone help with it?

Our MR jobs also run into this kind of connection error sometimes, but they retry several times and then succeed. In Flink, it seems that the whole computation fails when a single task fails.

java.io.IOException: Cannot get library with hash 858478de9791c1a5fbbb138c02ec182b916f7962
	at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerReferenceToBlobKeyAndGetURL(BlobLibraryCacheManager.java:262)
	at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:116)
	at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:721)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:472)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to fetch BLOB 858478de9791c1a5fbbb138c02ec182b916f7962 from /10.132.99.150:42927 and store it under /tmp/blobStore-a2b79e70-74b9-49e8-a5bb-f2842aeec3b0/cache/blob_858478de9791c1a5fbbb138c02ec182b916f7962
	at org.apache.flink.runtime.blob.BlobCache.getURL(BlobCache.java:177)
	at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerReferenceToBlobKeyAndGetURL(BlobLibraryCacheManager.java:253)
	... 4 more
Caused by: java.io.IOException: Could not connect to BlobServer at address /10.132.99.150:42927
	at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:88)
	at org.apache.flink.runtime.blob.BlobCache.getURL(BlobCache.java:124)
	... 5 more
Caused by: java.net.ConnectException: Connection timed out
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at java.net.Socket.connect(Socket.java:538)
	at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:84)
	... 6 more

--
Best regards

Sili Liu
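
P.S. To make the retry behaviour I mentioned for our MR jobs concrete, here is a minimal sketch in plain Java. It is only an illustration of "retry a few times, then give up", not Flink or Hadoop code: the class name `RetrySketch`, the method `withRetries`, and the attempt count and delay are all hypothetical, and the failing fetch is simulated instead of opening a real connection to the BlobServer.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.Callable;

public class RetrySketch {

    /**
     * Runs the action up to maxAttempts times, sleeping delayMillis between
     * failed attempts. Only IOExceptions (which include ConnectException)
     * are retried; the last one is rethrown if every attempt fails.
     */
    public static <T> T withRetries(Callable<T> action, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw lastFailure;
    }

    public static void main(String[] args) throws Exception {
        // Simulated fetch (an assumption for the demo): it fails twice with a
        // connection error and succeeds on the third call.
        int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new ConnectException("Connection timed out");
            }
            return "fetched";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With this kind of wrapper a transient connection timeout only costs a few extra attempts, which is the behaviour I was hoping to get for the task that failed above.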