From: "Vasia Kalavri (JIRA)"
To: issues@flink.apache.org
Reply-To: dev@flink.apache.org
Date: Wed, 15 Jul 2015 09:23:04 +0000 (UTC)
Subject: [jira] [Commented] (FLINK-2361) flatMap + distinct gives erroneous results for big data sets

    [ https://issues.apache.org/jira/browse/FLINK-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627773#comment-14627773 ]

Vasia Kalavri commented on FLINK-2361:
--------------------------------------

Hey,

I have seen this exception before and I think it was when debugging FLINK-1930. I don't think this is an operator correctness problem either.

[~andralungu], I take it vertex 657282846 actually exists in the dataset and it's not garbage, right?

As far as I remember, what seems to be happening is that the vertex dataset has not been fully generated when the message to this vertex ID is sent, i.e. some kind of blocking issue.
Can you check if this problem is still there when you generate the vertex set separately, store it to disk and get input from edges and vertex files?

> flatMap + distinct gives erroneous results for big data sets
> ------------------------------------------------------------
>
>                 Key: FLINK-2361
>                 URL: https://issues.apache.org/jira/browse/FLINK-2361
>             Project: Flink
>          Issue Type: Bug
>          Components: Gelly
>    Affects Versions: 0.10
>            Reporter: Andra Lungu
>
> When running the simple Connected Components algorithm (currently in Gelly) on the Twitter follower graph, with 1, 100 or 10000 iterations, I get the following error:
> Caused by: java.lang.Exception: Target vertex '657282846' does not exist!.
> 	at org.apache.flink.graph.spargel.VertexCentricIteration$VertexUpdateUdfSimpleVV.coGroup(VertexCentricIteration.java:300)
> 	at org.apache.flink.runtime.operators.CoGroupWithSolutionSetSecondDriver.run(CoGroupWithSolutionSetSecondDriver.java:220)
> 	at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
> 	at org.apache.flink.runtime.iterative.task.AbstractIterativePactTask.run(AbstractIterativePactTask.java:139)
> 	at org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:107)
> 	at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
> 	at java.lang.Thread.run(Thread.java:722)
> Now this is very bizarre as the DataSet of vertices is produced from the DataSet of edges... Which means there cannot be an edge with an invalid target id... The method calls flatMap to isolate the src and trg ids and distinct to ensure their uniqueness.
> The algorithm works fine for smaller data sets...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
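
For reference, the invariant the report relies on (the vertex set is derived from the edges via flatMap + distinct, so every edge target must be in it) can be sketched in plain Java, outside Flink. This is only an illustration under assumptions: it uses java.util.stream rather than the Flink DataSet API, and the names VertexSetCheck, Edge, vertexIds and missingTargets are hypothetical, not taken from Gelly. It does not reproduce the failure; it only shows the check that should come back empty on a correctly materialized vertex set.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class VertexSetCheck {

    /** Hypothetical edge type: a (source, target) pair of vertex IDs. */
    static final class Edge {
        final long src;
        final long trg;

        Edge(long src, long trg) {
            this.src = src;
            this.trg = trg;
        }
    }

    /** flatMap each edge to its src and trg IDs, then keep each ID once. */
    static Set<Long> vertexIds(List<Edge> edges) {
        return edges.stream()
                .flatMap(e -> Stream.of(e.src, e.trg))
                .distinct() // mirrors the distinct step in the report
                .collect(Collectors.toSet()); // Set only for fast lookup below
    }

    /** Returns every edge target ID that is absent from the vertex set. */
    static Set<Long> missingTargets(List<Edge> edges, Set<Long> vertices) {
        return edges.stream()
                .map(e -> e.trg)
                .filter(id -> !vertices.contains(id))
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<Edge> edges = Arrays.asList(
                new Edge(1L, 2L), new Edge(2L, 3L), new Edge(1L, 3L));
        Set<Long> vertices = vertexIds(edges);
        System.out.println("vertex count: " + vertices.size());
        System.out.println("missing targets: " + missingTargets(edges, vertices));
    }
}
```

Running the same kind of check against a vertex set that was written to disk and read back, as suggested in the comment, would help distinguish a wrong flatMap + distinct result from a vertex set that is simply not complete yet when the iteration starts.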