Subject: Re: Partition problem
From: Till Rohrmann
To: dev@flink.apache.org
Date: Mon, 25 Apr 2016 11:27:28 +0200

Hi Andrew,

I think the problem is that you assume that both matrices have the same partitioning.
If you can guarantee that this is the case, then you can use the subtask index as the block index. In the general case, however, it does not hold, and you have to calculate the blocks by first assigning a block index to every row (e.g. rows with indices 0-9 are assigned to block 0, rows with indices 10-19 to block 1, etc.) and then creating the blocks by reducing on this block index. That's because the distribution of the individual rows in the cluster is not necessarily the same for the two matrices.

Cheers,
Till

On Mon, Apr 25, 2016 at 1:40 AM, Andrew Palumbo wrote:

> Hi All,
>
> I've run into a problem with empty partitions when the number of elements
> in a DataSet is less than the Degree of Parallelism. I've created a gist
> here to describe it:
>
> https://gist.github.com/andrewpalumbo/1768dac6d2f5fdf963abacabd859aaf3
>
> I have two 2x2 matrices, Matrix A and Matrix B, and an execution
> environment where the degree of parallelism is 4. Both matrices are
> blockified in 2 different DataSets. In this case (the case of 2x2
> matrices with 4 partitions) this means that each row goes into a partition,
> leaving 2 empty partitions. In Matrix A, the rows go into partitions 0, 1.
> However, the rows of Matrix B end up in partitions 1, 2. I assign the
> ordinal index of the blockified matrix's partition to its block, and then
> join on that index.
>
> However, in this case, with differently partitioned matrices of the same
> geometry, the intersection of the blockified matrices' indices is 1, and
> partitions 0 and 2 are dropped.
>
> I've tried explicitly defining the dop for Matrix B using the count of
> non-empty partitions in Matrix A; however, this changes the order of the
> DataSet, placing partition 2 into partition 0.
>
> Is there a way to make sure that these datasets are partitioned in the
> same way?
>
> Thank you,
>
> Andy
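
The block-index approach suggested in the reply can be sketched roughly as follows. This is plain Python, not Flink API code. The names `block_index`, `blockify`, and `rows_per_block` are hypothetical and chosen only for illustration, and the matrix values are made up. The sketch assumes every row carries its own row index, so the block a row belongs to depends only on that index and never on the partition the row happens to land in.

```python
from collections import defaultdict


def block_index(row_index, rows_per_block):
    # Derive the block from the row's own index, not from the subtask
    # or partition that currently holds the row.
    return row_index // rows_per_block


def blockify(rows, rows_per_block):
    # rows: iterable of (row_index, row_values) pairs, in any order.
    # Grouping by block index plays the role of "reducing on the block index".
    groups = defaultdict(list)
    for row_index, values in rows:
        groups[block_index(row_index, rows_per_block)].append((row_index, values))
    # Inside each block, restore row order by sorting on the row index.
    return {
        block: [values for _, values in sorted(members, key=lambda m: m[0])]
        for block, members in groups.items()
    }


# Two 2x2 matrices whose rows arrive in different orders, standing in for
# two DataSets that were distributed differently across the cluster.
a_rows = [(0, [1.0, 2.0]), (1, [3.0, 4.0])]
b_rows = [(1, [7.0, 8.0]), (0, [5.0, 6.0])]

blocks_a = blockify(a_rows, rows_per_block=1)
blocks_b = blockify(b_rows, rows_per_block=1)

# Both matrices now expose the same block indices, so a join on the
# block index keeps every block.
shared = sorted(set(blocks_a) & set(blocks_b))
```

Because the key is computed from the row index, two matrices of the same geometry always produce the same set of block indices, regardless of how their rows were spread over partitions.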