Date: Wed, 4 Nov 2015 11:30:07 -0800
Subject: Re: Data in AsterixDB skewing towards one node
From: Pouria Pirzadeh
To: users@asterixdb.incubator.apache.org

Hi Max,

Can you please explain this part a bit more:
"… When I load the external data it is all saved on a single node"

Are you using "external datasets", or "internal datasets" loaded from files on HDFS?

If you are using external datasets, AsterixDB does not actually load anything. It just gets the locations of the blocks on HDFS and remembers them. In that case, any issue with uniform distribution of the data files is really related to HDFS and not AsterixDB. But if you are loading an internal dataset by reading records from files on HDFS and you see issues with uniform distribution of the created on-disk components, then that is a different issue and could be related to AsterixDB.

Pouria

On Wed, Nov 4, 2015 at 11:23 AM, <schultze@informatik.hu-berlin.de> wrote:

> Hello,
>
> I have a cluster setup of AsterixDB running on 4 nodes, with the first being
> the master node and a node controller running on each of them. As a test I
> run TPC-H queries on them, loading the generated TPC-H datasets from a
> Hadoop distributed file system.
>
> When I load the external data it is all saved on a single node. For later
> querying that means that most of the computations are done by that single
> node, which slows down the whole query (and makes the distributed
> computation idea obsolete).
>
> By now I have tried to set up the system several times and, interestingly
> enough, two times I was able to get a fully functional system. Unfortunately
> I currently cannot reproduce a functional system state, and whenever I try
> to do a new setup I get the data skewing towards one node.
>
> Has that ever happened before? Do you know the reason for this or how to
> fix that?
>
> Regards, Max