From: Caleb Welton
Date: Thu, 21 Jan 2016 15:26:51 -0800
Subject: Re: Debugging installcheck-good failures
To: dev@hawq.incubator.apache.org

Investigating the first failure:

    test errortbl ... FAILED (6.83 sec)

*What is this test verifying:* Error Table support.

*What went wrong:* Basic connectivity to external tables.

*Why did it go wrong:* The test was designed for a single-node setup and
the test environment is multinode, combined with some recent changes to
how external tables execute.

*Background:*

HAWQ supports a robust external table feature to provide access to data
that is not managed directly by HAWQ. One of the challenges when handling
external data is: what do you do with badly formatted data? The approach
taken in HAWQ is to:

- Error by default, but provide a mechanism to instead omit badly
  formatted rows.
- When omitting rows, log the badly formatted portions to an "Error
  Table" for the user to review and potentially resolve through their
  own means.

Example statement using error tables:

> CREATE EXTERNAL TABLE EXT_NATION1 ( N_NATIONKEY INTEGER,
>                                     N_NAME CHAR(25),
>                                     N_REGIONKEY INTEGER,
>                                     N_COMMENT VARCHAR(152))
> location ('gpfdist://localhost:7070/nation_error50.tbl')
> FORMAT 'text' (delimiter '|')
> LOG ERRORS INTO EXT_NATION_ERROR1 SEGMENT REJECT LIMIT 51;

The "LOG ERRORS INTO" clause is the critical one here and the focus of
this test suite.
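To make that concrete, this is roughly how the error table would be used
after a load. A minimal sketch, assuming the rejected-row table exposes
the usual Greenplum-style columns (linenum, errmsg, rawdata); treat the
column list as illustrative rather than authoritative:

    -- Scanning the external table skips malformed rows (up to the
    -- reject limit) instead of aborting the whole query.
    SELECT count(*) FROM ext_nation1;

    -- Review what was rejected and why (illustrative column names).
    SELECT linenum, errmsg, rawdata
      FROM ext_nation_error1
     ORDER BY linenum;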
*Debugging:*

The first thing I did was change to the testing directory and compare the
actual outputs of the test to the expected outputs:

bash$ cd src/test/regress
bash$ diff results/errortbl.out expected/errortbl.out

From which the following can be found:

> ERROR:  connection with gpfdist failed for
> gpfdist://localhost:7070/nation_error50.tbl. effective url:
> http://127.0.0.1:7070/nation_error50.tbl. error code = 111 (C

So we can tell that the issue has to do with connectivity to the external
tables. This is an external table leveraging the gpfdist mechanism for
loading data; gpfdist relies on a gpfdist daemon running on the machine
that serves the data.

Issue #1: the URL provided to the external table is 'localhost'. This
works fine in a single-node test environment, but since the hawq-devel
environment is a multinode configuration, 'localhost' is evaluated on
every node that tries to access the data and does not resolve to the
actual location of the data.

Attempt to fix #1: change the URL from localhost to centos7-namenode. The
right way to handle this would be to update the macro handling in the
input/errortbl.source file so that @hostname@ translates to the correct
hostname rather than localhost; for my own debugging I simply hard-coded
it.

Result: if I hand-start a gpfdist service on the namenode then everything
works correctly, but if I let the test framework start the gpfdist
service then things remain broken.
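For anyone following along, "hand-start" here just means running the
daemon directly on the namenode. A minimal sketch, reusing the same flags
the test framework itself passes; the curl check is my own addition,
assuming the file is served over plain HTTP as the error message above
suggests:

    # Serve the regression data directory on port 7070, on the host
    # named in the external table's location URL.
    /data/hawq-devel/bin/gpfdist -p 7070 \
        -d /data/hawq/src/test/regress/data > /tmp/gpfdist.log 2>&1 &

    # Sanity check that the daemon is up and the file is reachable.
    curl http://centos7-namenode:7070/nation_error50.tbl | head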
After scratching my head briefly, my next thought was: what could be
going wrong with starting the gpfdist service? On the surface this seems
to be working correctly; after all, we see the following in the output
file:

> select * from gpfdist_start;
>       x
> -------------
>  starting...
> (1 row)
>
> select * from gpfdist_status;
>                                       x
> ------------------------------------------------------------------------------
>  Okay, gpfdist version "2.0.0.0_beta build dev" is running on
> localhost:7070.
> (1 row)

If we take a closer look at how exactly the test framework is
starting/stopping the gpfdist service, we find the following:

> CREATE EXTERNAL WEB TABLE gpfdist_status (x text)
> execute E'( python $GPHOME/bin/lib/gppinggpfdist.py localhost:7070 2>&1 ||
> echo) '
> on SEGMENT 0
> FORMAT 'text' (delimiter '|');
>
> CREATE EXTERNAL WEB TABLE gpfdist_start (x text)
> execute E'((/data/hawq-devel/bin/gpfdist -p 7070 -d
> /data/hawq/src/test/regress/data > /dev/null 2>&1 &); sleep 2;
> echo "starting...") '
> on SEGMENT 0
> FORMAT 'text' (delimiter '|');
>
> CREATE EXTERNAL WEB TABLE gpfdist_stop (x text)
> execute E'(/bin/pkill gpfdist || killall gpfdist) > /dev/null 2>&1; echo
> "stopping..."'
> on SEGMENT 0
> FORMAT 'text' (delimiter '|');

Here we are using a different type of external table, an "EXECUTE" table,
and providing command lines to start and stop the gpfdist daemon. It's a
bit hacky, but it gets the job done.
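(For clarity, here is the intended lifecycle: the whole daemon dance is
driven through plain selects, so the test harness never needs shell
access to the cluster. A sketch of the sequence; the comments summarize
what each EXECUTE table above actually runs.)

    SELECT * FROM gpfdist_start;   -- fork gpfdist on segment 0, print "starting..."
    SELECT * FROM gpfdist_status;  -- ping localhost:7070 to confirm it is up
    -- ... the errortbl external table queries run here ...
    SELECT * FROM gpfdist_stop;    -- pkill gpfdist, print "stopping..."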
Or rather, it should get the job done; yet somehow a manually started
gpfdist works and this one does not, so something else is going wrong.

Next step: investigate whether the new external tables are executing on
the right segment (i.e. the master segment). After creating the above
external tables in a test database I ran:

bash$ psql -c "explain select * from gpfdist_stop"

> QUERY PLAN
> ---------------------------------------------------------------------------------------
>  Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..11000.00 rows=1000000 width=32)
>    ->  External Scan on gpfdist_stop  (cost=0.00..11000.00 rows=1000000 width=32)
> (2 rows)

And here we see something that provides a crucial clue: if the external
table were running on the master node we would not expect to see a Gather
Motion, and yet we do, which indicates that for some reason this external
table EXECUTE is running on the wrong node. This explains why connections
back to "centos7-namenode" were not finding gpfdist running when going
through the test framework.

The other key piece of information needed here is that between HAWQ 1.3
and HAWQ 2.0 there were major changes to the process architecture and to
how segments get assigned. This was foundational work both for our YARN
integration support and for improvements to elasticity within the system,
and one of the areas it impacted was the handling of external tables.
With this last piece of information it became clear that the handling of
EXTERNAL ... EXECUTE tables with specific segment assignments was broken
in this merge.

In summary, this test is failing for two reasons:

1. It was designed for a single-node setup but is being run multinode,
   and there are some test issues to fix.
2. A change was merged that introduced a bug in external EXECUTE tables.

Along the way we learned a little about external tables, error tables,
gpfdist, external EXECUTE, reading query plans, and briefly discussed the
multinode process model of HAWQ.

Please let me know if you have any additional questions related to any of
the above.

On Thu, Jan 21, 2016 at 3:07 PM, Caleb Welton wrote:

> Dev community,
>
> One question that was asked at the last HAWQ's Nest community call was:
>
> "When something goes wrong with installcheck-good what are the common
> reasons?"
>
> As a followup from the call I ran through the excellent dev-environment
> guide that Zhanwei put together [1]. Along the way I ran into some issues
> and filed a jira [2] to get those fixed.
>
> I thought providing a quick intro of how I looked at the failures
> encountered and how to diagnose them might be insightful to the wider
> community.
>
> Overall the current installcheck-good test suite will fail with 8 failures
> in the current dev environment. I will walk through each of these failures
> in turn in subsequent emails.
>
> [1] https://hub.docker.com/r/mayjojo/hawq-devel/
> [2] https://issues.apache.org/jira/browse/HAWQ-358