From: Caleb Welton
Date: Thu, 21 Jan 2016 15:26:51 -0800
Subject: Re: Debugging installcheck-good failures
To: dev@hawq.incubator.apache.org

Investigating the first failure:

    test errortbl ... FAILED (6.83 sec)

*What is this test verifying:* Error Table support.

*What went wrong:* Basic connectivity to external tables.

*Why did it go wrong:* The test was designed for a single-node setup and
the test environment is multinode, combined with some recent changes to
how external tables execute.

*Background:*

HAWQ supports a robust external table feature to provide access to data
that is not managed directly by HAWQ. One of the challenges when handling
external data is: what do you do with badly formatted data? The approach
taken in HAWQ is to:

- Error by default, but provide a mechanism to instead omit badly
  formatted rows.
- When omitting rows, log the badly formatted portions to an "Error
  Table" for the user to review and potentially resolve through their
  own means.

Example statement using error tables:

> CREATE EXTERNAL TABLE EXT_NATION1 ( N_NATIONKEY INTEGER,
>                                     N_NAME CHAR(25),
>                                     N_REGIONKEY INTEGER,
>                                     N_COMMENT VARCHAR(152))
> location ('gpfdist://localhost:7070/nation_error50.tbl')
> FORMAT 'text' (delimiter '|')
> LOG ERRORS INTO EXT_NATION_ERROR1 SEGMENT REJECT LIMIT 51;

The "LOG ERRORS INTO" clause is the critical one here and the focus of
this test suite.
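To make that concrete, this is roughly how the error table would be used
after a load. A minimal sketch, assuming the rejected-row table exposes
the usual Greenplum-style columns (linenum, errmsg, rawdata); treat the
column list as illustrative rather than authoritative:

    -- Scanning the external table skips malformed rows (up to the
    -- reject limit) instead of aborting the whole query.
    SELECT count(*) FROM ext_nation1;

    -- Review what was rejected and why (illustrative column names).
    SELECT linenum, errmsg, rawdata
      FROM ext_nation_error1
     ORDER BY linenum;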
*Debugging:*

The first thing I did was change to the testing directory and compare the
actual outputs of the test to the expected outputs:

bash$ cd src/test/regress
bash$ diff results/errortbl.out expected/errortbl.out

From which the following can be found:

> ERROR:  connection with gpfdist failed for
> gpfdist://localhost:7070/nation_error50.tbl. effective url:
> http://127.0.0.1:7070/nation_error50.tbl. error code = 111 (C

So we can tell that the issue has to do with connectivity to the external
tables. This is an external table leveraging the gpfdist mechanism for
loading data; gpfdist relies on a gpfdist daemon running on the machine
that serves the data.

Issue #1: the URL provided to the external table is 'localhost'. This
works fine in a single-node test environment, but since the hawq-devel
environment is a multinode configuration, 'localhost' is evaluated on
every node that tries to access the data and does not resolve to the
actual location of the data.

Attempt to fix #1: change the URL from localhost to centos7-namenode. The
right way to handle this would be to update the macro handling in the
input/errortbl.source file so that @hostname@ translates to the correct
hostname rather than localhost; for my own debugging I simply hard-coded
it.

Result: if I hand-start a gpfdist service on the namenode then everything
works correctly, but if I let the test framework start the gpfdist
service then things remain broken.
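For anyone following along, "hand-start" here just means running the
daemon directly on the namenode. A minimal sketch, reusing the same flags
the test framework itself passes; the curl check is my own addition,
assuming the file is served over plain HTTP as the error message above
suggests:

    # Serve the regression data directory on port 7070, on the host
    # named in the external table's location URL.
    /data/hawq-devel/bin/gpfdist -p 7070 \
        -d /data/hawq/src/test/regress/data > /tmp/gpfdist.log 2>&1 &

    # Sanity check that the daemon is up and the file is reachable.
    curl http://centos7-namenode:7070/nation_error50.tbl | head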
After scratching my head briefly, my next thought was: what could be
going wrong with starting the gpfdist service? On the surface this seems
to be working correctly; after all, we see the following in the output
file:

> select * from gpfdist_start;
>       x
> -------------
>  starting...
> (1 row)
>
> select * from gpfdist_status;
>                                       x
> ------------------------------------------------------------------------------
>  Okay, gpfdist version "2.0.0.0_beta build dev" is running on
> localhost:7070.
> (1 row)

If we take a closer look at how exactly the test framework is
starting/stopping the gpfdist service, we find the following:

> CREATE EXTERNAL WEB TABLE gpfdist_status (x text)
> execute E'( python $GPHOME/bin/lib/gppinggpfdist.py localhost:7070 2>&1 ||
> echo) '
> on SEGMENT 0
> FORMAT 'text' (delimiter '|');
>
> CREATE EXTERNAL WEB TABLE gpfdist_start (x text)
> execute E'((/data/hawq-devel/bin/gpfdist -p 7070 -d
> /data/hawq/src/test/regress/data > /dev/null 2>&1 &); sleep 2;
> echo "starting...") '
> on SEGMENT 0
> FORMAT 'text' (delimiter '|');
>
> CREATE EXTERNAL WEB TABLE gpfdist_stop (x text)
> execute E'(/bin/pkill gpfdist || killall gpfdist) > /dev/null 2>&1; echo
> "stopping..."'
> on SEGMENT 0
> FORMAT 'text' (delimiter '|');

Here we are using a different type of external table, an "EXECUTE" table,
and providing command lines to start and stop the gpfdist daemon. It's a
bit hacky, but it gets the job done.
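(For clarity, here is the intended lifecycle: the whole daemon dance is
driven through plain selects, so the test harness never needs shell
access to the cluster. A sketch of the sequence; the comments summarize
what each EXECUTE table above actually runs.)

    SELECT * FROM gpfdist_start;   -- fork gpfdist on segment 0, print "starting..."
    SELECT * FROM gpfdist_status;  -- ping localhost:7070 to confirm it is up
    -- ... the errortbl external table queries run here ...
    SELECT * FROM gpfdist_stop;    -- pkill gpfdist, print "stopping..."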
Or rather, it should get the job done; yet somehow a manually started
gpfdist works and this one does not, so something else is going wrong.

Next step: investigate whether the new external tables are executing on
the right segment (i.e. the master segment). After creating the above
external tables in a test database I ran:

bash$ psql -c "explain select * from gpfdist_stop"

> QUERY PLAN
> ---------------------------------------------------------------------------------------
>  Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..11000.00 rows=1000000 width=32)
>    ->  External Scan on gpfdist_stop  (cost=0.00..11000.00 rows=1000000 width=32)
> (2 rows)

And here we see something that provides a crucial clue: if the external
table were running on the master node we would not expect to see a Gather
Motion, and yet we do, which indicates that for some reason this external
table EXECUTE is running on the wrong node. This explains why connections
back to "centos7-namenode" were not finding gpfdist running when going
through the test framework.

The other key piece of information needed here is that between HAWQ 1.3
and HAWQ 2.0 there were major changes to the process architecture and to
how segments get assigned. This was foundational work both for our YARN
integration support and for improvements to elasticity within the system,
and one of the areas it impacted was the handling of external tables.
With this last piece of information it became clear that the handling of
EXTERNAL ... EXECUTE tables with specific segment assignments was broken
in this merge.

In summary, this test is failing for two reasons:

1. It was designed for a single-node setup but is being run multinode,
   and there are some test issues to fix.
2. A change was merged that introduced a bug in external EXECUTE tables.

Along the way we learned a little about external tables, error tables,
gpfdist, external EXECUTE, reading query plans, and briefly discussed the
multinode process model of HAWQ.

Please let me know if you have any additional questions related to any of
the above.

On Thu, Jan 21, 2016 at 3:07 PM, Caleb Welton wrote:

> Dev community,
>
> One question that was asked at the last HAWQ's Nest community call was:
>
> "When something goes wrong with installcheck-good what are the common
> reasons?"
>
> As a followup from the call I ran through the excellent dev-environment
> guide that Zhanwei put together [1]. Along the way I ran into some issues
> and filed a jira [2] to get those fixed.
>
> I thought providing a quick intro of how I looked at the failures
> encountered and how to diagnose them might be insightful to the wider
> community.
>
> Overall the current installcheck-good test suite will fail with 8 failures
> in the current dev environment. I will walk through each of these failures
> in turn in subsequent emails.
>
> [1] https://hub.docker.com/r/mayjojo/hawq-devel/
> [2] https://issues.apache.org/jira/browse/HAWQ-358