hawq-dev mailing list archives

From Caleb Welton <cwel...@pivotal.io>
Subject Re: Debugging installcheck-good failures
Date Thu, 21 Jan 2016 23:42:30 GMT
Investigating the second failure:

test subplan              ... FAILED (8.15 sec)

*What is this test verifying:  *Queries with a combination of InitPlan and
Subplan nodes, particularly exercising aspects of the query
*What went wrong: *This test relies on plpython functions, but the plpython
language did not install successfully.
*Why did it go wrong: *plpython is an optional extension of the project and
must be enabled during ./configure when building the project.


Focussing just on the part relevant to diagnosing the issue...

HAWQ supports user defined functions written in a variety of different
programming languages.  Support for additional programming languages is one
of the extensibility features of the product and can be accomplished by
writing specific Language Handler functions and associating those handler
functions with a language by way of the CREATE LANGUAGE statement.

There are several languages where the language handlers have already been
written, including plpython, plpgsql, and pljava.  The source code for all
of these can be found under src/pl.

Of these three, plpython and pljava are both considered optional languages
and are not built by default, whereas plpgsql is required and always built
with the server.


As before, I began by changing to the test directory and looking at how the
test output differed from the expected output:

bash$  cd src/test/regress
bash$  diff results/subplan.out expected/subplan.out

The diff shows this error:

ERROR: could not access file "$libdir/plpython": No such file or directory
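
When the diff is long, the new errors can be pulled out directly. A minimal
sketch, assuming the regress layout used above (results/ and expected/); the
function name new_errors is made up for illustration:

```shell
# Hypothetical helper: print ERROR lines that appear in the actual test
# output but not in the expected output.
new_errors() {
    results=$1
    expected=$2
    # diff prefixes lines that exist only in the first file with "< "
    diff "$results" "$expected" | grep '^< ERROR' | sed 's/^< //'
}

# Usage, from src/test/regress:
#   new_errors results/subplan.out expected/subplan.out
```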

$libdir is the default location the system will look in to find library
files and plpython.so is the particular file that it is looking for.

$libdir expands to the package library directory, which we can print by
running:

bash$  pg_config  --pkglibdir

If we look in that directory we will not find plpython.so, which is the
source of the problem.
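
This check can be scripted. A minimal sketch, assuming the handler library
is named plpython.so (as the error message suggests) and that the directory
comes from pg_config --pkglibdir, which is what $libdir expands to; the
function name has_pl_library is hypothetical:

```shell
# Hypothetical helper: succeed if <dir>/<name>.so exists as a regular file.
has_pl_library() {
    dir=$1
    name=$2
    [ -f "$dir/$name.so" ]
}

# Usage (assumes pg_config is on PATH):
#   has_pl_library "$(pg_config --pkglibdir)" plpython \
#       || echo "plpython is not installed"
```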

Knowing that plpython is an optional package, I went back to the root
directory, reran the configure command with --with-python, and then
tried the test again:
bash$  ./configure --prefix=/data/hawq-devel --with-python
bash$  make -j 4
bash$  make install
bash$  sed 's|localhost|centos7-namenode|g' -i
bash$  echo 'centos7-datanode1' > /data/hawq-devel/etc/slaves
bash$  echo 'centos7-datanode2' >> /data/hawq-devel/etc/slaves
bash$  echo 'centos7-datanode3' >> /data/hawq-devel/etc/slaves
bash$  hawq restart
bash$  make installcheck-good

Result: problem was fixed.

So the issue here is that installcheck-good is expecting plpython to be
present.  We should either make plpython a required language or change the
tests to be smart about not relying on optional packages.
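
One way the second option might look: decide up front whether the
plpython-dependent tests should run by inspecting the options the build was
configured with. This is only a sketch, not how the test framework works
today; configured_with is a hypothetical helper, and it assumes the option
string comes from pg_config --configure.

```shell
# Hypothetical helper: succeed if the configure option appears as a whole
# word in the option string, either single-quoted or bare.
configured_with() {
    option=$1
    options=$2
    case " $options " in
        *"'$option'"*|*" $option "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (assumes pg_config is on PATH):
#   if configured_with --with-python "$(pg_config --configure)"; then
#       echo "plpython tests can run"
#   else
#       echo "skipping tests that need plpython"
#   fi
```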

On Thu, Jan 21, 2016 at 3:26 PM, Caleb Welton <cwelton@pivotal.io> wrote:

> Investigating the first failure:
> test errortbl             ... FAILED (6.83 sec)
> *What is this test verifying: * Error Table Support
> *What went wrong: *Basic connectivity to external tables
> *Why did it go wrong: *The test was designed for a single node setup and
> the test environment is multinode, combined with some recent changes to how
> external tables execute.
> *Background:*
> HAWQ supports a robust external table feature to provide access to data
> that is not managed directly by HAWQ.  One of the challenges when handling
> external data is what do you do with badly formatted data?  The approach
> taken in HAWQ is to:
> - Error by default, but enable a mechanism to instead omit badly formatted
> rows
> - When omitting rows we log the badly formatted portions to an "Error
> Table" for a user to review and potentially resolve through their own means.
> Example statement using error tables:
>>                             N_NAME       CHAR(25) ,
>>                             N_REGIONKEY  INTEGER ,
>>                             N_COMMENT    VARCHAR(152))
>> location ('gpfdist://localhost:7070/nation_error50.tbl')
>> FORMAT 'text' (delimiter '|')
> The "LOG ERRORS INTO" clause is the critical part and is the focus
> of this test suite.
> *Debugging:*
> First thing I did was change to the testing directory and compare the
> outputs of the test to the expected outputs of the test:
> bash$  cd src/test/regress
> bash$  diff results/errortbl.out expected/errortbl.out
> From which the following can be found:
>> ERROR:  connection with gpfdist failed for
>> gpfdist://localhost:7070/nation_error50.tbl. effective url:
>> error code = 111 (C
> So we can tell that the issue has to do with the connectivity to the
> external tables.
> This is an external table leveraging the gpfdist mechanism for loading
> data.  gpfdist is a data loading mechanism that relies on a gpfdist daemon
> running on the loading machine.
> Issue #1:  the URL provided to the external table is 'localhost'.  This
> works fine in a single-node test environment, but the hawq-devel
> environment is a multinode configuration, so 'localhost' is evaluated
> on every node that tries to access the data and does not resolve to the
> actual location of the data.
> Attempt to fix #1 - change the URL from localhost -> centos7-namenode.
> The right way to handle this would be updating the macro handling in the
> input/errortbl.source file so that @hostname@ translates to the correct
> hostname rather than localhost; for my own debugging I simply hard-coded
> it.
> Result: if I hand start a gpfdist service on the namenode then everything
> works correctly, but if I let the test framework start the gpfdist service
> then things remain broken.
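
The @hostname@ substitution mentioned above can be illustrated with sed.
This is a sketch of the idea only, not the test framework's actual macro
handling; expand_hostname is a hypothetical name, and the host is passed in
explicitly rather than discovered.

```shell
# Hypothetical helper: replace every @hostname@ marker on stdin with the
# given host name and write the result to stdout.
expand_hostname() {
    host=$1
    sed "s|@hostname@|$host|g"
}

# Usage:
#   expand_hostname centos7-namenode < input/errortbl.source
```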
> After scratching my head briefly my next thought was: what could be going
> wrong with starting the gpfdist service?  On the surface this seems to be
> working correctly; after all, we see the following in the output file:
> select * from gpfdist_start;
>>       x
>> -------------
>>  starting...
>> (1 row)
>> select * from gpfdist_status;
>>                                       x
>> ------------------------------------------------------------------------------
>>  Okay, gpfdist version " build dev" is running on
>> localhost:7070.
>> (1 row)
> If we take a closer look at how exactly the test framework is
> starting/stopping the gpfdist service we find the following:
> CREATE EXTERNAL WEB TABLE gpfdist_status (x text)
>> execute E'( python $GPHOME/bin/lib/gppinggpfdist.py localhost:7070 2>&1
>> || echo) '
>> on SEGMENT 0
>> FORMAT 'text' (delimiter '|');
>> CREATE EXTERNAL WEB TABLE gpfdist_start (x text)
>> execute E'((/data/hawq-devel/bin/gpfdist -p 7070 -d
>> /data/hawq/src/test/regress/data  </dev/null >/dev/null 2>&1 &);
>> sleep 2;
>> echo "starting...") '
>> on SEGMENT 0
>> FORMAT 'text' (delimiter '|');
>> CREATE EXTERNAL WEB TABLE gpfdist_stop (x text)
>> execute E'(/bin/pkill gpfdist || killall gpfdist) > /dev/null 2>&1; echo
>> "stopping..."'
>> on SEGMENT 0
>> FORMAT 'text' (delimiter '|');
> Here we are using a different type of external table, an "EXECUTE" table
> and providing some command line options to start and stop the gpfdist
> daemon.   It's a bit hacky, but it gets the job done.  Or rather it
> should, and yet somehow a manually started gpfdist works and this doesn't,
> so something else is going wrong.
> Next step, investigate if the new external tables are executing on the
> right segment (i.e. the master segment).
> After creating the above external tables in a test database I ran:
> bash$  psql -c "explain select * from gpfdist_stop"
>>                                       QUERY PLAN
>> ---------------------------------------------------------------------------------------
>>  Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..11000.00
>> rows=1000000 width=32)
>>    ->  External Scan on gpfdist_stop  (cost=0.00..11000.00 rows=1000000
>> width=32)
>> (2 rows)
> And here we see something that provides a crucial clue: if the external
> table was running on the master node we would not expect to see a Gather
> Motion, and yet we do, which indicates that for some reason this external
> table EXECUTE is running on the wrong node.  This explains why connections
> back to "centos7-namenode" were not finding gpfdist running when going
> through the test framework.
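
The same check can be done mechanically. A minimal sketch that follows the
reasoning above (a Gather Motion in the plan means the scan is not running
on the master); plan_has_gather_motion is a hypothetical name and the plan
text is read from stdin:

```shell
# Hypothetical helper: succeed if the EXPLAIN output on stdin contains a
# Gather Motion node.
plan_has_gather_motion() {
    grep -q 'Gather Motion'
}

# Usage:
#   psql -c "explain select * from gpfdist_stop" | plan_has_gather_motion \
#       && echo "external table is not executing on the master"
```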
> The other key piece of information that is needed here is that between
> HAWQ 1.3 and HAWQ 2.0 there were some major changes to the process
> architecture and how segments get assigned.  This was foundational work
> both for our YARN integration support and for improvements with
> respect to elasticity within the system.  One of the areas that it
> impacted was the handling of external tables.
> With this last piece of information it became clear that the handling of
> EXTERNAL EXECUTE tables with specific segment assignments was broken in
> this merge.
> In summary this test is failing for two reasons:
> 1. It was designed for a single node setup but is being run in multinode
> and there are some test issues to fix.
> 2. A recent change introduced a bug in External EXECUTE
> tables.
> Along the way we learned a little about external tables, error tables,
> gpfdist, external execute, reading query plans, and briefly discussed the
> multinode process model of HAWQ.
> Please let me know if you have any additional questions related to any of
> the above.
> On Thu, Jan 21, 2016 at 3:07 PM, Caleb Welton <cwelton@pivotal.io> wrote:
>> Dev community,
>> One question that was asked at the last HAWQ's Nest community call was:
>> "When something goes wrong with installcheck-good what are the common
>> reasons?"
>> As a followup from the call I ran through the excellent dev-environment
>> guide that Zhanwei put together [1].  Along the way I ran into some issues
>> and filed a jira [2] to get those fixed.
>> I thought providing a quick intro of how I looked at the failures
>> encountered and how to diagnose them might be insightful to the wider
>> community.
>> Overall the current installcheck-good test suite will fail with 8
>> failures in the current dev environment.  I will walk through each of these
>> failures in turn in subsequent emails.
>> [1] https://hub.docker.com/r/mayjojo/hawq-devel/
>> [2] https://issues.apache.org/jira/browse/HAWQ-358
