hawq-dev mailing list archives

From Caleb Welton <cwel...@pivotal.io>
Subject Re: Debugging installcheck-good failures
Date Fri, 22 Jan 2016 01:29:40 GMT
The last three failures:

test copy                 ... FAILED (34.76 sec)
test set_functions        ... FAILED (4.90 sec)
test exttab1              ... FAILED (17.66 sec)


*What are these tests verifying:  *
1. The COPY command, which is another data loading/unloading mechanism
inherited from Postgres.
2. User-defined functions that return multiple rows, e.g. "set functions".
3. External tables.
*What went wrong: *
1. The test is sensitive to output ordering and so will fail intermittently
2. PLpython didn't load
3. Issues setting up gpfdist
*Why did it go wrong:*
Copy failed because the rows came back in a different order than the test
was expecting, which is not entirely unexpected.  In a distributed system
the order of returned results is not guaranteed unless the query has an
explicit ORDER BY, in which case the master node performs an ordered merge
of the results it receives from the segments.
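
To make this concrete, here is a minimal sketch (the table and column names
are hypothetical, not taken from the test):

-- Without ORDER BY the row order depends on which segment returns its rows
-- first, so the output can differ from run to run.
SELECT id, tag FROM some_table;

-- With ORDER BY the master merges the per-segment streams into one total
-- order, so the output is stable.
SELECT id, tag FROM some_table ORDER BY id;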

The second and third are basically repeats of issues we have already seen,
so I won't spend any additional time on them.

*Debugging:*

There are actually two diffs in the copy output.  The first is a NOTICE
message that appears in the output but not in the expected file; the test
framework currently ignores all NOTICE messages, so this does not cause any
issues.  The second diff may or may not appear on repeated runs of the test,
due to the nature of ordered-results issues.

Basically we got:

> 2,b,2
> 1,a,1

instead of

> 1,a,1
> 2,b,2


Looking at the statement:

> copy copytest3 to stdout csv header;


We can see that no explicit ORDER BY is specified, and indeed it is not
possible to specify one within a COPY statement.

To fix this conclusively we would have to update the test to indicate that
we are not concerned with the output order of the results.  The easiest way
to handle this would be to use some of the existing tools and test
directives used by other tests.  In particular there is a script called
atmsort.pl that is designed to act as a preprocessing step before the diff;
it supports directives to omit sections of output, ensure that two queries
produce equivalent output, re-sort data, and so on.

You can see some of these atmsort.pl directives in some of the tests we've
already looked at, e.g. the '-- start_ignore' and '-- end_ignore'
directives in the exttab1 test show atmsort usage.  Similarly, you can see
some '-- order' directives in the sql/olap_window.sql test.  We would need a
similar directive here to ensure that the test produces stable output despite
the potentially unstable ordering returned from COPY.
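
As a rough sketch of what that could look like in the test's SQL (the exact
directive semantics and placement should be confirmed against atmsort.pl
itself):

-- Ask the framework to re-sort the output on the first column before diffing.
-- order 1
copy copytest3 to stdout csv header;

-- Or, if this output does not need to be validated at all:
-- start_ignore
copy copytest3 to stdout csv header;
-- end_ignore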

In this section we learned about atmsort.pl and how we can use it to filter
and handle preprocessing of test output.

And with that I'm done with this little mini-series on debugging
installcheck-good failures.


On Thu, Jan 21, 2016 at 4:54 PM, Caleb Welton <cwelton@pivotal.io> wrote:

> Investigating the third failure:
>
> test create_table_distribution ... FAILED (3.47 sec)
>
>
> *What is this test verifying:  *This test is creating tables with
> different distribution policies and verifying that different
> hashing/distribution schemes are handled correctly.
> *What went wrong: *This test makes assumptions about the cluster topology
> that do not hold in the devel environment.
> *Why did it go wrong:* The test's expected output is tied to a particular
> cluster configuration, which does not match the test environment.
>
> *Background*:
>
> HAWQ supports several "distribution" policies for tables including:
> - RANDOMLY distributed
> - HASH distributed
>
> For RANDOMLY distributed tables all the data gets placed into a single
> file structure and blocks are served out to different segments at execution
> time based on the HAWQ resource manager's block allocations.
>
> For HASH distributed tables, when data is loaded it will be hashed into a
> given number of buckets, where each bucket is a separate file structure in
> HDFS.  In the most typical situations the resource manager will ensure that
> it allocates one segment per hash bucket and each segment will be
> responsible for the data under that HASH.
>
> Historically the number of hash buckets was equal to the number of
> segments defined within the HAWQ cluster.  With the new HAWQ 2.0
> architecture and our support for dynamic cluster elasticity this is no
> longer a requirement and we now support specifying the number of hash
> buckets for a table at table creation time.  When no specific number of
> buckets is specified, the system will assign a default number of
> buckets based on the current configuration settings for any HASH
> distributed table.
>
> To create a RANDOMLY distributed table the syntax is:
>
> CREATE TABLE x(...) DISTRIBUTED RANDOMLY;
>
> To create a HASH distributed table the syntax is:
>
> CREATE TABLE x(...)
>    [ WITH (bucketnum = <number>) ]
>    DISTRIBUTED BY (<column_list>)
>
> If no explicit distribution policy is specified in the CREATE TABLE
> statement then the system will default to the first hashable column in the
> table as a HASH distributed key if possible, and use RANDOM distribution if
> there is no hashable column in the table.
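>
> As a brief hedged illustration (the table and column names here are made up
> for the example):
>
> -- No DISTRIBUTED clause: c1 is the first hashable column, so this behaves
> -- like DISTRIBUTED BY (c1) with the default number of buckets.
> CREATE TABLE t_default(c1 int, c2 text);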
>
> The catalog table gp_distribution_policy contains the metadata describing
> each table's distribution policy.
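>
> As a hedged sketch (the column names are from memory and worth confirming
> against the catalog definition), the policy recorded for the t_default table
> sketched above could be inspected with something like:
>
> -- attrnums lists the hash key columns (NULL for randomly distributed
> -- tables) and bucketnum the number of hash buckets.
> SELECT localoid::regclass AS table_name, attrnums, bucketnum
>   FROM gp_distribution_policy
>  WHERE localoid = 't_default'::regclass;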
>
> *Debugging:*
>
> As before I began by changing to the test directory and looking at what
> failed:
>
> bash$  cd src/test/regress
> bash$  diff results/create_table_distribution.out
> expected/create_table_distribution.out
>
> And this reveals that the catalog queries looking at how the tables have
> been defined are returning a different number of buckets than expected.
> More specifically, this occurs only for the tables without an explicit
> number of buckets specified.
>
> For example the statement:
>
> CREATE TABLE t1(c1 int);
>
>
> Resulted in 2 buckets instead of 8, but the statement:
>
> CREATE TABLE t1_1_w(c2 int) INHERITS(t1) WITH (bucketnum = 3)
>
>
> Resulted correctly in 3 buckets.
>
> As the default bucketing mechanism was recently changed to support the new
> elasticity, this is a new test, and it has clearly not been set up to expect
> the default number of hash buckets configured in the cluster within the
> devel environment.
>
> Result: This test needs to be made to work for arbitrary cluster
> configurations.  In particular it may be helpful if the test itself sets
> the default number of hash buckets so that the output is consistent.
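>
> A minimal sketch of what that could look like at the top of the test,
> assuming the default bucket count is controlled by a session-settable GUC
> along the lines of default_hash_table_bucket_number (the exact name and
> scope should be verified):
>
> -- Pin the default bucket count so the expected output no longer depends on
> -- the cluster's configuration.
> SET default_hash_table_bucket_number = 8;
> CREATE TABLE t1(c1 int);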
>
> In this example we learned a bit about the new storage model as it relates
> to hash and randomly distributed tables, how that information is
> represented within the catalog, as well as how to specify distribution
> policies on table create.
>
> On Thu, Jan 21, 2016 at 3:42 PM, Caleb Welton <cwelton@pivotal.io> wrote:
>
>> Investigating the second failure:
>>
>> test subplan              ... FAILED (8.15 sec)
>>
>>
>> *What is this test verifying:  *Queries with a combination of InitPlan
>> and Subplan nodes, particularly exercising aspects of the query
>> planner/optimizer.
>> *What went wrong: *This test relies on plpython functions, but the
>> plpython language did not install successfully.
>> *Why did it go wrong:* plpython is an optional extension and must be
>> enabled via ./configure when building the project.
>>
>> *Background*:
>>
>> Focusing just on the part relevant to diagnosing the issue...
>>
>> HAWQ supports user defined functions written in a variety of different
>> programming languages.  Support for additional programming languages is one
>> of the extensibility features of the product and can be accomplished by
>> writing specific Language Handler functions and associating those handler
>> functions with a language by way of the CREATE LANGUAGE statement.
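>>
>> As a hedged sketch of the general pattern (the handler function name and
>> library path here follow the usual Postgres convention and are not taken
>> from the HAWQ sources):
>>
>> -- Register the C call handler that lives in the language's shared library,
>> -- then create the language that uses it.
>> CREATE FUNCTION plpython_call_handler()
>>     RETURNS language_handler
>>     AS '$libdir/plpython'
>>     LANGUAGE C;
>>
>> CREATE LANGUAGE plpythonu
>>     HANDLER plpython_call_handler;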
>>
>> There are several languages for which the language handlers have already
>> been written, including plpython, plpgsql, and pljava.  The source code for
>> all of these can be found under src/pl.
>>
>> Of these three plpython and pljava are both considered <optional>
>> languages and are not built by default, whereas plpgsql is required and
>> always built with the server.
>>
>> *Debugging:*
>>
>> As before I began by changing to the test directory and looking at what
>> failed:
>>
>> bash$  cd src/test/regress
>> bash$  diff results/subplan.out expected/subplan.out
>>
>> And in the results we find this error:
>>
>> ERROR: could not access file "$libdir/plpython": No such file or
>>> directory
>>
>>
>> $libdir is the default location the system will look in to find library
>> files and plpython.so is the particular file that it is looking for.
>>
>> We can check what the path of $libdir is by running:
>>
>> bash$  pg_config  --libdir
>> /data/hawq-devel/lib
>>
>> And if we checked we would not find plpython.so, which is the source of
>> the problem.
>>
>> Knowing that plpython is an optional package I went back to the root
>> directory, reran the configure command including --with-python and then
>> tried the test again.
>> bash$  ./configure --prefix=/data/hawq-devel --with-python
>> bash$  make -j 4
>> bash$  make install
>> bash$  sed 's|localhost|centos7-namenode|g' -i
>> /data/hawq-devel/etc/hawq-site.xml
>> bash$  echo 'centos7-datanode1' > /data/hawq-devel/etc/slaves
>> bash$  echo 'centos7-datanode2' >> /data/hawq-devel/etc/slaves
>> bash$  echo 'centos7-datanode3' >> /data/hawq-devel/etc/slaves
>> bash$  hawq restart
>> bash$  make installcheck-good
>>
>> Result: problem was fixed.
>>
>> So the issue here is that installcheck-good is expecting plpython to be
>> present.  We should either make plpython a required language or change the
>> tests to be smart about not relying on optional packages.
>>
>>
>>
>> On Thu, Jan 21, 2016 at 3:26 PM, Caleb Welton <cwelton@pivotal.io> wrote:
>>
>>> Investigating the first failure:
>>>
>>> test errortbl             ... FAILED (6.83 sec)
>>>
>>>
>>> *What is this test verifying: * Error Table Support
>>> *What went wrong: *Basic connectivity to external tables.
>>> *Why did it go wrong: *The test was designed for a single-node setup
>>> and the test environment is multinode, combined with some recent changes to
>>> how external tables execute.
>>>
>>> *Background:*
>>> HAWQ supports a robust external table feature to provide access to data
>>> that is not managed directly by HAWQ.  One of the challenges when handling
>>> external data is what do you do with badly formatted data?  The approach
>>> taken in HAWQ is to:
>>> - Error by default, but enable a mechanism to instead omit badly
>>> formatted rows
>>> - When omitting rows we log the badly formatted portions to an "Error
>>> Table" for a user to review and potentially resolve through their own means.
>>>
>>> Example statement using error tables:
>>>
>>>> CREATE EXTERNAL TABLE EXT_NATION1 ( N_NATIONKEY  INTEGER ,
>>>>                             N_NAME       CHAR(25) ,
>>>>                             N_REGIONKEY  INTEGER ,
>>>>                             N_COMMENT    VARCHAR(152))
>>>> location ('gpfdist://localhost:7070/nation_error50.tbl')
>>>> FORMAT 'text' (delimiter '|')
>>>> LOG ERRORS INTO EXT_NATION_ERROR1 SEGMENT REJECT LIMIT 51;
>>>
>>>
>>> The "LOG ERRORS INTO" statement being the critical one that is the focus
>>> of this test suite.
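>>>
>>> Once rows have been rejected, the error table can be queried like any
>>> other table to see what was rejected and why. A hedged sketch (the
>>> error-table column names here are from memory and should be checked
>>> against the documentation):
>>>
>>> -- Review the rejected rows and the reason each one was rejected.
>>> SELECT relname, errmsg, rawdata
>>>   FROM ext_nation_error1;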
>>>
>>> *Debugging:*
>>>
>>> First thing I did was change to the testing directory and compare the
>>> outputs of the test to the expected outputs of the test:
>>>
>>> bash$  cd src/test/regress
>>> bash$  diff results/errortbl.out expected/errortbl.out
>>>
>>> From which the following can be found:
>>>
>>>> ERROR:  connection with gpfdist failed for
>>>> gpfdist://localhost:7070/nation_error50.tbl. effective url:
>>>> http://127.0.0.1:7070/nation_error50.tbl. error code = 111 (C
>>>
>>>
>>> So we can tell that the issue has to do with the connectivity to the
>>> external tables.
>>>
>>> This is an external table leveraging the gpfdist mechanism for loading
>>> data.  gpfdist is a data loading mechanism that relies on a gpfdist daemon
>>> running on the loading machine.
>>>
>>> Issue #1:  the URL provided to the external table is 'localhost'.  This
>>> will work fine in a single-node test environment, but since the hawq-devel
>>> environment is a multinode configuration, 'localhost' will be evaluated on
>>> every node that is trying to access data and will not resolve to the
>>> actual location of the data.
>>>
>>> Attempt to fix #1 - change the URL from localhost -> centos7-namenode.
>>> The right way to handle this would be updating the macro handling in the
>>> input/errortbl.source file so that @hostname@ translates to the correct
>>> hostname rather than localhost; for my own debugging I simply hard-coded
>>> it.
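>>>
>>> For reference, the relevant location clause would go from something like
>>> this in the .source file (a sketch, not the exact line from the test):
>>>
>>> location ('gpfdist://@hostname@:7070/nation_error50.tbl')
>>>
>>> to the following once @hostname@ resolves to the machine actually running
>>> gpfdist rather than to localhost:
>>>
>>> location ('gpfdist://centos7-namenode:7070/nation_error50.tbl')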
>>>
>>> Result: if I hand-start a gpfdist service on the namenode then
>>> everything works correctly, but if I let the test framework start the
>>> gpfdist service then things remain broken.
>>>
>>> After scratching my head briefly my next thought was: what could be
>>> going wrong with starting the gpfdist service?  On the surface this seems
>>> to be working correctly, after all we see the following in the output file:
>>>
>>> select * from gpfdist_start;
>>>>       x
>>>> -------------
>>>>  starting...
>>>> (1 row)
>>>> select * from gpfdist_status;
>>>>                                       x
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>>  Okay, gpfdist version "2.0.0.0_beta build dev" is running on
>>>> localhost:7070.
>>>> (1 row)
>>>
>>>
>>> If we take a closer look at how exactly the test framework is
>>> starting/stopping the gpfdist service we find the following:
>>>
>>> CREATE EXTERNAL WEB TABLE gpfdist_status (x text)
>>>> execute E'( python $GPHOME/bin/lib/gppinggpfdist.py localhost:7070 2>&1
>>>> || echo) '
>>>> on SEGMENT 0
>>>> FORMAT 'text' (delimiter '|');
>>>> CREATE EXTERNAL WEB TABLE gpfdist_start (x text)
>>>> execute E'((/data/hawq-devel/bin/gpfdist -p 7070 -d
>>>> /data/hawq/src/test/regress/data  </dev/null >/dev/null 2>&1 &); sleep 2;
>>>> echo "starting...") '
>>>> on SEGMENT 0
>>>> FORMAT 'text' (delimiter '|');
>>>> CREATE EXTERNAL WEB TABLE gpfdist_stop (x text)
>>>> execute E'(/bin/pkill gpfdist || killall gpfdist) > /dev/null 2>&1;
>>>> echo "stopping..."'
>>>> on SEGMENT 0
>>>> FORMAT 'text' (delimiter '|');
>>>
>>>
>>> Here we are using a different type of external table, an "EXECUTE" table
>>> and providing some command line options to start and stop the gpfdist
>>> daemon.   It's a bit hacky, but it gets the job done.  Or rather it
>>> should, and yet somehow a manually started gpfdist works and this doesn't,
>>> so something else is going wrong.
>>>
>>> Next step, investigate if the new external tables are executing on the
>>> right segment (i.e. the master segment).
>>>
>>> After creating the above external tables in a test database I ran:
>>>
>>> bash$  psql -c "explain select * from gpfdist_stop"
>>>
>>>>                                       QUERY PLAN
>>>>
>>>>
>>>> ---------------------------------------------------------------------------------------
>>>>  Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..11000.00
>>>> rows=1000000 width=32)
>>>>    ->  External Scan on gpfdist_stop  (cost=0.00..11000.00 rows=1000000
>>>> width=32)
>>>> (2 rows)
>>>
>>>
>>> And here we see something that provides a crucial clue: if the external
>>> table were running on the master node we would not expect to see a Gather
>>> Motion, and yet we do, which indicates that for some reason this external
>>> table EXECUTE is running on the wrong node.  This explains why connections
>>> back to "centos7-namenode" were not finding gpfdist running when going
>>> through the test framework.
>>>
>>> The other key piece of information that is needed here is that between
>>> Hawq 1.3 and Hawq 2.0 there were some major changes to the process
>>> architecture and how segments get assigned.  This was foundational work
>>> both for our YARN integration support and for improvements with respect
>>> to elasticity within the system.  And one of the areas that it
>>> impacted was the handling of external tables.
>>>
>>> With this last piece of information it became clear that the handling of
>>> EXTERNAL EXECUTE tables with specific segment assignments was broken in
>>> this merge.
>>>
>>> In summary this test is failing for two reasons:
>>> 1. It was designed for a single-node setup but is being run in a multinode
>>> environment, and there are some test issues to fix.
>>> 2. A recent change introduced a bug in external EXECUTE tables.
>>>
>>> Along the way we learned a little about external tables, error tables,
>>> gpfdist, external execute, reading query plans, and briefly discussed the
>>> multinode process model of HAWQ.
>>>
>>> Please let me know if you have any additional questions related to any
>>> of the above.
>>>
>>>
>>> On Thu, Jan 21, 2016 at 3:07 PM, Caleb Welton <cwelton@pivotal.io>
>>> wrote:
>>>
>>>> Dev community,
>>>>
>>>> One question that was asked at the last HAWQ's Nest community call was:
>>>>
>>>> "When something goes wrong with installcheck-good what are the common
>>>> reasons?"
>>>>
>>>> As a followup from the call I ran through the excellent dev-environment
>>>> guide that Zhanwei put together [1].  Along the way I ran into some issues
>>>> and filed a JIRA [2] to get those fixed.
>>>>
>>>> I thought providing a quick intro of how I looked at the failures
>>>> encountered and how to diagnose them might be insightful to the wider
>>>> community.
>>>>
>>>> Overall the current installcheck-good test suite will fail with 8
>>>> failures in the current dev environment.  I will walk through each of these
>>>> failures in turn in subsequent emails.
>>>>
>>>> [1] https://hub.docker.com/r/mayjojo/hawq-devel/
>>>> [2] https://issues.apache.org/jira/browse/HAWQ-358
>>>>
>>>
>>>
>>
>
