drill-issues mailing list archives

From "Parth Chandra (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4313) C++ client - Improve method of drillbit selection from cluster
Date Sat, 20 Feb 2016 00:55:18 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155239#comment-15155239 ]

Parth Chandra commented on DRILL-4313:

Here's what I have seen/found:
Tableau can use a connection pool to parallelize the execution of queries to a single data
source. Under the covers, Tableau creates a new process for every connection to Drill.
It then distributes the queries in some fashion across the open connections.

In a test Tableau dashboard, which consisted of 29 queries being sent to Drill, the pattern
I saw was that Tableau would create a single connection that ran a couple of metadata queries
and then create nine more connections (each in a new process). The ten connections then
executed the remaining queries.

The problems -
1) Logging is not safe 
The creation of many processes has an interesting side effect. The ODBC driver initializes
the drill client library logging with a new name every time it is loaded and uses a timestamp
to create unique names. Since the connections in the Tableau pool are initialized at the same
time, most of them get the same log file name, and only one succeeds in opening it. The other
connections then cannot log anything. Additionally, logging is not really thread safe in the
client. Multiple threads tend to make the log less readable, as log statements from two
threads get intermixed.
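
To illustrate the two logging problems above, here is a minimal sketch, assuming a C++11
build. The names makeLogFileName and SerializedLog are hypothetical and are not part of the
Drill client or the ODBC driver. The first helper adds the process id to the timestamp so that
processes started in the same instant get different file names. The second guards each write
with a mutex so statements from different threads are not intermixed.

```cpp
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: combine a timestamp with the process id so that
// processes started at the same instant still get distinct file names.
// The caller supplies both values (for example from time() and getpid()).
std::string makeLogFileName(const std::string& prefix,
                            long long timestamp,
                            long processId) {
    std::ostringstream name;
    name << prefix << '-' << timestamp << '-' << processId << ".log";
    return name.str();
}

// Hypothetical logger: a mutex serializes writes so that each statement
// stays on its own line even when several threads log at once.
class SerializedLog {
public:
    explicit SerializedLog(std::ostream& out) : m_out(out) {}

    void write(const std::string& line) {
        std::lock_guard<std::mutex> guard(m_mutex);
        m_out << line << '\n';
    }

private:
    std::ostream& m_out;
    std::mutex m_mutex;
};
```

Two processes that start in the same second would then produce names that differ only in the
process id part.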

2) The std::rand function is unreliable
The C library rand() function is close to being removed from the standard because it is inherently
flawed. (See http://cpp.indi.frih.net/blog/2014/12/the-bell-has-tolled-for-rand/ for an easy
to read explanation.)
The alternative is to switch to Boost (or upgrade the build to C++11), either of which provides
a much better random library. Both provide a seed source that can use device-dependent
methods to provide a truly random seed, and a pseudo-random number generator (mt19937) that
performs much better.
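
A minimal sketch of the C++11 variant, using std::random_device for the seed and std::mt19937
as the generator. The function names makeEngine and pickDrillbitIndex are hypothetical and only
show how an index into a list of drillbits could be chosen. Note that std::random_device is
allowed to be deterministic on platforms without a hardware or OS entropy source.

```cpp
#include <cstddef>
#include <random>

// Seed a Mersenne Twister once from std::random_device.
std::mt19937 makeEngine() {
    std::random_device device;
    return std::mt19937(device());
}

// Hypothetical helper: pick an index in [0, count) for a list of drillbits.
// count must be greater than zero.
std::size_t pickDrillbitIndex(std::mt19937& engine, std::size_t count) {
    std::uniform_int_distribution<std::size_t> dist(0, count - 1);
    return dist(engine);
}
```

The engine should be created once and reused. Reseeding on every call would throw away the
benefit of the better generator.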

3) With logging fixed, and the random number generator updated, Tableau's pattern still causes
uneven distribution. A situation similar to the one below occurred fairly frequently -
(Note a similar unevenness occurred with a 10 node cluster as well)
   Tableau connections - 10
   Cluster size - 3 
   Queries - 29

   Connection   Node   Num queries sent
   1            n1     5
   2            n2     2
   3            n1     4
   4            n3     1
   5            n3     2
   6            n2     4
   7            n1     3
   8            n3     2
   9            n2     3
   10           n1     3

n1 has 15 queries, while n3 has only 5 queries sent to it.

4) Client side pooling improves this but is sometimes still a little askew. The worst I saw was:
   n1 - 12 queries
   n2 - 9  queries
   n3 - 8  queries
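
For comparison, here is a sketch of strict round robin over a pool, which is what the issue
description asks for. RoundRobinSelector and distribute are hypothetical names, the selector is
not thread safe as written, and this is not the actual client implementation. It only shows the
ideal case: 29 queries rotated over 3 pool slots give 10, 10 and 9.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical selector: hands out pool indexes 0, 1, ..., n-1, 0, 1, ...
// poolSize must be greater than zero. Not thread safe as written.
class RoundRobinSelector {
public:
    explicit RoundRobinSelector(std::size_t poolSize)
        : m_poolSize(poolSize), m_next(0) {}

    std::size_t next() {
        const std::size_t chosen = m_next;
        m_next = (m_next + 1) % m_poolSize;
        return chosen;
    }

private:
    std::size_t m_poolSize;
    std::size_t m_next;
};

// Count how many of queryCount queries each pool slot would receive.
std::vector<std::size_t> distribute(std::size_t poolSize,
                                    std::size_t queryCount) {
    std::vector<std::size_t> counts(poolSize, 0);
    RoundRobinSelector selector(poolSize);
    for (std::size_t i = 0; i < queryCount; ++i) {
        ++counts[selector.next()];
    }
    return counts;
}
```

The even split holds only if each pool slot maps to a different node. If two slots land on the
same drillbit, that node still receives a larger share.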

Client side pooling has an additional problem: we cannot maintain session settings across
the pool without additional work. One option is for the client library to record all alter
session queries and replay them on every connection in the pool (ugly). Another option
is to create a session id and maintain the id in Zookeeper. As part of the handshake, the client
would either request a new session or ask to join (reuse) an existing session based on the
session id. (This is not simple and promises to cause grief, IMHO.) This second option also
breaks backward compatibility.
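
A rough sketch of the first (replay) option, to show why it is workable but ugly. SessionOptionLog
and replaySessionOptions are hypothetical names, and the check for an alter session statement is
a naive, case-insensitive prefix match that does not handle leading whitespace or comments. The
execute callback stands in for whatever call submits a query on one pooled connection.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical recorder: remembers every statement that starts with
// "ALTER SESSION" so it can be replayed on other pooled connections.
class SessionOptionLog {
public:
    // Returns true if the statement was recorded.
    bool record(const std::string& sql) {
        if (!isAlterSession(sql)) return false;
        m_statements.push_back(sql);
        return true;
    }

    const std::vector<std::string>& statements() const {
        return m_statements;
    }

private:
    // Naive case-insensitive prefix check; an assumption for illustration.
    static bool isAlterSession(const std::string& sql) {
        static const std::string prefix = "ALTER SESSION";
        if (sql.size() < prefix.size()) return false;
        for (std::size_t i = 0; i < prefix.size(); ++i) {
            const unsigned char c = static_cast<unsigned char>(sql[i]);
            if (std::toupper(c) != prefix[i]) return false;
        }
        return true;
    }

    std::vector<std::string> m_statements;
};

// Replay the recorded statements, in order, through a caller-supplied
// callback that runs one statement on one connection.
template <typename Execute>
void replaySessionOptions(const SessionOptionLog& log, Execute execute) {
    for (const std::string& sql : log.statements()) {
        execute(sql);
    }
}
```

Every new or reconnected pool member would need the full replay before it accepts queries, which
is part of what makes this approach awkward.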

I have the implementation for client side connection pooling with the caveat that the user
can only use system level options. Since Tableau appears to create a connection pool itself,
I don't see how Tableau would be using session level options anyway. 
I don't think this should be exposed to the end user unless they really want it (and it appears
that some do), so this would be something that cannot be enabled through the ODBC driver, only
by some other means such as an environment variable. It would also be off by default.
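
A sketch of what an off-by-default environment variable switch could look like. The function
names are hypothetical, the variable name is left to the caller, and treating only the value
"1" as enabled is an assumption made for illustration.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical rule: pooling is enabled only when the value is exactly "1".
// A missing variable (null pointer) means disabled, so the default is off.
bool parsePoolingFlag(const char* value) {
    return value != nullptr && std::string(value) == "1";
}

// Read the flag from the environment variable with the given name.
bool isPoolingEnabled(const char* variableName) {
    return parsePoolingFlag(std::getenv(variableName));
}
```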


> C++ client - Improve method of drillbit selection from cluster
> --------------------------------------------------------------
>                 Key: DRILL-4313
>                 URL: https://issues.apache.org/jira/browse/DRILL-4313
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>             Fix For: 1.6.0
> The current C++ client handles multiple parallel queries over the same connection, but
> that creates a bottleneck as the queries get sent to the same drillbit.
> The client can manage this more effectively by choosing from a configurable pool of connections
> and round robin queries to them.

This message was sent by Atlassian JIRA
