drill-dev mailing list archives

From "Parth Chandra (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5050) C++ client library has symbol resolution issues when loaded by a process that already uses boost::asio
Date Fri, 18 Nov 2016 00:24:58 GMT
Parth Chandra created DRILL-5050:
------------------------------------

             Summary: C++ client library has symbol resolution issues when loaded by a process that already uses boost::asio
                 Key: DRILL-5050
                 URL: https://issues.apache.org/jira/browse/DRILL-5050
             Project: Apache Drill
          Issue Type: Bug
          Components: Client - C++
    Affects Versions: 1.6.0
         Environment: MacOs
            Reporter: Parth Chandra
            Assignee: Parth Chandra
             Fix For: 2.0.0


h4. Summary

On MacOS, the Drill ODBC driver hangs when loaded by any process that might also be using
{{boost::asio}}. This is observed in trying to connect to Drill via the ODBC driver using
Tableau.


h4. Analysis
The problem is seen in the Drill client library on MacOS, in the following method:
{code}
 DrillClientImpl::recvHandshake
.
.
    m_io_service.reset();
    if (DrillClientConfig::getHandshakeTimeout() > 0){
        m_deadlineTimer.expires_from_now(boost::posix_time::seconds(DrillClientConfig::getHandshakeTimeout()));
        m_deadlineTimer.async_wait(boost::bind(
                    &DrillClientImpl::handleHShakeReadTimeout,
                    this,
                    boost::asio::placeholders::error
                    ));
        DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Started new handshake wait timer with "
                << DrillClientConfig::getHandshakeTimeout() << " seconds." << std::endl;)
    }

    async_read(
            this->m_socket,
            boost::asio::buffer(m_rbuf, LEN_PREFIX_BUFLEN),
            boost::bind(
                &DrillClientImpl::handleHandshake,
                this,
                m_rbuf,
                boost::asio::placeholders::error,
                boost::asio::placeholders::bytes_transferred)
            );
    DRILL_MT_LOG(DRILL_LOG(LOG_DEBUG) << "DrillClientImpl::recvHandshake: async read waiting for server handshake response.\n";)
    m_io_service.run();

.
.

{code}

The call to {{io_service::run}} returns without invoking any of the handlers that have been
registered. The {{io_service}} object has two tasks in its queue, the timer task, and the
socket read task. However, in the run method, the state of the {{io_service}} object appears
to change and the number of outstanding tasks becomes zero. The run method therefore returns
immediately. Subsequently, any query request sent to the server hangs as data is never pulled
off the socket.

This is bizarre behaviour and typically points to build problems. 

Further investigation revealed something more interesting. {{boost::asio}} is a header-only library.
In other words, there is no actual library {{libboost_asio}}. All the code is compiled into
the binary that includes the headers of {{boost::asio}}. It so happens that the Tableau process
has a library (libtabquery) that uses {{boost::asio}}, so the code for {{boost::asio}} is already
loaded into process memory. When the drill client library (via the ODBC driver) is loaded
by the loader, the drill client library brings its own copy of the {{boost::asio}} code. At
runtime, the drill client code jumps to an address that resolves to an address inside the
libtabquery copy of {{boost::asio}}. And that code returns incorrectly.

Really? How is that even allowed? Two copies of {{boost::asio}} in the same process? Even
if that is allowed, since the code is included at compile time, calls to the {{boost::asio}}
library should be resolved using internal linkage. And if the call to {{boost::asio}} is not
resolved statically, the dynamic loader would encounter two symbols with the same name and
would give us an error. And even if the linker picks one of the symbols, as long as the code
is the same (for example if both libraries use the same version of boost) can that cause a
problem? Even more importantly, how do we fix that?

h4. Some assembly required

The disassembled libdrillClient shows this code inside recvHandshake
{code}
000000000003dd8f    movq    -0xb0(%rbp), %rdi       
000000000003dd96    addq    $0xc0, %rdi
000000000003dd9d    callq   0x1bff42                ## symbol stub for: __ZN5boost4asio10io_service3runEv
000000000003dda2    movq    -0xb0(%rbp), %rdi
000000000003dda9    cmpq    $0x0, 0x190(%rdi)
000000000003ddb4    movq    %rax, -0x158(%rbp)
{code}

and later in the code 
{code}
0000000000057216    retq    
0000000000057217    nopw    (%rax,%rax)
__ZN5boost4asio10io_service3runEv:                 ## definition of io_service::run
0000000000057220    pushq   %rbp
0000000000057221    movq    %rsp, %rbp
0000000000057224    subq    $0x30, %rsp
0000000000057228    leaq    -0x18(%rbp), %rax
000000000005722c    movq    %rdi, -0x8(%rbp)        
0000000000057230    movq    -0x8(%rbp), %rdi
0000000000057234    movq    %rdi, -0x28(%rbp)
{code}


Note that in recvHandshake the call instruction jumps to an address that is an offset (0x1bff42).
This offset happens to be beyond the end of the library. It certainly isn't the offset at
which the io_service::run method is defined (0x57220).

The linker is definitely not resolving the address statically, but we had already guessed
that. It is, in fact, jumping to a stub method, and at runtime this address is being resolved
to the address of the {{io_service::run}} method in libtabquery.

Just to check, in the debugger, we can see the following two implementations of {{io_service::run}}
in the process

{code}
libtabquery.dylib`boost::asio::io_service::run():
   0x10d597a10:  pushq  %rbp
   0x10d597a11:  movq   %rsp, %rbp
   0x10d597a14:  pushq  %rbx
   0x10d597a15:  subq   $0x18, %rsp
   0x10d597a19:  movq   %rdi, %rbx
   0x10d597a1c:  movl   $0x0, -0x18(%rbp)
   0x10d597a23:  callq  0x10d5b73a4               ; symbol stub for: boost::system::system_category()
   0x10d597a28:  movq   %rax, -0x10(%rbp) 
   0x10d597a2c:  movq   0x8(%rbx), %rdi             
   0x10d597a30:  leaq   -0x18(%rbp), %rsi
   0x10d597a34:  callq  0x10d5b71e2               ; symbol stub for: boost::asio::detail::task_io_service::run(boost::system::error_code&)
   0x10d597a39:  cmpl   $0x0, -0x18(%rbp)
   0x10d597a3d:  jne    0x10d597a46               ; boost::asio::io_service::run() + 54
   0x10d597a3f:  addq   $0x18, %rsp
   0x10d597a43:  popq   %rbx
   0x10d597a44:  popq   %rbp
   0x10d597a45:  retq   
   0x10d597a46:  leaq   -0x18(%rbp), %rdi
   0x10d597a4a:  callq  0x10d5b71a6               ; symbol stub for: boost::asio::detail::do_throw_error(boost::system::error_code const&)
   0x10d597a4f:  nop        

libdrillClient.dylib`boost::asio::io_service::run() at io_service.ipp:57:
   0x11f158300:  pushq  %rbp
   0x11f158301:  movq   %rsp, %rbp
   0x11f158304:  subq   $0x30, %rsp
   0x11f158308:  leaq   -0x18(%rbp), %rax
   0x11f15830c:  movq   %rdi, -0x8(%rbp)
   0x11f158310:  movq   -0x8(%rbp), %rdi
   0x11f158314:  movq   %rdi, -0x28(%rbp)
   0x11f158318:  movq   %rax, %rdi
   0x11f15831b:  callq  0x11f2c210c               ; symbol stub for: boost::system::error_code::error_code()
   0x11f158320:  leaq   -0x18(%rbp), %rsi
   0x11f158324:  movq   -0x28(%rbp), %rax           
   0x11f158328:  movq   0x8(%rax), %rdi
   0x11f15832c:  callq  0x11f2c3516               ; symbol stub for: boost::asio::detail::task_io_service::run(boost::system::error_code&)
   0x11f158331:  leaq   -0x18(%rbp), %rdi
   0x11f158335:  movq   %rax, -0x20(%rbp)
   0x11f158339:  callq  0x11f2c1bf6               ; symbol stub for: boost::asio::detail::throw_error(boost::system::error_code const&)
   0x11f15833e:  movq   -0x20(%rbp), %rax
   0x11f158342:  addq   $0x30, %rsp
   0x11f158346:  popq   %rbp
   0x11f158347:  retq   

{code}

As suspected, the code for the two versions of {{io_service::run}} is different, so if the
code is executing the wrong version, then the behaviour will be, expectedly, unexpected.
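
The same check can be made from inside the process, without a debugger. The following is a minimal sketch, assuming a platform that provides {{dladdr}} in {{<dlfcn.h>}} (MacOS and glibc both do; older glibc versions may need {{-ldl}} at link time). The names {{image_containing}}, {{probe_target}} and {{image_of_probe_target}} are hypothetical and are not part of the Drill client. It asks the loader which loaded image contains a given code address.

```cpp
#include <dlfcn.h>   // dladdr, Dl_info
#include <string>

// Returns the path of the loaded image (executable or shared library) that
// contains addr, or an empty string if the loader cannot place the address.
std::string image_containing(const void* addr) {
    Dl_info info;
    if (dladdr(addr, &info) == 0 || info.dli_fname == nullptr) {
        return std::string();
    }
    return std::string(info.dli_fname);
}

// Hypothetical stand-in for a function whose resolved address we want to check.
void probe_target() {}

// Reports which image the address of probe_target belongs to.
std::string image_of_probe_target() {
    return image_containing(reinterpret_cast<const void*>(&probe_target));
}
```

Applied to the address that a call to {{io_service::run}} actually resolves to, a check like this would be expected to name libtabquery.dylib rather than libdrillClient.dylib in the failing case described above.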

h4. What does not work
Linking statically with boost has no effect. The code is inlined in the first place and is
effectively part of the dynamic library already.
Changing the load order of the libraries (by specifying LD_LIBRARY_PATH/DYLD_LIBRARY_PATH)
does not help. This is because the application library is already loaded into the process.
The linker -prebind flag does not help either. The flag is intended to tell the linker to
resolve all addresses at link time. Why this did not work is not clear.
 
Both libtabquery.dylib and libdrillClient.dylib contain symbols (functions) from the {{boost::asio}}
package. At runtime, the MacOS loader binds calls made by the drillClient library to the functions
defined in libtabquery. This causes the code to behave unpredictably and eventually the ODBC
driver 'hangs' waiting for data from the server.
 
Because the symbol linkage is being determined at runtime, changing the linker settings in
the Drill client build has no effect. This is true even if you build with static linkage (a
remarkable feature of MacOS!). Also, the boost builds between libtabquery and libdrillClient
are different even if we use the same boost version; the compiled code is different. This
is a critical part of the problem because if the compiled code were the same there would be
no problem if the code was called using the libtabquery version instead of the libdrillClient
version.
 
h4. Solution
The only way to resolve this is to use a 'shaded' version of boost in the drill client library.
Luckily for us, C++ namespaces, boost's bcp tool, and CMake together provide a way to rename
the boost namespace to any name we like and use it in the drill client code. This effectively
renames every boost symbol by placing it in a new namespace, so the symbol name conflict does
not arise.
Using this build of boost, and using static linking (just to make sure) in the Drill client
library, one is able to connect to and run queries against Drill from Tableau.
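
To illustrate why the rename avoids the clash, here is a minimal sketch. It does not use boost: {{boost_like}} and {{drill_boost_like}} are hypothetical stand-in namespaces, and {{client_asio}}, {{run_client_service}} and {{mangled_names_differ}} are names invented for this example. Because the namespace is part of every mangled name, the two copies of {{io_service}} end up with distinct symbols, so the loader has no same-named symbol to bind to the wrong library.

```cpp
#include <string>
#include <typeinfo>

// Stand-in for the stock library as another component might ship it.
// Hypothetical namespace, not the real boost.
namespace boost_like {
struct io_service {
    // Returns a tag so a caller can tell which copy actually ran.
    std::string run() { return "boost_like"; }
};
}  // namespace boost_like

// Stand-in for the renamed ('shaded') copy built for the client library.
namespace drill_boost_like {
struct io_service {
    std::string run() { return "drill_boost_like"; }
};
}  // namespace drill_boost_like

// Client code refers only to the shaded copy, here through a namespace alias.
namespace client_asio = drill_boost_like;

std::string run_client_service() {
    client_asio::io_service svc;
    return svc.run();
}

// On GCC and Clang, type_info::name() returns the mangled type name, which
// embeds the enclosing namespace, so the two names are different strings.
bool mangled_names_differ() {
    return std::string(typeid(boost_like::io_service).name()) !=
           std::string(typeid(drill_boost_like::io_service).name());
}
```

In the real build the rename is applied to the boost sources themselves rather than written by hand. The bcp documentation describes a {{--namespace}} option for this purpose; the exact invocation and the CMake wiring used for the Drill client are not shown here.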




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
