impala-user mailing list archives

From Henry Robinson <he...@apache.org>
Subject Re: SendersBlockedTimer
Date Thu, 25 May 2017 21:32:22 GMT
On 25 May 2017 at 12:19, Evo Eftimov <evo.eftimov@isecc.com> wrote:

> Hi Henry,
>
>
>
> I was referring specifically to the EXCHANGE_NODE section of the
> Coordinator Fragment - doesn’t that pin it down specifically to the
> Coordinator Node, i.e. the node to which the JDBC Client is connected
> directly?
>
>
>
> Also, how can the streaming of records from a simple full table scan query
> like “select * from table” be accelerated so that the SendersBlockedTimer
> value does not represent 95% of the overall query time?
> Basically, imagine you have a 3GB parquet table in Impala and a JDBC Driver
> Client connected to the Coordinator ImpalaD, trying to stream out all of
> the data in the table (3GB) as quickly as possible.
>
> The execution part of the query completes blindingly fast and the data is
> streamed out of HDFS within 30 seconds. However, the Fetch phase of the full
> table scan query takes 15 min, and 14 min and 30 sec of that time is the
> value in the SendersBlockedTimer.
>

>
> The JDBC Client uses the latest Cloudera JDBC driver for Impala (which is
> actually the Simba driver) and performs nothing but ResultSet.next(),
> i.e. no parsing or data transformation of the columns of each row, no
> output to screen or disk etc. The network between the JDBC Client and
> Coordinator is 10 GB and “hdfs client get” of the csv version of the same
> table takes only 30 sec.
>
>
>
> Out of the above 15 min total time, Client Fetch Wait Time is 35% or about
> 6 min. Then we also have a SendersBlockedTimer of 14 min and 30 sec - so
> who is to be blamed here for the slow streaming of records compared to hdfs
> get: a) an inefficient implementation of the JDBC Client, or b) the
> Coordinator Node needing more resources, such as more parallel threads and
> therefore CPU cores?
>
>
>
> How do we interpret the above two figures and what do they point to - the
> JDBC driver or the Coordinator Node?
>

Most likely the driver: per the Client Fetch Wait Time, the query spends
about 6 minutes waiting on the client. SendersBlockedTimer tracks the amount
of time for which at least one sender was blocked. Since it is high, we know
that the coordinator is consuming results more slowly than they are being
sent to it. The coordinator does very little in a SELECT * query, so the
likelihood is that it is serving rows to the client as fast as the client
can consume them. Therefore I'd expect the client to be the bottleneck.
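
To make that reasoning concrete, here is a toy model in Python. It is only a
sketch under stated assumptions, not Impala's exchange code: one sender, one
receiver, and a bounded queue between them. The function name, the queue
capacity and all timings are made up for illustration.

```python
def simulate_exchange(num_batches, produce_s, consume_s, capacity):
    """Toy model: one sender, one receiver, a queue holding `capacity` batches.

    Returns (total_time, sender_blocked_time) in seconds. This is an
    illustrative assumption, not how Impala implements its exchange.
    """
    if num_batches == 0:
        return 0.0, 0.0
    starts = []        # starts[i]: when the receiver takes batch i off the queue
    enqueued_at = 0.0  # when the sender managed to enqueue its latest batch
    blocked = 0.0      # total time the sender waited for a free queue slot
    for i in range(num_batches):
        ready = enqueued_at + produce_s
        # A slot frees up when the receiver dequeues batch i - capacity.
        slot_free = starts[i - capacity] if i >= capacity else 0.0
        enqueued_at = max(ready, slot_free)
        blocked += enqueued_at - ready
        # The receiver handles one batch at a time, in order.
        prev_done = starts[-1] + consume_s if starts else 0.0
        starts.append(max(enqueued_at, prev_done))
    return starts[-1] + consume_s, blocked


# Slow receiver: 1 s to produce a batch, 10 s to consume it, room for 2 batches.
total, blocked = simulate_exchange(5, 1.0, 10.0, 2)
print(f"total={total}s sender_blocked={blocked}s")
```

With a receiver ten times slower than the sender, the blocked time grows
towards 90% of the total as the number of batches increases, which is the
same shape as a SendersBlockedTimer that accounts for most of the query time.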

Try using the impala-shell with -B set (and the output redirected to
/dev/null); this is about as fast as a single client can go right now and
should give you a feeling for a lower bound on the query time.
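
One way to time that run is a small Python wrapper like the sketch below. The
host name and query are placeholders, and the helper names are hypothetical;
the only impala-shell options used are -i (impalad to connect to), -B
(delimited output) and -q (query).

```python
import subprocess
import time


def build_impala_shell_cmd(impalad, query):
    # -i: impalad host:port, -B: plain delimited output, -q: query to run
    return ["impala-shell", "-i", impalad, "-B", "-q", query]


def time_query(impalad, query, run=subprocess.run):
    """Run the query with stdout discarded and return elapsed seconds.

    `run` is injectable so the function can be exercised without a cluster.
    """
    cmd = build_impala_shell_cmd(impalad, query)
    start = time.monotonic()
    run(cmd, stdout=subprocess.DEVNULL, check=True)
    return time.monotonic() - start


if __name__ == "__main__":
    # Placeholder host and table name; substitute your own.
    elapsed = time_query("coordinator-host:21000", "select * from my_table")
    print(f"fetched all rows in {elapsed:.1f}s")
```

Comparing this figure with the JDBC fetch time gives a rough idea of how much
of the 15 minutes is attributable to the driver.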

How much data does this query return? The client API and driver are not
really optimized for large ETL-style retrieval - for that you might be
better off using INSERT to write some files to HDFS, and then downloading
them in parallel from HDFS.
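
For the download half of that approach, a minimal sketch in Python might look
like the following. It assumes the INSERT has already written its files and
that you have their HDFS paths; the paths, the worker count and the helper
names are hypothetical. It shells out to `hdfs dfs -get` once per file.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def fetch_file(hdfs_path, local_dir, run=subprocess.run):
    # Copy one file from HDFS to the local directory.
    run(["hdfs", "dfs", "-get", hdfs_path, local_dir], check=True)
    return hdfs_path


def fetch_all(hdfs_paths, local_dir, workers=4, run=subprocess.run):
    """Download several HDFS files concurrently; returns the paths fetched."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: fetch_file(p, local_dir, run), hdfs_paths))


if __name__ == "__main__":
    # Hypothetical file names written by the INSERT; substitute your own.
    paths = ["/user/evo/export/part-0.parq", "/user/evo/export/part-1.parq"]
    fetch_all(paths, "/tmp/export")
```

Threads are enough here because each worker only waits on an external
process; the number of workers is something to tune against the network and
local disk.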

Best,
Henry


>
>
> Regards,
>
> Evo
>
>
>
> *From:* Henry Robinson [mailto:henry@apache.org]
> *Sent:* Thursday, May 25, 2017 7:23 PM
> *To:* user@impala.incubator.apache.org; evo.eftimov@isecc.com
> *Subject:* Re: SendersBlockedTimer
>
>
>
> Hi Evo -
>
>
>
> Just to clarify: the EXCHANGE_NODE is the operator in the plan tree which
> mediates communication between workers, not between the client and the
> coordinator.
>
>
>
> The SendersBlockedTimer measures the amount of time that senders have row
> batches to deliver to an exchange node, but the exchange is busy delivering
> a previously sent row batch. That is, the senders are sending faster than
> the exchange node (and the upstream plan) processes those rows.
>
>
>
> In a select * from table query, there'll be one exchange on the
> coordinator, but that's not generally true - exchanges connect all the
> fragment instances. Having the senders blocked in this case is typical,
> because there'll be lots of senders sending at a high rate, fanning in to a
> single receiver, serving a single client.
>
>
>
> The delivery of rows to the client is managed by the coordinator fragment
> instance through a different part of the code to the exchange node.
>
>
>
> Henry
>
>
>
> On 25 May 2017 at 08:31, Evo Eftimov <evo.eftimov@isecc.com> wrote:
>
> What is the purpose of the SendersBlockedTimer attribute in the
> EXCHANGE_NODE section of the Coordinator Fragment, part of the PROFILE of
> a SQL statement executed by Impala?
>
>
>
> I have reviewed the Impala source code and know that the Exchange Node
> uses a Blocking Queue as part of the “Stream Manager” module which it
> instantiates.
>
>
>
> In the specific context I am interested in, the Exchange Node returns the
> rows from a result set to a JDBC driver client. The result set is produced
> by a simple full-table-scan-only query of the type “select * from table”.
>
>
>
> The “Sender” Parallel Threads (presumably within the Exchange Node) publish
> rows to the Blocking Queue, also in the Exchange Node, and the JDBC client
> reads rows from the same queue via a remote JDBC session / connection over
> TCP/IP - is that a correct description of how the Exchange Node mediates
> between the JDBC client on the one hand and ImpalaD workers on the other?
> Btw, the Exchange Node is part of the Coordinator Node in terms of
> terminology - right?
>
>
>
> My specific question is what is the purpose/meaning of
> SendersBlockedTimer - e.g. does it mean that the Sender Threads WITHIN
> the Exchange Node have been in a blocked state for the time shown in the
> value of the attribute? And if this is correct, does that mean that
> they have been blocked because the JDBC Client could not keep up with
> draining the Blocking Queue during the aggregated time duration in
> SendersBlockedTimer?
>
>
>
>
>
