hive-user mailing list archives

From LLBian <>
Subject How the actual "sample data" are implemented when using tez reduce auto-parallelism
Date Sat, 27 Feb 2016 03:13:48 GMT

Hello, respected experts:
Recently I have been studying Tez reduce auto-parallelism. I read the article "Apache Tez: Dynamic Graph Reconfiguration", as well as TEZ-398 and HIVE-7158.
HIVE-7158 says that "Tez can optionally sample data from a fraction of the tasks of a vertex and use that information to choose the number of downstream tasks for any given scatter gather edge".
I know how to enable this optimization, but I am confused by the following:

" Tez defines a VertexManager event that can be used to send an arbitrary user payload to
the vertex manager of a given vertex. The partitioning tasks (say the Map tasks) use this
event to send statistics such as the size of the output partitions produced to the ShuffleVertexManager
for the reduce vertex. The manager receives these events and tries to model the final output
statistics that would be produced by the all the tasks."
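
To make my reading of this concrete, here is a rough sketch (my own guess, not the actual Tez source) of what I imagine the extrapolation step looks like; the class name, method, and config values are my assumptions, loosely modeled on the tez.shuffle-vertex-manager.* settings:

```java
// Hypothetical sketch of a ShuffleVertexManager-style estimate: extrapolate
// the total shuffle output from the tasks that have reported so far, then
// size the reduce vertex so each reducer gets roughly a target input size.
public class ParallelismEstimator {
    // Assumed values, in the spirit of tez.shuffle-vertex-manager.* defaults.
    static final long DESIRED_TASK_INPUT_BYTES = 100L * 1024 * 1024; // ~100 MB per reducer
    static final int MIN_TASK_PARALLELISM = 1;

    /**
     * @param reportedOutputBytes sum of output sizes reported so far via
     *                            VertexManager events from completed map tasks
     * @param completedTasks      number of source tasks that have reported
     * @param totalTasks          total number of source (map) tasks
     * @return estimated number of reduce tasks
     */
    static int estimateReducers(long reportedOutputBytes, int completedTasks, int totalTasks) {
        // Extrapolate: assume the remaining tasks produce output at the same rate.
        double expectedTotal = (double) reportedOutputBytes * totalTasks / completedTasks;
        int reducers = (int) Math.ceil(expectedTotal / DESIRED_TASK_INPUT_BYTES);
        return Math.max(reducers, MIN_TASK_PARALLELISM);
    }

    public static void main(String[] args) {
        // 25 of 100 maps done, 600 MB reported so far -> ~2.4 GB expected total
        System.out.println(estimateReducers(600L * 1024 * 1024, 25, 100));
    }
}
```

Is this roughly the model the manager builds, with the "enough samples" decision coming from a slow-start-like source-completion fraction?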

(1) How is the actual "sample data" collected? That is, how does the ShuffleVertexManager of the reduce vertex know when enough sample data has been gathered to estimate the parallelism of the whole vertex? Is this related to reduce slow-start? I studied the source code of Apache Tez 0.7.0, but it is still not clear to me. Maybe I am just missing something.
(2) Do the partitioning tasks proactively send their output statistics to the consumer ShuffleVertexManager? Is the event sent over RPC or HTTP?

I am eager to receive your guidance.

Any reply would be greatly appreciated.
