Date: Fri, 2 Dec 2016 01:40:36 +0000 (UTC)
From: Robert Grandl
To: Rajesh Balamohan
Subject: Re: Data manipulation in Hive over Tez

Thanks Rajesh for your answer. That was really helpful.

I would like to ask you a few more questions. I am trying to better understand how the <key, value> pairs are propagated and processed at the various vertices.

Edge: encodes the data movement logic.
Processing logic: processes and partitions the output key space according to its logic; also, the processing logic in every stage follows a sequence of operators through which every <key, value> pair is passed.

My questions are:

1) I am a bit confused about how far the processing logic in a stage goes (especially in Reduce Tasks). Like, given an input in terms of <key, value> pairs, what are the typical patterns of processing logic, i.e. what kind of <key, value> pairs can it produce and how much can the vertex change them? This question is a bit confusing, but basically I am trying to understand what kind of input {<key, value>}, output {<key, value>} patterns can be handled in general by a typical processing logic for SQL queries written in Hive atop Tez.

2) I can't really wrap my head around how much of a connection exists between the data movement encoded in the edges and how the <key, value> pairs are generated by a vertex and moved to the corresponding downstream vertices.

Thanks again for your answers,
Robert


On Tuesday, November 29, 2016 4:04 AM, Rajesh Balamohan wrote:

Hi Robert,

1. At a high level, you can refer to https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java, where the different vertices, edges etc. get created as per the execution plan. Consider a vertex as a combination of input, processing logic and output. Different vertices are connected together by edges, which define the data movement logic (broadcast, scatter-gather, one-to-one etc.). In the edge configuration, the type of the key/value classes is defined. This DAG is submitted to Tez for execution.
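As a rough illustration, here is a minimal sketch of that wiring using the plain Tez DAG API (this is not code lifted from DagUtils.java; the processor class names, parallelism and key/value types are made up for the example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

public class DagSketch {

  public static DAG build() {
    // Scatter-gather ("shuffle"-style) edge: Text keys, IntWritable values,
    // hash-partitioned across the downstream tasks. The edge config is where
    // the key/value classes are declared.
    OrderedPartitionedKVEdgeConfig edgeConf = OrderedPartitionedKVEdgeConfig
        .newBuilder(Text.class.getName(), IntWritable.class.getName(),
            HashPartitioner.class.getName())
        .build();

    // A vertex = input + processing logic + output. The processor class
    // names below are hypothetical LogicalIOProcessor implementations.
    Vertex map = Vertex.create("Map",
        ProcessorDescriptor.create("com.example.MyMapProcessor"), 4);
    Vertex reduce = Vertex.create("Reduce",
        ProcessorDescriptor.create("com.example.MyReduceProcessor"), 2);

    // The edge carries the data movement logic between the two vertices.
    return DAG.create("example-dag")
        .addVertex(map)
        .addVertex(reduce)
        .addEdge(Edge.create(map, reduce, edgeConf.createDefaultEdgeProperty()));
  }
}

The point of the sketch is only that the edge configuration is where the data movement type, the partitioner and the key/value classes get declared; DagUtils.java does the equivalent wiring, driven by the Hive execution plan.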
2. For task processing, you can refer to https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezProcessor.java on the Hive side.

3. On the Tez side, there are different types of inputs and outputs available, e.g. OrderedGroupedKVInput, UnorderedKVInput, OrderedPartitionedKVOutput, UnorderedKVOutput, UnorderedPartitionedKVOutput etc. for reading/writing data.

https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/input/OrderedGroupedKVInput.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/input/UnorderedKVInput.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/UnorderedKVOutput.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java

For instance, an ordered output writes its data in sorted format. There are different types of sorters available in Tez which can be chosen at runtime (DefaultSorter, PipelinedSorter). Intermediate data of tasks is written in the "IFile" format, which is similar to the IFile format in the MR world but has more optimizations in the Tez implementation.

https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java

As far as reading is concerned, the key/value class and serializer information is passed on as part of creating the DAG. E.g.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java#L360
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/readers/UnorderedKVReader.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/ValuesIterator.java
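To make that concrete, here is a rough sketch of how a processor body reads typed key/value pairs from an ordered, grouped input and writes them to a partitioned output. This is just the bare Tez runtime-library API, not Hive's operator pipeline, and the vertex names "Map" and "Reduce2" are made up:

import java.util.Map;

import org.apache.tez.runtime.api.LogicalInput;
import org.apache.tez.runtime.api.LogicalOutput;
import org.apache.tez.runtime.library.api.KeyValueWriter;
import org.apache.tez.runtime.library.api.KeyValuesReader;

public class ProcessorBodySketch {

  // Roughly what happens inside a LogicalIOProcessor's run(inputs, outputs);
  // the maps are keyed by the names of the connected vertices.
  static void process(Map<String, LogicalInput> inputs,
                      Map<String, LogicalOutput> outputs) throws Exception {
    // For an OrderedGroupedKVInput the reader groups values by key and has
    // already deserialized them into the key/value classes declared on the edge.
    KeyValuesReader reader = (KeyValuesReader) inputs.get("Map").getReader();
    KeyValueWriter writer = (KeyValueWriter) outputs.get("Reduce2").getWriter();

    while (reader.next()) {
      Object key = reader.getCurrentKey();           // e.g. a Text instance
      for (Object value : reader.getCurrentValues()) {
        writer.write(key, value);                    // identity pass-through
      }
    }
  }
}

Because the key/value classes and serializers are declared when the DAG is created, the Objects handed back by the reader are already instances of those classes, so they can be cast to the concrete types rather than being treated as opaque Object key / Object value.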
~Rajesh.B


On Sat, Nov 26, 2016 at 5:13 AM, Robert Grandl wrote:

> Hi guys,
>
> I am not sure where the right place to post this question is, hence I am
> sending it to both the Hive and Tez dev mailing lists.
>
> I am trying to get a better understanding of how the input / output for a
> task is handled. Typically, input stages read the data to be processed.
> Next, all the data will flow in the form of key / value pairs till the end
> of the job's execution.
>
> 1. Could you guys point me to the key files where I should look to
> identify that? I am mostly interested in intercepting where data is read
> by a task and where the data is written after the task processes the
> input data.
>
> 2. Also, is there a way I can identify the types (and hence read the
> actual values) of a key / value pair instead of just Object key, Object
> value?
>
> Thanks in advance,
> Robert

--
~Rajesh.B