Date: Fri, 2 Dec 2016 01:40:36 +0000 (UTC)
From: Robert Grandl
To: Rajesh Balamohan
Subject: Re: Data manipulation in Hive over Tez

Thanks Rajesh for your answer. That was really helpful.

I would like to ask you a few more questions. I am trying to better understand how the <key, value> pairs are propagated and processed at the various vertices.

Edge: encodes the data movement logic.
Processing logic: processes and partitions the output key space according to its logic; also, the processing logic in every stage follows a sequence of operators through which every <key, value> pair is passed.

My questions are:

1) I am a bit confused about how far the processing logic in a stage goes (especially in Reduce Tasks). Like, given an input in terms of <key, value> pairs, what are the typical patterns of processing logic, i.e. what kind of <key, value> pairs can it produce and how much can the vertex change them? This question is a bit confusing, but basically I am trying to understand what kind of input {<key, value>}, output {<key, value>} patterns can be handled in general by a typical processing logic for SQL queries written in Hive atop Tez.

2) I can't really wrap my head around how much of a connection exists between the data movement encoded in the edges and how the <key, value> pairs are generated by a vertex and moved to the corresponding downstream vertices.

Thanks again for your answers,
Robert


On Tuesday, November 29, 2016 4:04 AM, Rajesh Balamohan wrote:

Hi Robert,

1. At a high level, you can refer to https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java, where the different vertices, edges etc. get created as per the execution plan. Consider a vertex as a combination of input, processing logic and output. Different vertices are connected together by edges, which define the data movement logic (broadcast, scatter-gather, one-to-one etc.). In the edge configuration, the type of the key/value classes is defined. This DAG is submitted to Tez for execution.
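As a rough illustration, here is a minimal sketch of that wiring using the plain Tez DAG API (this is not code lifted from DagUtils.java; the processor class names, parallelism and key/value types are made up for the example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

public class DagSketch {

  public static DAG build() {
    // Scatter-gather ("shuffle"-style) edge: Text keys, IntWritable values,
    // hash-partitioned across the downstream tasks. The edge config is where
    // the key/value classes are declared.
    OrderedPartitionedKVEdgeConfig edgeConf = OrderedPartitionedKVEdgeConfig
        .newBuilder(Text.class.getName(), IntWritable.class.getName(),
            HashPartitioner.class.getName())
        .build();

    // A vertex = input + processing logic + output. The processor class
    // names below are hypothetical LogicalIOProcessor implementations.
    Vertex map = Vertex.create("Map",
        ProcessorDescriptor.create("com.example.MyMapProcessor"), 4);
    Vertex reduce = Vertex.create("Reduce",
        ProcessorDescriptor.create("com.example.MyReduceProcessor"), 2);

    // The edge carries the data movement logic between the two vertices.
    return DAG.create("example-dag")
        .addVertex(map)
        .addVertex(reduce)
        .addEdge(Edge.create(map, reduce, edgeConf.createDefaultEdgeProperty()));
  }
}

The point of the sketch is only that the edge configuration is where the data movement type, the partitioner and the key/value classes get declared; DagUtils.java does the equivalent wiring, driven by the Hive execution plan.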
2. For task processing, you can refer to https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezProcessor.java on the Hive side.

3. On the Tez side, there are different types of inputs and outputs available, e.g. OrderedGroupedKVInput, UnorderedKVInput, OrderedPartitionedKVOutput, UnorderedKVOutput, UnorderedPartitionedKVOutput etc. for reading/writing data.

https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/input/OrderedGroupedKVInput.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/input/UnorderedKVInput.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/UnorderedKVOutput.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java

For instance, an ordered output writes its data in sorted format. There are different types of sorters available in Tez which can be chosen at runtime (DefaultSorter, PipelinedSorter). Intermediate data of tasks is written in the "IFile" format, which is similar to the IFile format in the MR world but has more optimizations in the Tez implementation.

https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java

As far as reading is concerned, the key/value class and serializer information is passed on as part of creating the DAG. E.g.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java#L360
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/readers/UnorderedKVReader.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/ValuesIterator.java
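To make that concrete, here is a rough sketch of how a processor body reads typed key/value pairs from an ordered, grouped input and writes them to a partitioned output. This is just the bare Tez runtime-library API, not Hive's operator pipeline, and the vertex names "Map" and "Reduce2" are made up:

import java.util.Map;

import org.apache.tez.runtime.api.LogicalInput;
import org.apache.tez.runtime.api.LogicalOutput;
import org.apache.tez.runtime.library.api.KeyValueWriter;
import org.apache.tez.runtime.library.api.KeyValuesReader;

public class ProcessorBodySketch {

  // Roughly what happens inside a LogicalIOProcessor's run(inputs, outputs);
  // the maps are keyed by the names of the connected vertices.
  static void process(Map<String, LogicalInput> inputs,
                      Map<String, LogicalOutput> outputs) throws Exception {
    // For an OrderedGroupedKVInput the reader groups values by key and has
    // already deserialized them into the key/value classes declared on the edge.
    KeyValuesReader reader = (KeyValuesReader) inputs.get("Map").getReader();
    KeyValueWriter writer = (KeyValueWriter) outputs.get("Reduce2").getWriter();

    while (reader.next()) {
      Object key = reader.getCurrentKey();           // e.g. a Text instance
      for (Object value : reader.getCurrentValues()) {
        writer.write(key, value);                    // identity pass-through
      }
    }
  }
}

Because the key/value classes and serializers are declared when the DAG is created, the Objects handed back by the reader are already instances of those classes, so they can be cast to the concrete types rather than being treated as opaque Object key / Object value.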
~Rajesh.B


On Sat, Nov 26, 2016 at 5:13 AM, Robert Grandl wrote:

> Hi guys,
>
> I am not sure where the right place to post this question is, hence I am
> sending it to both the Hive and Tez dev mailing lists.
>
> I am trying to get a better understanding of how the input / output for a
> task is handled. Typically, input stages read the data to be processed.
> Next, all the data will flow in the form of key / value pairs till the end
> of the job's execution.
>
> 1. Could you guys point me to the key files where I should look to
> identify that? I am mostly interested in intercepting where data is read
> by a task and where the data is written after the task processes the
> input data.
>
> 2. Also, is there a way I can identify the types (and hence read the
> actual values) of a key / value pair instead of just Object key, Object
> value?
>
> Thanks in advance,
> Robert

--
~Rajesh.B