hive-dev mailing list archives

From "gurmukh singh (JIRA)" <>
Subject [jira] [Created] (HIVE-18572) The record readers; InputFormat needs to be fixed for Tez as it generates 1 split
Date Mon, 29 Jan 2018 20:59:00 GMT
gurmukh singh created HIVE-18572:

             Summary: The record readers; InputFormat needs to be fixed for Tez as it generates 1 split
                 Key: HIVE-18572
             Project: Hive
          Issue Type: Bug
    Affects Versions: 2.1.0
            Reporter: gurmukh singh

The record reader needs to be fixed for Tez, as it generates only 1 split because the MRv2
CombineInputFormat broke that rule.

This has been fixed in MR but not Tez.

I am seeing strange behaviour in Tez: under Hive it sees all the data as a single split,
whereas MR sees all 79 files. This causes all the data to go to a single map task.

TEZ Processing
INFO  : Partition trusted.usage{ds=20180126, periode=1200} stats: [numFiles=1, numRows=79575067,
totalSize=3.164.605.993, rawDataSize=112439569671]
ELAPSED TIME: 1958.99 s

MR Processing
Partition trusted.usage{ds=20180126, periode=1200} stats: [numFiles=79, numRows=79575067,
totalSize=3172280778, rawDataSize=112418416260]

Log Tez
2018-01-29 16:50:04,825 [INFO] [InputInitializer {Map 1} #0] |split.TezMapredSplitsGrouper|:
Desired splits: 381 too large.  Desired splitLength: 8311476 Min splitLength: 50331648
New desired splits: 381 Final desired splits: 381 All splits have localhost: false Total length:
19166265870 Original splits: 1
2018-01-29 16:50:04,825 [INFO] [InputInitializer {Map 1} #0] |split.TezMapredSplitsGrouper|:
Using original number of splits: 1 desired splits: 381
2018-01-29 16:50:04,826 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Original
split size is 1 grouped split size is 1, for bucket: 1
2018-01-29 16:50:04,827 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number
of grouped splits: 1
2018-01-29 16:50:04,846 [INFO] [InputInitializer {Map 1} #0] |dag.RootInputInitializerManager|:
Succeeded InputInitializer for Input: usage on vertex vertex_1517207496169_0085_1_00 [Map
2018-01-29 16:50:04,848 [INFO] [App Shared Pool - #0] |impl.VertexImpl|: Cannot init vertex:
vertex_1517207496169_0085_1_00 [Map 1] numTasks: -1 numUnitializedEdges: 0 numInitializedInputs:
1 initWaitsForRootInitializers: true
2018-01-29 16:50:04,848 [INFO] [App Shared Pool - #0] |impl.VertexImpl|: Got updated RootInputsSpecs:
{usage=forAllWorkUnits=true, update=[1]}
2018-01-29 16:50:04,859 [INFO] [App Shared Pool - #0] |impl.VertexImpl|: Vertex vertex_1517207496169_0085_1_00
[Map 1] parallelism set to 1
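
The numbers in the grouper log can be reproduced with a short sketch. It is only a
model inferred from the logged values, not the actual Tez code: it assumes the grouper
derives the desired count from the total length and the minimum split length, and that
it keeps the original splits when there are no more of them than desired. The function
and parameter names are hypothetical.

```python
def grouped_split_count(total_length, min_split_length, original_splits):
    """Hypothetical model of the grouping decision seen in the log above."""
    # Assumption: the desired count is capped so that each group holds
    # at least min_split_length bytes.
    desired = total_length // min_split_length + 1
    # Assumption: grouping only merges splits and never subdivides one,
    # so the result cannot exceed what the InputFormat returned.
    final = min(desired, original_splits)
    return desired, final


# Values taken from the log: total length, min splitLength, original splits.
desired, final = grouped_split_count(19166265870, 50331648, 1)
print(desired, final)
```

Under this model, the single original split is what pins the vertex parallelism at 1;
had the InputFormat returned 79 original splits, the same call would yield 79.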

As per discussion with Gopal Vijayaraghavan:
"that line, right there: MRv2 CombineInputFormat broke that rule, so the record readers had
to be fixed to handle it"
