hive-dev mailing list archives

From Remus Rusanu <>
Subject A question about SMB join and the MapWork pathToPartitionInfo/pathToAliases: considered 'local' plan for the 'small' SMB aliases
Date Wed, 04 Dec 2013 13:06:00 GMT
Hi all,

I'm working on HIVE-5595 to add vectorization support for SMB join operators. The problem
I'm facing is that the vectorized record readers (e.g. VectorizedOrcRecordReader) depend
on MapWork.pathToPartitionInfo (see VectorizedRowBatchCtx.init).

What I discovered, though, is that for SMB join plans this map (along with the related
pathToAliases map) is incomplete. During population, which occurs in
GenMapRedUtils.setTaskPlan, aliasToPartnInfo is always populated:

plan.getAliasToPartnInfo().put(alias_id, aliasPartnDesc);

but the pathToAliases and pathToPartitionInfo maps are skipped in the local case:

    if (!local) {
      while (iterPath.hasNext()) {
        plan.getPathToPartitionInfo().put(path, prtDesc);

In this case local is true for the 'small' alias. It is set further up the call stack by:

      boolean local = pos != mapJoin.getConf().getPosBigTable();
      if (oldTask == null) {
        assert currPlan.getReduceWork() == null;
        initMapJoinPlan(mapJoin, currTask, ctx, local);
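
To make the effect concrete, here is a minimal self-contained sketch of the behavior described above. It is not the Hive code: SmbPlanSketch, PlanSketch, buildPlan, the aliases "a"/"b" and the /warehouse/... paths are all hypothetical stand-ins, and the maps hold plain strings instead of PartitionDesc objects. It only mirrors the shape of the quoted snippets: the per-alias map is always filled, while the per-path maps are filled only when local is false.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SmbPlanSketch {
    // Hypothetical, simplified stand-in for MapWork: only the three maps discussed here.
    static class PlanSketch {
        final Map<String, String> aliasToPartnInfo = new LinkedHashMap<>();
        final Map<String, List<String>> pathToAliases = new LinkedHashMap<>();
        final Map<String, String> pathToPartitionInfo = new LinkedHashMap<>();
    }

    // Mirrors the shape of the population logic quoted above (assumed, not copied):
    // the alias entry is always written, the path entries only when !local.
    static void setTaskPlan(PlanSketch plan, String alias, List<String> paths,
                            String partDesc, boolean local) {
        plan.aliasToPartnInfo.put(alias, partDesc);
        if (!local) {
            for (String path : paths) {
                plan.pathToAliases.computeIfAbsent(path, p -> new ArrayList<>()).add(alias);
                plan.pathToPartitionInfo.put(path, partDesc);
            }
        }
    }

    // Builds a two-alias plan; every alias other than the big table is treated as local.
    static PlanSketch buildPlan(int posBigTable) {
        String[] aliases = {"a", "b"};
        String[] paths = {"/warehouse/a", "/warehouse/b"};
        PlanSketch plan = new PlanSketch();
        for (int pos = 0; pos < aliases.length; pos++) {
            boolean local = pos != posBigTable;
            setTaskPlan(plan, aliases[pos], Collections.singletonList(paths[pos]),
                        "partDesc(" + aliases[pos] + ")", local);
        }
        return plan;
    }

    public static void main(String[] args) {
        PlanSketch plan = buildPlan(0);
        // Both aliases are known by alias ...
        System.out.println("aliasToPartnInfo keys:    " + plan.aliasToPartnInfo.keySet());
        // ... but only the big table's path is known by path.
        System.out.println("pathToPartitionInfo keys: " + plan.pathToPartitionInfo.keySet());
        // A path-keyed lookup for the small alias finds nothing.
        System.out.println("lookup /warehouse/b:      "
                + plan.pathToPartitionInfo.get("/warehouse/b"));
    }
}
```

In this sketch, a reader that resolves its partition info by path finds nothing for the small alias's path, even though the same alias is present in aliasToPartnInfo.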

My question to the SMB/MapJoin experts is about this anomaly. An SMB join is not local,
yet it is treated as local, so the resulting plan leaves the aforementioned maps
incomplete. Is local=true intentional in the SMB case, or is it just a leftover from
the original MapJoin implementation? Should SMB join set it to false, or will the sky
collapse? I can think of several 'workarounds', but there is too much context here that
I don't fully grok.

Relevant stack:

GenMapRedUtils.setTaskPlan(String, Operator<OperatorDesc>, Task<?>, boolean, GenMRProcContext, PrunedPartitionList) line: 658
GenMapRedUtils.setTaskPlan(String, Operator<OperatorDesc>, Task<?>, boolean, GenMRProcContext) line: 400
Task<Serializable>, GenMRProcContext, boolean) line: 157
MapJoinFactory$TableScanMapJoinProcessor.process(Node, Stack<Node>, NodeProcessorCtx, Object...) line: 219
DefaultRuleDispatcher.dispatch(Node, Stack<Node>, Object...) line: 90
GenMapRedWalker(DefaultGraphWalker).dispatchAndReturn(Node, Stack<Node>) line: 94
GenMapRedWalker.walk(Node) line: 54
GenMapRedWalker.walk(Node) line: 65
GenMapRedWalker.walk(Node) line: 65
GenMapRedWalker(DefaultGraphWalker).startWalking(Collection<Node>, HashMap<Node,Object>) line: 109
MapReduceCompiler.compile(ParseContext, List<Task<Serializable>>, HashSet<ReadEntity>, HashSet<WriteEntity>) line: 267
SemanticAnalyzer.analyzeInternal(ASTNode) line: 8927

