hive-user mailing list archives

From Namit Jain <nj...@facebook.com>
Subject Re: OutOfMemory when doing map-side join
Date Mon, 15 Jun 2009 06:02:01 GMT
I will take a look and get back to you



On 6/14/09 10:59 PM, "Min Zhou" <coderplay@gmail.com> wrote:

The suggested subquery also failed, with a null pointer exception:

hive> select /*+ MAPJOIN(a) */ a.url_pattern, w.url
    > from (select x.url_pattern from application x where x.dt = '20090609') a
    > join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
FAILED: Unknown exception : null


$ cat /tmp/hive/hive.log | tail ...

2009-06-15 13:57:02,933 ERROR ql.Driver (SessionState.java:printError(279)) - FAILED: Unknown
exception : null
java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <njain@facebook.com> wrote:
The problem seems to be in partition pruning. The small table 'application' is partitioned, and there are probably 20k rows in the partition 20090609.

Due to a bug, the pruning is not happening and all partitions of 'application' are being loaded, which may be too much for the map join to handle. This is a serious bug, but for now can you put in a subquery and try the following:

select /*+ MAPJOIN(a) */ a.url_pattern, w.url
from (select x.url_pattern from application x where x.dt = '20090609') a
join web_log w
where w.logdate='20090611' and w.url rlike a.url_pattern;


Please file a JIRA for the above.




On 6/14/09 10:20 PM, "Min Zhou" <coderplay@gmail.com> wrote:

hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url
    > from application a join web_log w
    > where w.logdate='20090611' and w.url rlike a.url_pattern and a.dt='20090609';
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF web_log w))) (TOK_INSERT
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST
a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL
w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w) logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL
w) url) (. (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt) '20090609')))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        w
            Select Operator
              expressions:
                    expr: url
                    type: string
                    expr: logdate
                    type: string
              Common Join Operator
                condition map:
                     Inner Join 0 to 1
                condition expressions:
                  0 {0} {1}
                  1 {0} {1}
                keys:
                  0
                  1
                Position of Big Table: 1
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
      Local Work:
        Map Reduce Local Work
          Alias -> Map Local Tables:
            a
              Fetch Operator
                limit: -1
          Alias -> Map Local Operator Tree:
            a
                Select Operator
                  expressions:
                        expr: url_pattern
                        type: string
                        expr: dt
                        type: string
                  Common Join Operator
                    condition map:
                         Inner Join 0 to 1
                    condition expressions:
                      0 {0} {1}
                      1 {0} {1}
                    keys:
                      0
                      1
                    Position of Big Table: 1
                    File Output Operator
                      compressed: false
                      GlobalTableId: 0
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
          Select Operator
            Filter Operator
              predicate:
                  expr: (((3 = '20090611') and (2 regexp 0)) and (1 = '20090609'))
                  type: boolean
              Select Operator
                expressions:
                      expr: 0
                      type: string
                      expr: 2
                      type: string
                File Output Operator
                  compressed: true
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <njain@facebook.com> wrote:
I was looking at the code, and there may be a bug in the cartesian product codepath for map joins.

Can you run an explain plan and send it?






On 6/14/09 10:06 PM, "Min Zhou" <coderplay@gmail.com> wrote:


1. Tried setting hive.mapjoin.cache.numrows to 100; it failed with the same exception.
2. Actually, we used to do the same thing by loading small tables into the memory of each map node in a plain map-reduce job (sketched below), on the same cluster and with the same heap size as the Hive map-side join. OOM exceptions never happened there: only about 1MB was needed to load those 20k records, even with mapred.child.java.opts set to -Xmx200m.
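
For what it's worth, the manual approach in point 2 looks roughly like this in the old mapred API. This is only a hedged sketch: the class name and the application.patterns.file property are illustrative, and a real job would typically ship the file via the DistributedCache rather than read a configured local path.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UrlPatternJoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // ~20k url_pattern rows: roughly 1MB in memory, loaded once per map task.
  private final List<String> patterns = new ArrayList<String>();

  public void configure(JobConf job) {
    try {
      BufferedReader in = new BufferedReader(
          new FileReader(job.get("application.patterns.file")));
      String line;
      while ((line = in.readLine()) != null) {
        patterns.add(line.trim());
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load side table", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // One log line per record; emit (pattern, url) for every match,
    // mirroring the rlike join condition in the Hive query.
    String url = value.toString();
    for (String pattern : patterns) {
      if (url.matches(pattern)) {
        output.collect(new Text(pattern), new Text(url));
      }
    }
  }
}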

here is the schema of our small table:
> describe application;
transaction_id  string
subclass_id     string
class_id        string
memo            string
url_alias       string
url_pattern     string
dt              string  (daily partitioned)
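
For reference, the corresponding DDL would look roughly like this; the sketch is reconstructed from the describe output above, with row format and storage clauses omitted as unknowns:

CREATE TABLE application (
  transaction_id STRING,
  subclass_id    STRING,
  class_id       STRING,
  memo           STRING,
  url_alias      STRING,
  url_pattern    STRING
) PARTITIONED BY (dt STRING);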

Thanks,
Min
On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <njain@facebook.com> wrote:
1. Can you reduce the number of cached rows and try?

2. Were you using the default memory settings for the mapper? If yes, can you increase them and try?

It would be useful to try both of them independently; that would also give a good idea of JDBM's memory consumption.
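
For example, from the Hive CLI (the values here are only illustrative starting points, not recommendations):

hive> set hive.mapjoin.cache.numrows=5000;
hive> set mapred.child.java.opts=-Xmx512m;

The first caps how many rows of the small table JDBM caches in memory for the map join; the second raises the per-task JVM heap.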


Can you send the exact schema/data of the small table if possible? You can file a JIRA and attach it there if it is not a security issue.

Thanks,
-namit



On 6/14/09 9:23 PM, "Min Zhou" <coderplay@gmail.com> wrote:

20k






