hive-user mailing list archives

From LLBian <linanmengxia...@126.com>
Subject hive on tez serialization and deserialization( custom Serde initialize() just called one time in hive client, when split in AM)
Date Thu, 21 Jan 2016 11:40:24 GMT

Hello all,

      My environment: Hadoop 2.6.0, Hive 1.2.1, Tez 0.7.0.
      Our team developed a Hive plug-in whose function is similar to hive-hbase-handler.

      When I execute the HQL "select count(*) from h_im;" (h_im is an external table backed by
HBase) in the Hive CLI, it throws the exceptions below.
      (I am sorry that I cannot copy the complete error output here: we work on an internal
network, so some information is omitted.)
      ---------------------------------------------
      INFO [Dispatcher thread: Central] history.HistoryEventHandler: .... .....
                  at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor.run(TezProcessor.java:172)
                  at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
                  at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:337)
                  ......  .......
      Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
Hive Runtime Error while processing writable org.apache.hadoop.hive.hbase....
                   at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92)
                   at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
                   at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:367)
                   at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor.run(TezProcessor.java:149)
                   .... 14 more
      Caused by : java.lang.NullPointerException
                   at com.fiberhome.nebula.datacenter.hbasehandler.NBHBaseSerde.deserialize(NBHBaseSerde.java:210)
                   at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:145)
                   at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$2(MapOperator.java:143)
                   at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:512)
                   ..... 18 more
-------------------------------------------------------------------------------------
         I know this is a custom Hive storage handler, but my questions are about how the two
(Tez and Hive) work together.
         NBHBaseSerde is a custom SerDe:
                   class NBHBaseSerde extends ColumnarSerDeBase implements Configurable {
                            @Override
                            initialize() { ...... }

                            deserialize() { ...... }
                           ....
                   }
          To debug this error, I added log statements to the related classes (local mode runs
correctly, and cluster mode is difficult to debug), but none of the log messages appear in the
YARN container logs (web UI on port 8088):
(1) As the exception shows, the NullPointerException occurred at "NBHBaseSerde.deserialize(NBHBaseSerde.java:210)",
and line 210 is:
     -------------------------------------------------------------------------------------------------------------
  line 210:  this.pair.setValue(zoneid);
     -------------------------------------------------------------------------------------------------------------

     I guessed that "pair" might be null, so I added log statements before line 210 (line 210 is
not the first line of deserialize()):
   ---------------------------------------------------------------------------------------------------------------
LOG.info("deserialize begine ....."); // first statement of deserialize()

LOG.info("....pair.toString....." + pair.toString()); // placed just before "this.pair.setValue(zoneid)"
----------------------------------------------------------------------------------------------------------------
       After I replaced NBHBaseSerde.class in the JAR file, some strange things happened that I
still do not understand:
       ① No log message appears in the Hive log or in the YARN container log (port 8088): neither
"deserialize begine ....." nor "....pair.toString.....".
       ② The exception now says "Caused by: java.lang.NullPointerException at com.fiberhome.nebula.datacenter.hbasehandler.NBHBaseSerde.deserialize(NBHBaseSerde.java:211)",
that is, the line "LOG.info("....pair.toString....." + pair.toString());" is now the failing
line.
        I am confused: these statements must have been reached, so where are the log messages?
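
A note on point ②: the NullPointerException moving to line 211 is consistent with "pair" itself being null, because pair.toString() on a null reference throws before LOG.info can print anything. Below is a minimal sketch of a null-safe form of that debug line. It assumes, purely for illustration, that "pair" behaves like a java.util.Map.Entry (the real type is not shown in this message); NullSafeLogDemo and describe are hypothetical names.

```java
import java.util.AbstractMap;
import java.util.Map;

class NullSafeLogDemo {
    // Hypothetical helper: builds the debug message without dereferencing 'pair'.
    static String describe(Map.Entry<String, String> pair) {
        // String concatenation renders a null reference as "null" instead of throwing,
        // unlike an explicit pair.toString() call.
        return "....pair.toString....." + pair;
    }

    public static void main(String[] args) {
        Map.Entry<String, String> unset = null;
        System.out.println(describe(unset));   // message is still produced for a null pair

        Map.Entry<String, String> set = new AbstractMap.SimpleEntry<>("zone", "42");
        System.out.println(describe(set));
    }
}
```

With this form, the log line itself can no longer be the source of the exception, so the stack trace would point back at the statement that actually uses "pair".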

(2) The field "pair" is assigned a value in NBHBaseSerde.initialize().
             The first line of NBHBaseSerde.initialize() logs the message "Serde initializeation begine..",
and I can find only one occurrence of that message in the Hive log. So I guess NBHBaseSerde.initialize()
was executed just once during the entire execution of the HQL.
            In other words, the log suggests that NBHBaseSerde.initialize() ran only once, in the
Hive client, and was not called again after the job was submitted. Am I right?
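
Whichever way the answer to this question turns out, one way to make the failure explain itself is a guard in deserialize() that reports which JVM and which instance was never initialized. The sketch below is an assumption-laden stand-in, not Hive code: GuardedSerDeSketch, its simplified method signatures, the "example.key" property, and the StringBuilder field are all hypothetical and only mimic the shape of initialize() and deserialize().

```java
import java.lang.management.ManagementFactory;
import java.util.Properties;

class GuardedSerDeSketch {
    // Stand-in for the real 'pair' field; it is only set by initialize().
    private StringBuilder pair;

    // Simplified, hypothetical signature: the real SerDe receives a Configuration as well.
    void initialize(Properties tableProps) {
        pair = new StringBuilder(tableProps.getProperty("example.key", ""));
    }

    String deserialize(String zoneId) {
        if (pair == null) {
            // The runtime name usually looks like "pid@host", which shows which process failed.
            String jvm = ManagementFactory.getRuntimeMXBean().getName();
            throw new IllegalStateException("deserialize() called before initialize() on instance "
                    + System.identityHashCode(this) + " in JVM " + jvm);
        }
        pair.setLength(0);
        pair.append(zoneId);
        return pair.toString();
    }

    public static void main(String[] args) {
        GuardedSerDeSketch initialized = new GuardedSerDeSketch();
        initialized.initialize(new Properties());
        System.out.println(initialized.deserialize("z1"));

        try {
            new GuardedSerDeSketch().deserialize("z1");   // fresh instance, never initialized
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the message travels inside the exception, it shows up in the same stack trace that is already visible, even when the ordinary log output cannot be located.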

        Other fields that, like "pair", are set in NBHBaseSerde.initialize() also lose their
values after the DAG job is submitted to the cluster. So I added setters in NBHiveHBaseUtils.java
to save these values, and tried to restore them in MapRecordProcessor.init(),
like this:
-------------------------------------------------------------------------
       legacyMRInput = getMRInput(inputs); //this is source code
       ......
       NBHiveHBaseUtils.setPair(pair);//I added
        ....... .....
---------------------------------------------------------------------------
        This failed. I found that when I set "hive.compute.splits.in.am=true", the logic differs
from traditional MapReduce, and it seems MapRecordProcessor.init() was not executed (the log
messages I added to MapRecordProcessor.init() were not printed).
         But in the exception message I also found "org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor.run(TezProcessor.java:149)".
In my Hive source code:
---------------------------------
line 147: MRTaskReporter mrReporter = new MRTaskReporter(getContext());
line 148: rproc.init(mrReporter, input, outputs);
line 149: rproc.run();
----------------------------------------
     Here rproc is the MapRecordProcessor, which means MapRecordProcessor.init() was executed. But
why can't I find any of the log messages I added to it?
      I also added a LOG message before line 149, and it was not printed in the Hive log or the
container log either. Why? I cannot understand this.
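
On the setter approach in (2): values held in fields of a utility class exist only in the JVM that set them, so a task running in another container would not see what the client stored. A common alternative is to ship such values as plain key/value configuration and rebuild the objects on the task side. The sketch below only simulates that round trip inside one process, using java.util.Properties as a stand-in for whatever configuration travels with the job; ConfigRoundTripSketch, pack, unpack, and the key "nb.hbase.zone.id" are hypothetical names, and whether this fits NBHBaseSerde is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

class ConfigRoundTripSketch {
    // Hypothetical configuration key; not a real Hive or HBase setting.
    static final String ZONE_KEY = "nb.hbase.zone.id";

    // "Client side": write the value into a key/value set and serialize it to bytes.
    static byte[] pack(String zoneId) {
        Properties props = new Properties();
        props.setProperty(ZONE_KEY, zoneId);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // "Task side": start from an empty key/value set and read only what was shipped.
    static String unpack(byte[] shipped) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(shipped));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(ZONE_KEY);
    }

    public static void main(String[] args) {
        byte[] shipped = pack("zone-7");
        System.out.println(unpack(shipped));
    }
}
```

The design point is that the task side depends only on the shipped bytes, never on state left behind in another process.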

(3) As the title says, I really cannot understand Tez's logic for serialization and deserialization
when it processes HiveQL. I have also studied the Hive and Tez source code, and I know that Tez's
split mechanism can reach a custom storage handler through HiveInputFormat. I think maybe I
should call NBHBaseSerde.initialize() again somewhere, but I have not found an appropriate
place.

   I am eager to get your guidance. I would very much appreciate your help.
   Any reply will be appreciated.

Thank you and best regards,

---LLBian



     
      