hadoop-hive-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-584) Clean up global and ThreadLocal variables in Hive
Date Sat, 27 Jun 2009 00:37:47 GMT

    [ https://issues.apache.org/jira/browse/HIVE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724771#action_12724771 ]

Zheng Shao commented on HIVE-584:

Prasad and I had some offline discussions. First of all, some high-level conclusions:

0. Lifetime of the objects: The lifetime of every Hive operator is contained in the lifetime
of its configuration. The lifetime of every Task (ExecDriver, MapRedTask, MoveTask, etc.)
is contained in the lifetime of the db connection. The lifetime of a CommandProcessor is also
contained in the lifetime of the db connection (although some CommandProcessors need to switch
or close the db connection, since that is the actual operation they perform). The CliDriver and
HiveServerHandler should each correspond to a session.
1. ThreadLocal is good for keeping thread-local variables: ease of use (no need to pass
the object around or supply it in the constructor), ease of debugging (no other thread can
change the contents of the thread-local storage), and security (for the same reason as
debugging).
2. Passing objects in the constructor (and adding a getter) makes the program flow easier to
understand, and also allows the same thread to hold 2 different objects (e.g. db connections).
The need for the latter is not that strong, since all db connection calls are blocking: we can
always switch the thread-local db connection to the correct one before we start each call.
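
To make points 1 and 2 concrete, here is a minimal sketch of switching a thread-local db
connection before each blocking call. Every name in it (ThreadLocalSwitchSketch, DbConnection,
use, current, dropTable) is a hypothetical stand-in chosen for illustration, not an actual Hive
class or method.

```java
import java.util.ArrayList;
import java.util.List;

// All names here are hypothetical stand-ins, not actual Hive classes.
final class ThreadLocalSwitchSketch {

    /** Stand-in for a db connection; it only records the calls made on it. */
    static final class DbConnection {
        final String name;
        final List<String> calls = new ArrayList<>();

        DbConnection(String name) { this.name = name; }

        void execute(String op) { calls.add(op); }
    }

    /** One slot per thread: no other thread can read or replace its contents. */
    private static final ThreadLocal<DbConnection> CURRENT = new ThreadLocal<>();

    /** Binds the connection that later calls on this thread will use. */
    static void use(DbConnection connection) { CURRENT.set(connection); }

    static DbConnection current() {
        DbConnection connection = CURRENT.get();
        if (connection == null) {
            throw new IllegalStateException("no db connection bound to this thread");
        }
        return connection;
    }

    /** The callee reaches the connection without receiving it as a parameter. */
    static void dropTable(String table) { current().execute("drop " + table); }

    public static void main(String[] args) {
        DbConnection first = new DbConnection("first");
        DbConnection second = new DbConnection("second");

        use(first);        // switch before the blocking call starts
        dropTable("t1");
        use(second);       // same thread, different connection
        dropTable("t2");

        System.out.println(first.name + ": " + first.calls);
        System.out.println(second.name + ": " + second.calls);
    }
}
```

Because each call blocks until it returns, the switch in use() cannot race with a call that is
still running on the same thread, which is why one thread-local slot is enough here.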

Second, we have to choose among these 3 options:
A. Make Hive consist of the db connection and the conf, and keep it in thread-local storage.
Tasks (including ExecDriver), CommandProcessors (including Driver), and also CliDriver/HiveServerHandler
will access this thread-local db connection and conf. Make SessionState consist of the other
session-specific things like stdout, history, etc. (but NOT Hive). SessionState is also thread-local;
CliDriver will access SessionState for this information and access Hive for the db
connection etc. So Hive and SessionState are independent of each other, and both are thread-local.
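
A rough sketch of what option A could look like, under heavy simplification. The classes below
only borrow the names used in this discussion; their fields and methods (dbConnection,
processLine, execute, set/get) are assumptions made for illustration and do not reflect the
actual Hive code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of option A. All classes are simplified, hypothetical stand-ins.
final class OptionASketch {

    /** Db connection plus conf, held in its own thread-local slot. */
    static final class Hive {
        private static final ThreadLocal<Hive> CURRENT = new ThreadLocal<>();

        final String dbConnection;                        // stand-in for a real connection
        final Map<String, String> conf = new HashMap<>(); // stand-in for HiveConf

        Hive(String dbConnection) { this.dbConnection = dbConnection; }

        static void set(Hive hive) { CURRENT.set(hive); }
        static Hive get() { return CURRENT.get(); }
    }

    /** Session-only state (stdout, history); deliberately has no Hive field. */
    static final class SessionState {
        private static final ThreadLocal<SessionState> CURRENT = new ThreadLocal<>();

        final StringBuilder out = new StringBuilder();    // stand-in for stdout
        final List<String> history = new ArrayList<>();

        static void set(SessionState state) { CURRENT.set(state); }
        static SessionState get() { return CURRENT.get(); }
    }

    /** A Task only touches the thread-local Hive, never the session. */
    static final class MoveTask {
        String execute() { return "moved via " + Hive.get().dbConnection; }
    }

    /** The driver uses both thread-locals, each for its own purpose. */
    static final class CliDriver {
        String processLine(String line) {
            SessionState session = SessionState.get();
            session.history.add(line);
            String result = new MoveTask().execute();
            session.out.append(result).append('\n');
            return result;
        }
    }
}
```

A caller would bind both slots once per thread, e.g. Hive.set(...) and SessionState.set(...),
before handing lines to CliDriver.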

B. Make Hive consist of the db connection and the conf. Pass Hive as a constructor/initialize
parameter to all Tasks (including ExecDriver) and CommandProcessors (including Driver). Make
SessionState consist of session-specific things like stdout, history, and ALSO Hive. CliDriver/HiveServerHandler
will use the thread-local SessionState for everything. So only SessionState is thread-local
(and Hive is part of it).
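
Option B could be sketched roughly as follows. As before, this is only an illustration under
simplifying assumptions: the class names come from this discussion, but the members shown
(dbConnection, getHive, processLine, execute, set/get) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of option B. All classes are simplified, hypothetical stand-ins.
final class OptionBSketch {

    /** Db connection plus conf; a plain object with no thread-local of its own. */
    static final class Hive {
        final String dbConnection;   // stand-in for a real connection

        Hive(String dbConnection) { this.dbConnection = dbConnection; }
    }

    /** The only thread-local: session state, which also owns the Hive. */
    static final class SessionState {
        private static final ThreadLocal<SessionState> CURRENT = new ThreadLocal<>();

        final Hive hive;
        final StringBuilder out = new StringBuilder();   // stand-in for stdout
        final List<String> history = new ArrayList<>();

        SessionState(Hive hive) { this.hive = hive; }

        static void set(SessionState state) { CURRENT.set(state); }
        static SessionState get() { return CURRENT.get(); }
    }

    /** A Task receives Hive explicitly and exposes it through a getter. */
    static final class MoveTask {
        private final Hive hive;

        MoveTask(Hive hive) { this.hive = hive; }

        Hive getHive() { return hive; }

        String execute() { return "moved via " + hive.dbConnection; }
    }

    /** The driver reads everything, including Hive, from the thread-local session. */
    static final class CliDriver {
        String processLine(String line) {
            SessionState session = SessionState.get();
            session.history.add(line);
            String result = new MoveTask(session.hive).execute();
            session.out.append(result).append('\n');
            return result;
        }
    }
}
```

Compared with option A, the Task no longer reaches into any thread-local storage; the only
lookup happens once, in the driver.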

C. The same as B, except that SessionState is a parameter to the constructor of CliDriver/HiveServerHandler.
So there is no thread-local storage at all.
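
For comparison, option C might look like the sketch below, again with hypothetical members
(dbConnection, processLine, execute) on classes that only share their names with the ones
discussed here. With no thread-local storage, one thread can drive two sessions side by side.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of option C. All classes are simplified, hypothetical stand-ins.
final class OptionCSketch {

    /** Db connection plus conf, passed around explicitly. */
    static final class Hive {
        final String dbConnection;   // stand-in for a real connection

        Hive(String dbConnection) { this.dbConnection = dbConnection; }
    }

    /** Session state owns the Hive; there is no static or thread-local lookup. */
    static final class SessionState {
        final Hive hive;
        final List<String> history = new ArrayList<>();

        SessionState(Hive hive) { this.hive = hive; }
    }

    /** A Task still needs only a Hive, exactly as in option B. */
    static final class MoveTask {
        private final Hive hive;

        MoveTask(Hive hive) { this.hive = hive; }

        String execute() { return "moved via " + hive.dbConnection; }
    }

    /** The driver gets its session through the constructor. */
    static final class CliDriver {
        private final SessionState session;

        CliDriver(SessionState session) { this.session = session; }

        String processLine(String line) {
            session.history.add(line);
            return new MoveTask(session.hive).execute();
        }
    }

    public static void main(String[] args) {
        // Two independent sessions on the same thread, with no switching step.
        CliDriver first = new CliDriver(new SessionState(new Hive("db1")));
        CliDriver second = new CliDriver(new SessionState(new Hive("db2")));
        System.out.println(first.processLine("load data"));
        System.out.println(second.processLine("load data"));
    }
}
```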

The benefit shared by all 3 options is that Tasks (including ExecDriver) and CommandProcessors (including
Driver) don't need to deal with the session: they just need a Hive.

NOTE: All Hive operators (in the mapper and reducer) should continue to use the configuration
that is passed to them, to conform to the Hadoop model.

> Clean up global and ThreadLocal variables in Hive
> -------------------------------------------------
>                 Key: HIVE-584
>                 URL: https://issues.apache.org/jira/browse/HIVE-584
>             Project: Hadoop Hive
>          Issue Type: Improvement
>    Affects Versions: 0.3.0, 0.3.1
>            Reporter: Zheng Shao
> Currently in Hive code there are several global and ThreadLocal variables that need to
> be cleaned.
> Specifically, the following classes are involved:
> 1. HiveConf: contains hive configurations (and a classloader)
> 2. Hive class: contains a static member Hive db. Hive class contains a member HiveConf
> conf, as well as a ThreadLocal storage of IMetaStoreClient.
> 3. SessionState: contains a static ThreadLocal storage of SessionState. SessionState
> class contains a Hive db, a HiveConf conf, a history logger, and a bunch of standard input/output
> 4. CliSessionState: SessionState plus some command options and the command file name.
> 5. All classes that try to get Hive db or HiveConf from global static Hive db, or SessionState.
> There are several problems with the current design. To name a few:
> 1. SessionState instances are ThreadLocal, but SessionState contains Hive db which also
> contains ThreadLocal storage. Not sure a db can be shared across different threads or not?
> What is the global static Hive db?
> 2. We pass HiveConf and Hive db in two ways to classes like Task: Sometimes through initialize(),
> sometimes through SessionState. This complicates the code a lot. It's hard to know which HiveConf
> and which db we should use.
> We need to think about a better way to do it.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.