hive-user mailing list archives

From Namit Jain <nj...@fb.com>
Subject Re: Query Optimization in Hive
Date Tue, 01 Feb 2011 06:16:27 GMT
Bharath,

This would be great.

Why don't you write up something about how you are planning to proceed?
File a new JIRA and load some design notes/spec there.
We can definitely sync up from there.


This feature would be very useful to the community. We, at Facebook,
would definitely like to use it.


Thanks,
-namit


On 1/31/11 9:50 PM, "bharath vissapragada"
<bharathvissapragada1990@gmail.com> wrote:

>Hi Ning, Anja,
>
>I am doing my Master's thesis on this topic. I have implemented all
>SQL features like joins, selects, etc. on top of Hadoop (before knowing
>about Hive), and we have derived some basic cost models for join
>reordering which seem to be working fine on some basic scales of TPC-H
>datasets. Later I came to know about Hive, and I am trying to
>implement the same in Hive.
>
>Right now I am in the process of understanding Hive's source, and I am
>almost done with the "ql" package. I think it would be great if you guys
>can help us in this regard. I am a bit confused about the
>implementation of joins, and once I'm done with that, I can modify the
>"joinReorder" of the Optimizer package by using the cost formulae and
>metadata. It would be a great opportunity to work with you guys at FB
>and contribute to Hive.
>
>Thanks
>Bharath,V
>4th year Undergrad, IIIT Hyderabad.
>w: http://research.iiit.ac.in/~bharath.v
>
>On Tue, Feb 1, 2011 at 9:22 AM, Ning Zhang <nzhang@fb.com> wrote:
>> Hi Anja,
>>
>> As you noticed, Hive has only limited support for cost-based
>>optimization. One of the reasons is that Hive used to have a very small
>>number of alternative execution plans to choose from. One exception is
>>mapjoin vs. common join. Liying Tang did some work during his last
>>internship to convert common joins to mapjoins in a rule-based fashion.
>>One of his future work items is to automatically convert common joins to
>>mapjoins based on stats. There is also ongoing work on indexes in Hive.
>>With the support of indexes, CBO will be much needed.
>>
>> In order for a decent CBO to work, we need stats and cost models. There
>>is some work on stats. Table/partition-level stats are already
>>supported. There is a JIRA open for column-level stats (HIVE-1362). The
>>cost model is much more complex in a Hadoop environment and closely
>>dependent on the mapjoin/index implementations. Given all these in place,
>>we can then talk about plan enumeration etc.
>>
>> So yes, we are interested in CBO, but it is a large area and many
>>missing pieces need to be filled in for Hive. If you have a particular
>>interest in some area, you can propose your ideas on the
>>hive-dev@hive.apache.org mailing list or even apply for an internship
>>at FB if you would like to work closely with us.
>>
>> Thanks,
>> Ning
>>
>> On Jan 31, 2011, at 2:04 PM, Anja Gruenheid wrote:
>>
>>> Hi!
>>>
>>> I'm a graduate student from Georgia Tech and I'm working with Hive for
>>>a research project. I am interested in query optimization and the Hive
>>>MetaStore in that context. Working through the documentation and code,
>>>I noticed that the implementation right now is using a rule-based
>>>optimization system. Therefore, I was wondering whether cost-based
>>>query optimization will be a future task in the development of Hive and
>>>if it would be possible for me to cooperate with the developers of Hive
>>>to advance the project in general.
>>>
>>> Best regards,
>>> Anja Gruenheid
>>
>>

