hive-user mailing list archives

From Qing Yan <qing...@gmail.com>
Subject Re: Hive vs. DryadLINQ
Date Sat, 17 Oct 2009 07:38:07 GMT
Hi Zheng,
    I second the idea of taking Dryad's architecture and applying it to
Hadoop. It would get the best of both worlds. The top part of Hive and the
bottom part of Hadoop can be reused, while the work would be refactoring
the Hadoop M/R layer to support arbitrary operations, expanding Hive's DAG
to cover node-level execution, and finally integrating Hive and Hadoop
together. I think this is the right direction! Does this match your vision,
and how should we proceed? Should we just start a new project, or should
more people in the community be involved in planning and discussing this?


Regards,

Qing

On Fri, Oct 16, 2009 at 12:04 PM, Zheng Shao <zshao9@gmail.com> wrote:

> Hi Qing,
>
> Talking about high-level design and architecture, I think the ideas
> proposed in Hive will help SQL -> DryadLINQ translation as well.
>
> Hive internally translates the SQL query into a DAG plan which should fit
> Dryad - but with the limitation of Hadoop, we have to cut the DAG plan into
> separate map-reduce jobs.
> Also, as a side note, this paper from SOSP 2009:
> http://www.sigops.org/sosp/sosp09/papers/yu-sosp09.pdf (also from
> Microsoft) has the same idea as the hash-based aggregation in Hive.
>
> Nothing is blocking people from implementing the architecture of Hive on
> top of Dryad, and it should be as effective (just remove the last step that
> chops the plan into separate map-reduce jobs). But I do agree we won't be
> able to (and it does not make sense to) share the same code.
>
> So, we can either take the architecture of Hive and implement it on Dryad,
> or take the architecture of Dryad and implement it on Hadoop (NOTE: hadoop
> hdfs and map-reduce are broken apart now which makes it easier than ever to
> do that) and put Hive on top of that, just as Jeff mentioned. I do prefer
> the latter because Hadoop is a much more widely accessible platform for both
> academia and industry.
>
> What do you think? Let us know if you want to start a project on this. It
> looks very interesting to me.
>
> Zheng
>
>
> On Thu, Oct 15, 2009 at 8:43 PM, Qing Yan <qingyan@gmail.com> wrote:
>
>> Hi Jeff,
>>
>> Actually I care less about Dryad's implementation - few people will adopt
>> it today due to its immature and/or proprietary nature. But strictly from
>> the design and architecture perspective, reading through their literature
>> makes one feel Dryad has certain edges over Hadoop/Hive.
>>
>> E.g. Hive treats Hadoop as an execution black box. Say a Hadoop job
>> involves a large dataset: if a partial data error causes the job to fail,
>> there is no easy way for Hive to know what happened, and the whole job needs
>> to be re-run later, whereas in Dryad you get more control and fine-tuning
>> opportunities.
>>
>> About implementing the Dryad model of query execution over HDFS
>> and underneath HiveQL, the question is how much Hive depends on
>> Map/Reduce. It is probably difficult to share the same
>> translator/optimizer between Hadoop & Dryad without sacrificing
>> Dryad's capabilities. We could make Dryad operate only in M/R mode, but why
>> bother :-P
>>
>>
>>
>> Regards
>>
>> Qing
>>
>> On Fri, Oct 16, 2009 at 1:44 AM, Jeff Hammerbacher <hammer@cloudera.com> wrote:
>>
>>> Hey Qing,
>>>
>>> You can download Dryad and see for yourself:
>>> http://connect.microsoft.com/site/sitehome.aspx?SiteID=891. There's no
>>> accompanying distributed file system, unfortunately, and I've never seen a
>>> benchmark of Dryad scaling to more than 300 nodes, so it's not clear that
>>> it's the "right" model for all workloads. There's certainly room for a
>>> richer set of physical operators in the Hadoop project, but the nice thing
>>> about Hadoop and Hive is that it's a full suite of storage, data flow
>>> execution, and a higher-level syntax that works today at scale. If you'd
>>> like to try your hand at an implementation of the Dryad model of query
>>> execution over HDFS and underneath HiveQL, that would certainly be an
>>> interesting project.
>>>
>>> Regards,
>>> Jeff
>>>
>>>
>>> On Thu, Oct 15, 2009 at 12:31 AM, Qing Yan <qingyan@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>>    Has anyone looked into the Microsoft Dryad project?
>>>>
>>>>    Their basic idea is using a DAG (connecting computational "vertices" with
>>>> communication "edges") to model distributed computing flows. And they have
>>>> something called DryadLINQ, which seems to be the Hive equivalent.
>>>>
>>>>      Since the DAG model doesn't distinguish between the inter-job (workflow) and
>>>> intra-job (map/reduce, etc.) layers, their approach of doing query
>>>> translation, workflow/job scheduling, and execution in one box may offer better
>>>> optimization and fine-tuning opportunities compared to the Hadoop/Hive
>>>> combo.
>>>>
>>>>    Also, given that the majority of the hard work will be encapsulated and
>>>> performed by the translation/optimizing layer, the simplicity and
>>>> beauty of Map/Reduce becomes irrelevant or even a hindrance, because
>>>> it doesn't permit more generic and flexible
>>>> operations like Dryad does.
>>>>
>>>>
>>>>   Seems M$ got it right this time, at least on paper :-P ...thoughts?
>>>>
>>>>
>>>>
>>>>  Qing
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
> --
> Yours,
> Zheng
>
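
Zheng's point in the thread, that Hive builds a DAG plan and then has to cut it into separate map-reduce jobs, can be illustrated with a small sketch. This is not Hive's actual planner: the operator names, the `needs_shuffle` flag and the `cut_into_jobs` function are all hypothetical, and the plan is simplified to a linear chain rather than a general DAG. The only assumption carried over is that a single map-reduce job can hold at most one shuffle, so a second shuffling operator forces a new job.

```python
def cut_into_jobs(plan):
    """Split a linear operator chain into jobs holding at most one shuffle each.

    `plan` is a list of (operator_name, needs_shuffle) pairs in execution
    order. Both the representation and the names are illustrative only.
    """
    jobs = []
    current = {"map": [], "reduce": []}
    shuffled = False
    for name, needs_shuffle in plan:
        if needs_shuffle and shuffled:
            # A second shuffle cannot fit in the same job: close it and start anew.
            jobs.append(current)
            current = {"map": [], "reduce": []}
            shuffled = False
        if needs_shuffle:
            shuffled = True
        # Operators before the shuffle run map-side, the rest reduce-side.
        phase = "reduce" if shuffled else "map"
        current[phase].append(name)
    if current["map"] or current["reduce"]:
        jobs.append(current)
    return jobs


# Hypothetical plan for something like: SELECT ... GROUP BY ... ORDER BY ...
plan = [
    ("scan", False),
    ("filter", False),
    ("group_by", True),
    ("select", False),
    ("order_by", True),
]
jobs = cut_into_jobs(plan)
for number, job in enumerate(jobs, start=1):
    print(number, job)
```

In this toy plan the `order_by` lands in a second job whose map side is empty, which stands in for the extra pass over intermediate output that a DAG-native engine would not need to schedule as a separate job.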

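Qing's contrast between re-running a whole job after a partial failure and having finer control over individual steps can be sketched the same way. This is a toy model only, not Dryad's or Hadoop's actual behavior or API: `run_with_vertex_retry`, `make_flaky` and the three vertices are hypothetical, and the failure is simulated. It shows a runner that retries just the failed vertex and keeps the results already computed upstream.

```python
def run_with_vertex_retry(vertices, max_attempts=3):
    """Run vertices in order, retrying only the vertex that failed.

    `vertices` is a list of (name, function) pairs; each function receives
    the dict of results produced so far. Purely illustrative.
    """
    results = {}
    attempts = {}
    for name, fn in vertices:
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            try:
                results[name] = fn(results)
                break
            except RuntimeError:
                # Give up only after the last allowed attempt for this vertex.
                if attempt == max_attempts:
                    raise
    return results, attempts


def make_flaky(fail_times, value_fn):
    """Wrap value_fn so that it raises on its first `fail_times` calls."""
    state = {"left": fail_times}

    def vertex(results):
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("simulated partial data error")
        return value_fn(results)

    return vertex


vertices = [
    ("read", lambda r: [1, 2, 3, 4]),
    ("square", make_flaky(1, lambda r: [x * x for x in r["read"]])),
    ("total", lambda r: sum(r["square"])),
]
results, attempts = run_with_vertex_retry(vertices)
print(results["total"], attempts)
```

Here only the `square` vertex runs twice; `read` is not repeated, which is the kind of control a caller loses when the execution layer is treated as a black box.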