tajo-dev mailing list archives

From Hyunsik Choi <hyun...@apache.org>
Subject Re: [DISCUSS] 0.8.0 release and next roadmap
Date Sat, 05 Apr 2014 15:54:37 GMT
Hi Keuntae,

I missed push transmission. Actually, we scheduled it for 0.8.0,
but we haven't done it yet because of other stability issues.

For those who don't know the background of push transmission, I'd like
to share its motivation. If we implement push transmission, we can
choose either pull or push depending on the context. To the best of my
knowledge, the pull and push approaches each have their own
advantages and disadvantages. Basically, push is more
efficient and faster when the cluster has abundant
resources, but it is likely to be less fault tolerant. Also, push
transmission requires gang scheduling; a stage and its next stage must
be executed simultaneously. In contrast, pull transmission works well
even when the cluster has scarce resources, and it is
easier to make fault tolerant. Unlike push, which requires
gang scheduling, the stages in pull transmission can be scheduled
independently.
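
To make the trade-off concrete, here is a rough sketch in Python of how
such a choice might be made. It is not Tajo code: StagePair,
choose_transmission, and the slot-counting rule are all hypothetical,
and a real planner would weigh many more factors.

```python
from dataclasses import dataclass
from enum import Enum


class Transmission(Enum):
    PULL = "pull"
    PUSH = "push"


@dataclass
class StagePair:
    """Hypothetical description of two adjacent stages in a query plan."""
    producer_tasks: int  # tasks in the upstream stage
    consumer_tasks: int  # tasks in the downstream stage


def choose_transmission(pair, free_slots, need_fault_tolerance):
    """Pick a transmission mode for one producer/consumer stage pair.

    Assumption: push is only viable when both stages can be gang
    scheduled, i.e. every task of both stages gets a slot at once.
    """
    gang_size = pair.producer_tasks + pair.consumer_tasks
    if need_fault_tolerance or free_slots < gang_size:
        # Pull lets the stages be scheduled independently.
        return Transmission.PULL
    return Transmission.PUSH
```

With 4 producer tasks and 2 consumer tasks, this sketch picks push only
when at least 6 slots are free and fault tolerance is not required.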

In sum, the best approach differs from case to case, so in the end
this becomes a planning problem. We expect that
in some cases push transmission will significantly reduce query response times. I
expect that achieving this will take a good amount of work. I'll file
it on the wiki.

HDFS 2.3.0 appears to have started supporting heterogeneous storage
(HDFS-2832). I expect that heterogeneous storage can be
supported through a tablespace concept. Using SSDs for intermediate data
is already possible. You can do it by setting the SSD mount directories in
the property 'tajo.worker.tmpdir.locations'.
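
As an illustration, an entry along these lines in tajo-site.xml might
look as follows. This is only a sketch: the two /mnt/ssd* paths are
made-up mount points, and it assumes the usual Hadoop-style property
format with a comma-separated list of directories.

```xml
<!-- Hypothetical example: point worker temp dirs at SSD mounts -->
<property>
  <name>tajo.worker.tmpdir.locations</name>
  <value>/mnt/ssd1/tajo/tmpdir,/mnt/ssd2/tajo/tmpdir</value>
</property>
```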

TAJO-104 still needs a lot of work.

Best regards,
Hyunsik


On Sat, Apr 5, 2014 at 12:45 AM, ktpark <sirpkt@gmail.com> wrote:
> Hi Hyunsik,
>
> I totally agree with you that the next stage should focus on working with thousands of large
> cluster nodes and many concurrent users.
> With many nodes, there must be abundant resources.
> So, how about considering an implementation of push transmission (TAJO-291)?
>
> And I think Tajo needs to support different types of storage like SSD.
> Some possible uses of SSD are as a cache or as storage for intermediate data.
>
> I’m very interested in JIT Query Compilation and Vectorized Engine (TAJO-104),
> and as far as I know, they are under development for Tajo in C++.
> Do we have any road map to merge those?
>
> On Apr 4, 2014, at 2:24 PM, Hyunsik Choi <hyunsik@apache.org> wrote:
>
>> Hi folks,
>>
>> I'm very happy to see that our community is growing! Also, it's a pleasure
>> to discuss the Tajo 0.8.0 release. Recently, I've tested various features
>> in various contexts, and tried to figure out if there are any critical
>> problems. I think that there are only a few issues and we can release 0.8.0
>> next week. If there are further issues to be solved before the 0.8.0
>> release, feel free to suggest ideas.
>>
>> Also, I'd like to discuss our next roadmap. We are open to any suggestion
>> from users, contributors, and committers. Please fire away!
>>
>> I'm thinking that our next stage should focus on improving the way Tajo
>> runs in thousands of large cluster nodes and for a number of concurrent
>> users. The key issues associated with this include the following:
>>
>> * High availability
>> * Multi-tenancy scheduling
>> * More stability
>> * Improved shuffle
>>
>> The current work status is as follows. Min is working on Tajo's new
>> scheduler (TAJO-540) based on Sparrow. I'll support him. As far as I know,
>> Alvin is working on TajoMaster HA (TAJO-704). Also, several people, including
>> myself, are investigating and solving the issues which occur in large
>> clusters. These issues should be solved in order to make Tajo a complete
>> enterprise-ready product.
>>
>> In addition, there are some SQL feature support issues. Many analytic
>> problems require window functions. Also, in-subquery and scalar subquery
>> should be supported. So, I'd like to schedule them with high priority. In
>> my view, there will be very few SQL support issues if Tajo provides these
>> features.
>>
>> Besides those areas, David is working on a nested schema and its related
>> work (TAJO-710). I guess this will take quite a while because it requires a
>> lot of hard work. So, it would be great to schedule the nested schema
>> loosely. That's just my thoughts, anyhow.
>>
>> Aside from the discussion of our roadmap, I'd like to suggest that we need
>> to release more frequently after the 0.8.0 release. So far, there has been
>> a long period between each release because Tajo is undergoing heavy
>> development. By 'releasing early, releasing often', we will create a
>> tighter feedback loop between users and developers.
>>
>> I think that there are many additional interesting issues to be
>> included in our roadmap. Feel free to suggest your ideas. We will arrange
>> our short-term roadmap and long-term roadmap based on your suggestions.
>>
>> Thank you all so much for your contribution!
>>
>> Warm Regards,
>> Hyunsik
>
