tajo-dev mailing list archives

From Hyunsik Choi <hyun...@apache.org>
Subject Re: [DISCUSS] 0.8.0 release and next roadmap
Date Fri, 04 Apr 2014 13:19:53 GMT
Hi Min,

> I'd like to see tajo can run on a Yarn cluster. This is quite useful for
> sharing data with other distributed systems, like mapreduce, spark.

Yes, I missed Yarn! Thank you for suggesting it. We cannot postpone
Yarn support. In my view, Llama or Slider would be a good candidate
at this time for deploying a Tajo instance in a Yarn cluster. We
need to schedule it in our short-term roadmap. What do you think
about it?

> Besides that, I think basic user authentication like hadoop's
> UserGroupInformation is useful for multi-users sharing a tajo cluster's
> computing capacity.

I agree with this idea. I'll file Yarn and UserGroupInformation under
the multi-tenancy category in our roadmap.

> Seems I added more work to do, can we internally release some sprints?
> After the sprint, we can fire an official release?

Yes, we can make an official release after the sprint. That is what I
intended.

> Regarding to shuffle, do you have any proposal to improve it? Could you
> just drop a few lines to show your opinion here?

The main issue with shuffle is that, as in earlier MR and Spark, too
many small files are created during the shuffle phase. This results in
a lot of random I/O and puts a nontrivial burden on the operating
system. Consequently, it limits scalability and is not efficient. As
you know, the typical solution is to write one consolidated file per
task (sorted and grouped by shuffle key) with a simple index. As far
as I know, MR and Spark do it in this manner. In addition, better OS
cache utilization for intermediate data and smart scheduling between
writing and fetching would help improve the current shuffle approach.
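
To make the consolidated-file idea concrete, here is a minimal sketch
in Python. It is only an illustration under my own assumptions, not
Tajo's (or MR's or Spark's) actual implementation: the names
partition_of, write_consolidated and read_partition are hypothetical,
the record format is a plain tab-separated line, and keys and values
are assumed to contain no tabs or newlines. Each task writes a single
stream, grouped by partition and sorted by key, and keeps a small
index of (offset, length) per partition so a fetcher can read its
range with one seek.

```python
import io
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    # Deterministic hash partitioning (assumption: CRC32 of the key bytes).
    return zlib.crc32(key) % num_partitions


def write_consolidated(records, num_partitions, out):
    """Write all (key, value) byte records of one task into a single stream.

    Records are grouped by partition and sorted by key within each
    partition. Returns the index: partition id -> (offset, length) in bytes.
    """
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_of(key, num_partitions)].append((key, value))

    index = {}
    for pid in range(num_partitions):
        start = out.tell()
        for key, value in sorted(buckets[pid]):
            out.write(key + b"\t" + value + b"\n")
        index[pid] = (start, out.tell() - start)
    return index


def read_partition(inp, index, pid):
    """Fetch one partition with a single seek and one sequential read."""
    offset, length = index[pid]
    inp.seek(offset)
    data = inp.read(length)
    return [tuple(line.split(b"\t", 1)) for line in data.splitlines()]


# Usage: one in-memory "file" stands in for the per-task shuffle file.
records = [(b"b", b"2"), (b"a", b"1"), (b"c", b"3"), (b"a", b"0")]
buf = io.BytesIO()
index = write_consolidated(records, 2, buf)
fetched = {pid: read_partition(buf, index, pid) for pid in range(2)}
```

With this layout a task produces one file plus one small index
regardless of the number of partitions, instead of one file per
partition, and each fetch is a sequential read of a contiguous range.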

Thanks,
Hyunsik

On Fri, Apr 4, 2014 at 2:56 PM, Min Zhou <coderplay@gmail.com> wrote:
> Hi Hyunsik,
>
> I'd like to see tajo can run on a Yarn cluster. This is quite useful for
> sharing data with other distributed systems, like mapreduce, spark.
>
> Besides that, I think basic user authentication like hadoop's
> UserGroupInformation is useful for multi-users sharing a tajo cluster's
> computing capacity.
>
> The above two are both part of multi-tenancy support.
>
> Seems I added more work to do, can we internally release some sprints?
> After the sprint, we can fire an official release?
>
> Regarding to shuffle, do you have any proposal to improve it? Could you
> just drop a few lines to show your opinion here?
>
>
>
>
> Min
>
>
> On Thu, Apr 3, 2014 at 10:24 PM, Hyunsik Choi <hyunsik@apache.org> wrote:
>
>> Hi folks,
>>
>> I'm very happy to see that our community is growing! Also, it's a pleasure
>> to discuss the Tajo 0.8.0 release. Recently, I've tested various features
>> in various contexts, and tried to figure out if there are any critical
>> problems. I think that there are only a few issues and we can release 0.8.0
>> next week. If there are further issues to be solved before the 0.8.0
>> release, feel free to suggest ideas.
>>
>> Also, I'd like to discuss our next roadmap. We are open to any suggestion
>> from users, contributors, and committers. Please fire away!
>>
>> I'm thinking that our next stage should focus on improving the way Tajo
>> runs on large clusters of thousands of nodes and with many concurrent
>> users. The key issues associated with this include the following:
>>
>> * High availability
>> * Multi-tenancy scheduling
>> * More stability
>> * Improved shuffle
>>
>> The current work status is as follows. Min is working on Tajo's new
>> scheduler (TAJO-540) based on sparrow. I'll support him. As far as I know,
>> Alvin is working on TajoMaster HA (TAJO-704). Also, some guys including
>> myself are investigating and solving the issues which occur in large
>> clusters. These issues should be solved in order to make Tajo a complete
>> enterprise-ready product.
>>
>> In addition, there are some SQL feature support issues. Many analytic
>> problems require window functions. Also, in-subquery and scalar subquery
>> should be supported. So, I'd like to schedule them with high priority. In
>> my view, there will be very few SQL support issues if Tajo provides these
>> features.
>>
>> Besides those areas, David is working on a nested schema and its related
>> work (TAJO-710). I guess this will take quite a while because it requires a
>> lot of hard work. So, it would be great to schedule the nested schema
>> loosely. That's just my thoughts, anyhow.
>>
>> Aside from the discussion of our roadmap, I'd like to suggest that we need
>> to release more frequently after the 0.8.0 release. So far, there has been
>> a long period between each release because Tajo is undergoing heavy
>> development. By 'releasing early, releasing often', we will create a
>> tighter feedback loop between users and developers.
>>
>> I think that there are many additional interesting issues to be
>> included in our roadmap. Feel free to suggest your idea. We will arrange
>> our short-term roadmap and long-term roadmap based on your suggestions.
>>
>> Thank you all so much for your contribution!
>>
>> Warm Regards,
>> Hyunsik
>>
>
>
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
