tajo-dev mailing list archives

From "Jihoon Son (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TAJO-472) Umbrella ticket for accelerating query speed through memory cached table
Date Fri, 03 Jan 2014 02:11:51 GMT

    [ https://issues.apache.org/jira/browse/TAJO-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861100#comment-13861100 ]

Jihoon Son commented on TAJO-472:

Min, it's a good idea, but I have some questions.
Actually, I also considered applying the cache approach of Spark/Shark to Tajo. In those systems,
users must specify the data to be cached. This approach is useful for Spark, because the
query execution plan is written by the users. That is, users can cache any data, including
intermediate data as well as the input tables.
However, as far as I know, only tables can be cached in Shark. This means that we would lose the
chance to cache intermediate data. So my first question is whether you have any solution
to this limitation.

My second question is whether an automatic cache mechanism is possible. The cache approach
of Spark/Shark requires users to have a deep understanding of the query execution plan.
I think this is a critical limitation of their approach, because users might not have
any background in query processing.

In my opinion, the best solution is for Tajo to automatically gather information about
frequently used data, including the input tables and intermediate data, and cache them into

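To make the automatic approach more concrete, here is a minimal sketch of a frequency-based
cache. Nothing in it is Tajo code: the class FrequencyCache, its threshold and capacity
parameters, and the least-frequently-used eviction rule are all assumptions made for
illustration. A dataset id could be a table name or, for intermediate data, some fingerprint
of the sub-plan that produces it.

```python
from collections import Counter


class FrequencyCache:
    """Hypothetical sketch: cache a dataset once it has been read `threshold` times."""

    def __init__(self, threshold=2, capacity=2):
        self.threshold = threshold  # reads required before caching (assumed policy)
        self.capacity = capacity    # maximum number of cached datasets (assumed)
        self.hits = Counter()       # access count per dataset id
        self.cache = {}             # dataset id -> materialized rows

    def read(self, dataset_id, load):
        """Return the rows for dataset_id, calling load() only on a cache miss."""
        self.hits[dataset_id] += 1
        if dataset_id in self.cache:
            return self.cache[dataset_id]
        rows = load()
        if self.hits[dataset_id] >= self.threshold:
            self._admit(dataset_id, rows)
        return rows

    def _admit(self, dataset_id, rows):
        # When the cache is full, evict the entry with the fewest recorded reads.
        if len(self.cache) >= self.capacity:
            coldest = min(self.cache, key=lambda k: self.hits[k])
            del self.cache[coldest]
        self.cache[dataset_id] = rows
```

With a scheme like this, users never name the data to cache; the system decides from the
observed access counts, which is the point of the second question above.
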
I'll wait for your reply.

> Umbrella ticket for accelerating query speed through memory cached table
> ------------------------------------------------------------------------
>                 Key: TAJO-472
>                 URL: https://issues.apache.org/jira/browse/TAJO-472
>             Project: Tajo
>          Issue Type: New Feature
>          Components: distributed query plan, physical operator
>            Reporter: Min Zhou
>            Assignee: Min Zhou
> Previously, I was involved as a technical expert in an in-memory database for online
> businesses at Alibaba Group. It is an internal project that can do group-by aggregation
> on billions of rows in less than 1 second.
> I'd like to apply this technology to Tajo to make it much faster than it is. From some
> benchmarks, we believe that Spark and Shark are currently the fastest solution among all the
> open source interactive query systems, such as Impala, Presto, and Tajo. The main reason is that
> they benefit from in-memory data.
> I will take memory cached tables as my first step to accelerate the query speed of Tajo.
> Actually, this is why I was concerned with table partitioning during Christmas and New Year.
> Will submit a proposal soon.

This message was sent by Atlassian JIRA
