tajo-dev mailing list archives

From "Min Zhou (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TAJO-472) Umbrella ticket for accelerating query speed through memory cached table
Date Fri, 03 Jan 2014 03:37:53 GMT

    [ https://issues.apache.org/jira/browse/TAJO-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861144#comment-13861144
] 

Min Zhou edited comment on TAJO-472 at 1/3/14 3:36 AM:
-------------------------------------------------------

Hi Jihoon Son,

If you create a table whose name has the postfix "cache", Shark will load the table
into an RDD in memory. I don't think Spark can cache intermediate data.

I'm not sure what you mean by intermediate data. Is it a temporary table or the shuffle
data? If we don't support sub-queries, I think intermediate data is not very common. Reusing
a cached table is straightforward; the problem is how to determine that a subquery does the same
job as another query or one of its subqueries. Actually, at my previous company I built something
on Hive much like a cached table :)  We stored the intermediate tables produced by the subqueries
of a SQL statement and computed an md5sum over each subquery's serialized plan; the md5 value was
then stored in the metadata. The subqueries of a subsequent query are hashed the same way, and if
an md5 matches one of the values in the metadata, we simply load the intermediate data instead of
recomputing it.
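The fingerprinting scheme above could be sketched roughly as follows. This is a minimal illustration, not Hive or Tajo code; the class name, the in-memory metadata map, and the result paths are all hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: fingerprint a subquery's serialized plan with MD5
// and reuse a previously materialized result when the fingerprint matches.
public class PlanFingerprintCache {
    // Maps a plan's MD5 digest to the location of its materialized result.
    // (A real system would keep this in the catalog/metadata store.)
    private final Map<String, String> metadata = new HashMap<>();

    static String md5Hex(String serializedPlan) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(serializedPlan.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    // Returns the cached result location, or null if the subquery must be recomputed.
    public String lookup(String serializedPlan) {
        return metadata.get(md5Hex(serializedPlan));
    }

    // Records where a subquery's intermediate table was stored.
    public void store(String serializedPlan, String resultPath) {
        metadata.put(md5Hex(serializedPlan), resultPath);
    }

    public static void main(String[] args) {
        PlanFingerprintCache cache = new PlanFingerprintCache();
        String plan = "SELECT dept, COUNT(*) FROM emp GROUP BY dept";
        cache.store(plan, "/tmp/intermediate/t1");
        System.out.println(cache.lookup(plan));               // cache hit
        System.out.println(cache.lookup("SELECT * FROM emp")); // miss -> null
    }
}
```

In practice the digest would be computed over the canonical serialized plan rather than the SQL text, so that equivalent subqueries written differently still collide on the same fingerprint.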

Currently, I am thinking about manually caching tables, the way Shark does. Holding
a histogram of the data is a good approach, but it needs some effort. AFAIK, there
is a role that can decide which tables should be cached; this role would be the cluster
administrator or the data warehouse architect. Sometimes this works better in the real world
than automation, because those people always know which tables are hot and which tables are
critical for their business and thus need faster access, while automation can't guarantee the
SLA. Ideally, I'd like to support both ways and make it an option.

   



> Umbrella ticket for accelerating query speed through memory cached table
> ------------------------------------------------------------------------
>
>                 Key: TAJO-472
>                 URL: https://issues.apache.org/jira/browse/TAJO-472
>             Project: Tajo
>          Issue Type: New Feature
>          Components: distributed query plan, physical operator
>            Reporter: Min Zhou
>            Assignee: Min Zhou
>         Attachments: TAJO-472 Proposal.pdf
>
>
> Previously, I was involved as a technical expert in an in-memory database for on-line businesses at Alibaba Group. It is an internal project that can do group-by aggregation on billions of rows in less than 1 second.
> I'd like to apply this technology to Tajo and make it much faster than it is. From some benchmarks, we believe that Spark & Shark is currently the fastest solution among the open-source interactive query systems, such as Impala, Presto, and Tajo. The main reason is that it benefits from in-memory data.
> I will take memory-cached tables as my first step to accelerate Tajo's query speed. Actually, this is why I focused on table partitioning during the Xmas and New Year holidays.
> Will submit a proposal soon.
>   



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
