tajo-dev mailing list archives

From "Jihoon Son (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TAJO-472) Umbrella ticket for accelerating query speed through memory cached table
Date Fri, 03 Jan 2014 05:29:50 GMT

    [ https://issues.apache.org/jira/browse/TAJO-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861232#comment-13861232 ]

Jihoon Son commented on TAJO-472:
---------------------------------

Min,

By intermediate data I meant the shuffled (repartitioned) data. It is easy to imagine
cases where we would want to cache the shuffled data instead of the original input table.
As you know, the cost of repartitioning data is one of the most important factors in query
processing performance. I think we can reduce that cost by caching the repartitioned
intermediate data.

Using an md5 match to avoid recomputing cached results looks reasonable, and I also agree
that we should support both manual and automatic caching.
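
To make the md5 idea concrete, here is a minimal sketch in Python. It is not Tajo code: ResultCache, key_for, get_or_compute, and repartition are all hypothetical names, and the sketch assumes the cache key is the md5 of a textual plan description, which may differ from what the proposal actually hashes.

```python
import hashlib


class ResultCache:
    """Hypothetical in-memory cache keyed by the md5 of a plan description."""

    def __init__(self):
        self._entries = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(plan_text):
        # Naive normalization (assumption): collapse runs of whitespace so
        # that trivially different spellings of the same plan share a key.
        normalized = " ".join(plan_text.split())
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, plan_text, compute):
        # Return the cached result on an md5 match, otherwise compute and store it.
        key = self.key_for(plan_text)
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        result = compute()
        self._entries[key] = result
        return result


def repartition(rows, num_partitions):
    # Toy stand-in for a shuffle: route each (key, value) row by key modulo.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[key % num_partitions].append((key, value))
    return partitions


cache = ResultCache()
rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
first = cache.get_or_compute("repartition t1 by key into 2",
                             lambda: repartition(rows, 2))
# Same plan with different spacing: served from the cache, no second shuffle.
second = cache.get_or_compute("repartition  t1 by key   into 2",
                              lambda: repartition(rows, 2))
```

The second call skips the repartition step entirely, which is the saving I had in mind for shuffled intermediate data. A real implementation would also have to handle eviction and invalidation when the input table changes, which this sketch leaves out.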

Your proposal is very interesting. I'll look into it in depth.
Thanks!

> Umbrella ticket for accelerating query speed through memory cached table
> ------------------------------------------------------------------------
>
>                 Key: TAJO-472
>                 URL: https://issues.apache.org/jira/browse/TAJO-472
>             Project: Tajo
>          Issue Type: New Feature
>          Components: distributed query plan, physical operator
>            Reporter: Min Zhou
>            Assignee: Min Zhou
>         Attachments: TAJO-472 Proposal.pdf
>
>
> Previously, I was involved as a technical expert in an in-memory database for online
> businesses at Alibaba Group. It is an internal project that can run group-by aggregation
> on billions of rows in less than 1 second.
> I'd like to apply this technology to Tajo and make it much faster than it is now. Based on
> some benchmarks, we believe that Spark and Shark are currently the fastest solution among
> the open source interactive query systems, such as Impala, Presto, and Tajo. The main reason
> is that they benefit from in-memory data.
> I will take memory cached tables as my first step toward accelerating Tajo's query speed.
> Actually, this is the reason I was looking into table partitioning during the Christmas and
> New Year holidays.
> I will submit a proposal soon.
>   



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
