Date: Fri, 3 Jan 2014 05:29:50 +0000 (UTC)
From: "Jihoon Son (JIRA)"
To: dev@tajo.incubator.apache.org
Subject: [jira] [Commented] (TAJO-472) Umbrella ticket for accelerating query speed through memory cached table

[ 
https://issues.apache.org/jira/browse/TAJO-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861232#comment-13861232 ]

Jihoon Son commented on TAJO-472:
---------------------------------

Min, the intermediate data I meant is the shuffled (repartitioned) data. It is easy to imagine cases where we need to cache the shuffled data instead of the original input table. As you know, the data repartition cost is one of the most important factors in query processing performance. I think we can reduce that cost by caching the repartitioned intermediate data.

Using an md5 match to avoid recomputing the cached results looks reasonable, and I also agree on supporting both manual caching and automatic caching.

Your proposal is very interesting. I'll investigate it in depth.
Thanks!

> Umbrella ticket for accelerating query speed through memory cached table
> ------------------------------------------------------------------------
>
>                 Key: TAJO-472
>                 URL: https://issues.apache.org/jira/browse/TAJO-472
>             Project: Tajo
>          Issue Type: New Feature
>          Components: distributed query plan, physical operator
>            Reporter: Min Zhou
>            Assignee: Min Zhou
>         Attachments: TAJO-472 Proposal.pdf
>
>
> Previously, I was involved as a technical expert in an in-memory database for online businesses at Alibaba Group. That is an internal project, which can do group-by aggregation on billions of rows in less than 1 second.
> I'd like to apply this technology to Tajo to make it much faster than it is. Based on some benchmarks, we believe that Spark & Shark is currently the fastest solution among all the open source interactive query systems, such as Impala, Presto, and Tajo. The main reason is that it benefits from in-memory data.
> I will take memory cached table as my first step to accelerate the query speed of Tajo. Actually, this is the reason why I was concerned with table partitioning during the Xmas and New Year holidays.
> Will submit a proposal soon.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)