Date: Mon, 6 Jan 2014 04:05:51 +0000 (UTC)
From: "Jihoon Son (JIRA)"
To: dev@tajo.incubator.apache.org
Reply-To: dev@tajo.incubator.apache.org
Subject: [jira] [Commented] (TAJO-472) Umbrella ticket for accelerating query speed through memory cached table

[ 
https://issues.apache.org/jira/browse/TAJO-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862751#comment-13862751 ]

Jihoon Son commented on TAJO-472:
---------------------------------

Min, sorry for my misunderstanding. Thanks to your additional comments, I finally understand your proposal.

According to your proposal, a cached table is stored on HDFS with hash partitioning for reliability. Once a table is stored on HDFS, the Tajo master selects a number of workers that cache the partitioned table in their memory. Thus, the data are pre-shuffled by the time the workers finish downloading the partitioned data from HDFS to their local disks.

I think that this is a good prototype for a data cache. It is great that we build indexes on data packs; this is definitely necessary for partition pruning. However, we also have to consider the performance of the sequential scan. As you said, the index is useful only when the selectivity is quite low, and thus the sequential scan is useful in the other cases.

When the data packs have the same number of rows, their byte lengths differ according to their contents (or types). This means that more file opens and closes are required during a sequential scan if the values are very small, such as a single byte. This definitely makes the sequential scan slower. How about changing the data packs to have the same byte length instead?

In addition, we need to balance the data distribution to fully utilize the parallelism. Thus, when the Tajo master selects workers, it should take the data distribution into account as well as the workers' remaining resources.

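To make the data pack point concrete, here is a minimal sketch in Python. It is not Tajo code; the function names, the 65,536-row pack size, and the 512 KiB byte budget are all assumptions chosen for illustration. It compares how many packs (and so roughly how many file opens) a sequential scan would touch under the two sizing policies.

```python
from typing import List, Sequence


def pack_by_row_count(value_sizes: Sequence[int], rows_per_pack: int) -> List[List[int]]:
    """Group values into packs that each hold a fixed number of rows."""
    return [list(value_sizes[i:i + rows_per_pack])
            for i in range(0, len(value_sizes), rows_per_pack)]


def pack_by_byte_length(value_sizes: Sequence[int], bytes_per_pack: int) -> List[List[int]]:
    """Group values into packs that each hold at most bytes_per_pack bytes.

    A single value larger than the budget is placed in a pack of its own.
    """
    packs: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for size in value_sizes:
        # Close the current pack when the next value would exceed the budget.
        if current and current_bytes + size > bytes_per_pack:
            packs.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        packs.append(current)
    return packs


# Hypothetical columns: one of 1-byte values, one of 8-byte values.
ROWS = 1_000_000
byte_column = [1] * ROWS
long_column = [8] * ROWS

ROWS_PER_PACK = 65_536        # assumed row-based pack size
BYTES_PER_PACK = 512 * 1024   # assumed byte-based pack size

for name, column in (("byte", byte_column), ("long", long_column)):
    by_rows = pack_by_row_count(column, ROWS_PER_PACK)
    by_bytes = pack_by_byte_length(column, BYTES_PER_PACK)
    print(f"{name}: {len(by_rows)} packs by rows, {len(by_bytes)} packs by bytes")
```

With these assumed sizes, the 8-byte column produces the same number of packs under both policies, while the 1-byte column collapses from 16 small packs to 2 when packs are sized by bytes. That is the effect on file opens and closes described above.
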
Thanks,
Jihoon

> Umbrella ticket for accelerating query speed through memory cached table
> ------------------------------------------------------------------------
>
>                 Key: TAJO-472
>                 URL: https://issues.apache.org/jira/browse/TAJO-472
>             Project: Tajo
>          Issue Type: New Feature
>          Components: distributed query plan, physical operator
>            Reporter: Min Zhou
>            Assignee: Min Zhou
>         Attachments: TAJO-472 Proposal.pdf
>
>
> Previously, I was involved as a technical expert in an in-memory database for online businesses at Alibaba Group. It is an internal project that can do group-by aggregation on billions of rows in less than 1 second.
> I'd like to apply this technology to Tajo to make it much faster than it is. From some benchmarks, we believe that Spark and Shark are currently the fastest solution among all the open source interactive query systems, such as Impala, Presto, and Tajo. The main reason is that they benefit from in-memory data.
> I will take the memory cached table as my first step to accelerate the query speed of Tajo. Actually, this is the reason why I was concerned with table partitioning during the Xmas and New Year holidays.
> Will submit a proposal soon.
>

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)