tajo-dev mailing list archives

From "Jihoon Son (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TAJO-472) Umbrella ticket for accelerating query speed through memory cached table
Date Mon, 06 Jan 2014 04:05:51 GMT

    [ https://issues.apache.org/jira/browse/TAJO-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862751#comment-13862751 ]

Jihoon Son commented on TAJO-472:

Min, sorry for my misunderstanding. Thanks to your additional comments, I finally understand
your proposal.
According to your proposal, a cached table is stored on HDFS with hash partitioning for
reliability. Once the table is stored on HDFS, the Tajo master selects a number of workers
to cache the partitioned table in their memory. Thus, the data are already pre-shuffled when
the workers finish downloading the partitioned data from HDFS to their local disks. I think
that this is a good prototype for the data cache.
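To make sure I understand the flow correctly, here is a minimal sketch of it. All names are illustrative assumptions, not Tajo APIs: rows are hash-partitioned by key before being written to HDFS, and each partition is then assigned to a cache worker, so the data arrive pre-shuffled.

```java
import java.util.*;

// Hypothetical sketch of the proposed flow: hash-partition by key, then map
// each partition to one worker so every worker caches a disjoint slice.
public class HashPartitionSketch {

    // Map a join/group-by key to one of `numPartitions` buckets.
    static int partitionOf(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Assign partition p to worker (p mod numWorkers); because the split is
    // deterministic, no shuffle is needed at query time.
    static Map<Integer, List<Integer>> assignPartitions(int numPartitions, int numWorkers) {
        Map<Integer, List<Integer>> assignment = new HashMap<>();
        for (int p = 0; p < numPartitions; p++) {
            assignment.computeIfAbsent(p % numWorkers, w -> new ArrayList<>()).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // The same key always lands in the same partition, hence on the same worker.
        System.out.println(partitionOf("customer_42", 8) == partitionOf("customer_42", 8));
        System.out.println(assignPartitions(8, 4).get(0)); // partitions cached by worker 0
    }
}
```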

It is great that we build indexes on data packs. This is definitely necessary for partition
pruning. However, we also have to consider the performance of the sequential scan. As you
said, an index is useful only when the selectivity is quite low, so the sequential scan
matters in all other cases. Even when data packs contain the same number of rows, their byte
lengths differ according to their contents (or types). This means more file opens and closes
are required during a sequential scan when the size of a value is very small, like a single
byte, which definitely makes the sequential scan slower. How about changing data packs to
have the same byte length instead?
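The suggestion above can be sketched as follows. This is only an illustration under an assumed 1 MiB pack budget (the constant and method names are hypothetical): sizing packs by bytes rather than by row count keeps every pack the same size regardless of the column's value width.

```java
// Minimal sketch of fixed-byte-length data packs: a pack holds as many rows
// as fit in a fixed byte budget, so narrow columns (e.g. byte) do not produce
// tiny files that multiply file opens during a sequential scan.
// Names and the budget are illustrative assumptions, not Tajo code.
public class FixedSizePackSketch {

    static final int PACK_BYTES = 1 << 20; // assumed budget: 1 MiB per data pack

    // Rows that fit in one pack for a fixed-width value type.
    static int rowsPerPack(int valueWidthBytes) {
        return PACK_BYTES / valueWidthBytes;
    }

    // Packs needed to hold rowCount rows of the given width (ceiling division).
    static int packCount(long rowCount, int valueWidthBytes) {
        long rows = rowsPerPack(valueWidthBytes);
        return (int) ((rowCount + rows - 1) / rows);
    }

    public static void main(String[] args) {
        // With fixed-row packs, a 1-byte column would yield packs 8x smaller
        // than an 8-byte column; with a fixed byte budget, every pack is 1 MiB.
        System.out.println(packCount(10_000_000L, 1)); // byte column
        System.out.println(packCount(10_000_000L, 8)); // long column
    }
}
```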

In addition, we need to balance the data distribution to fully utilize the parallelism.
Thus, when the Tajo master selects workers, it should take the data distribution into account
as well as the workers' remaining resources.
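One possible selection policy is sketched below. It is only a greedy heuristic under stated assumptions (the `Worker` fields and `pick` method are hypothetical, not the actual Tajo scheduler): each partition goes to the least-loaded worker that still has enough cache memory, so no single worker becomes a parallelism bottleneck.

```java
import java.util.*;

// Hedged sketch of balanced worker selection: weigh both remaining memory
// and the data a worker already caches when placing the next partition.
public class WorkerSelectionSketch {

    static class Worker {
        final String id;
        long freeMemory;   // remaining cache memory in bytes
        long cachedBytes;  // data already assigned to this worker
        Worker(String id, long freeMemory) { this.id = id; this.freeMemory = freeMemory; }
    }

    // Greedily place the partition on the worker with room and the least
    // cached data, keeping the distribution even across the cluster.
    static String pick(List<Worker> workers, long partitionBytes) {
        Worker best = null;
        for (Worker w : workers) {
            if (w.freeMemory < partitionBytes) continue; // not enough room
            if (best == null || w.cachedBytes < best.cachedBytes) best = w;
        }
        if (best == null) throw new IllegalStateException("no worker fits the partition");
        best.freeMemory -= partitionBytes;
        best.cachedBytes += partitionBytes;
        return best.id;
    }

    public static void main(String[] args) {
        List<Worker> cluster = Arrays.asList(new Worker("w1", 100), new Worker("w2", 100));
        // Successive equal-sized partitions alternate between workers.
        System.out.println(pick(cluster, 30));
        System.out.println(pick(cluster, 30));
        System.out.println(pick(cluster, 30));
    }
}
```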


> Umbrella ticket for accelerating query speed through memory cached table
> ------------------------------------------------------------------------
>                 Key: TAJO-472
>                 URL: https://issues.apache.org/jira/browse/TAJO-472
>             Project: Tajo
>          Issue Type: New Feature
>          Components: distributed query plan, physical operator
>            Reporter: Min Zhou
>            Assignee: Min Zhou
>         Attachments: TAJO-472 Proposal.pdf
> Previously, I was involved as a technical expert in an in-memory database for on-line
businesses at Alibaba Group. It is an internal project that can do group-by aggregation
on billions of rows in less than 1 second.
> I'd like to apply this technology to Tajo and make it much faster than it is. From some
benchmarks, we believe that Spark & Shark is currently the fastest solution among all the
open source interactive query systems, such as Impala, Presto, and Tajo. The main reason is
that it benefits from in-memory data.
> I will take the memory cached table as my first step to accelerate Tajo's query speed.
Actually, this is why I was concerned with table partitioning during Xmas and the New Year.
> Will submit a proposal soon.

This message was sent by Atlassian JIRA
