tajo-dev mailing list archives

From "Jihoon Son (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TAJO-472) Umbrella ticket for accelerating query speed through memory cached table
Date Sat, 04 Jan 2014 07:26:50 GMT

    [ https://issues.apache.org/jira/browse/TAJO-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862234#comment-13862234 ]

Jihoon Son edited comment on TAJO-472 at 1/4/14 7:25 AM:
---------------------------------------------------------

Min, thanks for your detailed comments.
The solutions you suggest above for reducing the cost of data shuffling are either already implemented
in Tajo (e.g., replicated join) or planned (e.g., the pull transfer model).
Also, your last comment on early data shuffling is definitely necessary.
We should start on that work as soon as we are ready.

Anyway, I'm very happy that we have pointed out the same problem. I looked into your proposal,
including Infobright's paper, and have a couple of things to discuss. (Actually, I only skimmed
the paper, so please forgive me if I have misunderstood some parts of it.)

First, in your proposal the cached tables are stored on HDFS and partitioned by hash. This
seems to handle the first join case you described above. However, hash partitioning is a
logical layout in Tajo. That is, rows are stored in multiple HDFS directories according to
their hash values, but the underlying data are randomly distributed across HDFS datanodes.
Thus, I think it is hard to pre-partition the cached tables by using the PARTITION BY
HASH clause.
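
To make the distinction concrete, here is a minimal, hypothetical Java sketch (the class, paths,
and key are made up for illustration; this is not Tajo's actual partitioning code) of a
hash-to-directory mapping that is purely logical:

{code:java}
// Why PARTITION BY HASH is only a *logical* layout: the hash decides which HDFS
// directory a row's file lands in, but HDFS still chooses datanodes for the blocks
// of those files independently, via its default block placement policy.
public class HashBucketLayout {

  /** Map a partition key to one of numBuckets directories, e.g. ".../bucket_3". */
  static String bucketDirFor(String tablePath, String partitionKey, int numBuckets) {
    int bucket = Math.floorMod(partitionKey.hashCode(), numBuckets);
    return tablePath + "/bucket_" + bucket;
  }

  public static void main(String[] args) {
    // The row with key "customer_42" always maps to the same directory...
    System.out.println(bucketDirFor("/tajo/warehouse/orders", "customer_42", 8));
    // ...but which datanodes hold the blocks of files under that directory is not
    // determined by the hash, so co-partitioned rows need not be co-located on nodes.
  }
}
{code}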

Second, you use a columnar file format similar to Infobright's. In that format, a table is
horizontally partitioned into _row packs_, and each row pack is then stored as multiple
_data packs_, one per column. In other words, each column of a row pack is stored in a
separate data pack (see the first paragraph of Section 2 of the paper). However, the data
packs of a single row pack might be stored on different datanodes in HDFS. This means that
reading a row might incur network communication to fetch column values stored on multiple
datanodes. To handle this problem, RCFile and Trevni break a table into multiple row groups
and place each row group within a single HDFS block (actually, multiple row groups can fit
in one HDFS block in RCFile). In CIF (1), a split directory (which is a kind of row pack) is
stored as multiple files, one per column, but the files of a split directory are placed on
the same datanode by implementing a new HDFS block placement policy. So, I wonder how you
handle this problem.
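
For clarity, the following is a rough, hypothetical Java sketch of the row-pack/data-pack layout
described above (the class names, pack size, and file naming are my own assumptions, not code from
the proposal or from Infobright). It only shows the per-column file layout; it does not solve the
datanode colocation problem raised here:

{code:java}
import java.util.List;

// A table is cut horizontally into row packs; each column of a row pack is written
// out as its own data pack file. Without a custom HDFS block placement policy (as
// in CIF), the data packs of one row pack may end up on different datanodes.
public class RowPackWriter {

  static final int ROWS_PER_PACK = 65_536; // assumed pack size, for illustration only

  /** rows[i][c] is the value of column c in row i. */
  static void writeTable(String[][] rows, List<String> columnNames, String tableDir) {
    for (int start = 0, pack = 0; start < rows.length; start += ROWS_PER_PACK, pack++) {
      int end = Math.min(start + ROWS_PER_PACK, rows.length);
      for (int c = 0; c < columnNames.size(); c++) {
        // One data pack file per (row pack, column) pair, e.g. .../pack_0/col_price.dat
        String dataPackPath = tableDir + "/pack_" + pack + "/col_" + columnNames.get(c) + ".dat";
        writeColumnValues(dataPackPath, rows, start, end, c);
      }
    }
  }

  static void writeColumnValues(String path, String[][] rows, int start, int end, int col) {
    // Placeholder: a real writer would encode, compress, and write the values to HDFS.
    System.out.printf("writing rows %d..%d of column %d to %s%n", start, end - 1, col, path);
  }
}
{code}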

Third, in your proposal each block (data pack) contains an equal number of rows. I'm wondering
why each block needs to have the same number of rows.

Fourth, what do you mean by idle workers? Are they workers that were not selected to store
the cached table?

Since I have a great interest in your proposal, I have left a long comment.
I'll wait for your answer.

Thanks,
Jihoon

(1) Avrilia Floratou, Jignesh M. Patel, Eugene J. Shekita, and Sandeep Tata. 2011. Column-oriented
storage techniques for MapReduce. Proc. VLDB Endow. 4, 7 (April 2011), 419-429.
http://pages.cs.wisc.edu/~jignesh/publ/colMR.pdf



> Umbrella ticket for accelerating query speed through memory cached table
> ------------------------------------------------------------------------
>
>                 Key: TAJO-472
>                 URL: https://issues.apache.org/jira/browse/TAJO-472
>             Project: Tajo
>          Issue Type: New Feature
>          Components: distributed query plan, physical operator
>            Reporter: Min Zhou
>            Assignee: Min Zhou
>         Attachments: TAJO-472 Proposal.pdf
>
>
> Previously, I was involved as a technical expert in an in-memory database for on-line
> businesses at Alibaba Group. That was an internal project, which can do group-by aggregation
> on billions of rows in less than one second.
> I'd like to apply this technology to Tajo and make it much faster than it is. From some
> benchmarks, we believe that Spark/Shark is currently the fastest solution among the
> open-source interactive query systems such as Impala, Presto, and Tajo. The main reason is
> that it benefits from in-memory data.
> I will take the memory cached table as my first step to accelerate Tajo's query speed.
> Actually, this is the reason why I looked into table partitioning during the Christmas and
> New Year holidays.
> Will submit a proposal soon.
>   



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
