tajo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Min Zhou (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TAJO-283) Add Table Partitioning
Date Mon, 23 Dec 2013 18:25:54 GMT

    [ https://issues.apache.org/jira/browse/TAJO-283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855821#comment-13855821

Min Zhou commented on TAJO-283:

That's absolutely a good question!
I have thought about this problem. Firstly, we should figure out how large the number of partitions
is acceptable. From my experience, MySQL works well if we insert thousands of rows in a time,
even tens of thousands are still acceptable. But if the order of magnitude grows to hundreds
of thousands , even millions or more, MySQL would be very slow when inserting&retrieving
those records. 
When we are using HASH partition,  since we can defined the buckets number of hash function,
I think the number is under control. Normally it should be tens or hundreds . For RANGE and
LIST partition,  it works as well due to the partitions is enumerable. The worst situation
I think is when we are using  COLUMN partitions on a table,  which is quite similar with hive's
dynamic partition list below.
CREATE TABLE dst_tbl (key int, value string) PARTITIONED BY (col1 string, col2 it) AS
SELECT key, value,  col1, col2 FROM src_tbl
Query users always have no knowledge about this table's  value distribution. If the table
is with high cardinality (a.k.a with so many distinct values), that should be a disaster for
the below area
1. The number of files/directories on hdfs would be very large, big pressure for HDFS namenode's
2. As you mentioned, this would be a big problem for catalog.

Acutally, due to the above reasons. In Alibaba.com, my previous employer, which has one of
the largest single hadoop cluster in the world, we disabled dynamic partitioning.  I think
 you should run into the same problem when you are using column partitioning.  I don't know
why you guys decide to support such feature, could you give me some background about it? How
can we benefit from column partitions?

[~hyunsik] [~jihoonson] 
Thank you. Merry Christmats!


> Add Table Partitioning
> ----------------------
>                 Key: TAJO-283
>                 URL: https://issues.apache.org/jira/browse/TAJO-283
>             Project: Tajo
>          Issue Type: New Feature
>          Components: catalog, physical operator, planner/optimizer
>            Reporter: Hyunsik Choi
>            Assignee: Hyunsik Choi
>             Fix For: 0.8-incubating
> Table partitioning gives many facilities to maintain large tables. First of all, it enables
the data management system to prune many input data which are actually not necessary. In addition,
it gives the system more optimization  opportunities  that exploit the physical layouts.
> Basically, Tajo should follow the RDBMS-style partitioning system, including range, list,
hash, and so on. In order to keep Hive compatibility, we need to add Hive partition type that
does not exists in existing DBMS systems.

This message was sent by Atlassian JIRA

View raw message