hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Markovitz, Dudu" <dmarkov...@paypal.com>
Subject RE: External partitoned table based on yyyy/mm/dd HDFS structure
Date Fri, 03 Jun 2016 09:22:06 GMT
1.
If the directory name is in the format of key=val
The partition column name should be key

e.g.
/user/igor/data/dt=2016-06-02
create external table t (i int) partitioned by (dt date) location '/user/igor/data/';

2.
I would have used msck repair table t for ad-hoc operations.
It scans the whole table HDFS tree and if you have a lot of directories it might be costly.
I would suggest to add “Alter table t add partition …” to the process that creates the
new directories and adds the data.

3.
Partitioning:


·         Metadata performance wise, you should strive to create the minimum number of partitions.



·         Query performance wise, you should strive to partitions` granularity that matches
your common queries

o   If you usually select whole years, create a yearly partitions

o   If you usually select whole months, create monthly partitions

o   If you usually select few days, create daily partitions



In addition, partitions should be big enough to have a performance advantage.

Don’t partition small tables.



·         Maintenance performance wise, your partitions should be small enough to be handled
by operations such as

ALTER TABLE table_name [PARTITION partition_spec] SET FILEFORMAT file_format;

in a reasonable time.



Common practice is a daily partition
~365 days * 10 years = 3,650 partitions, which is O.K.
Try not to generate more than few thousands partitions




From: Igor Kravzov [mailto:igork.inexso@gmail.com]
Sent: Thursday, June 02, 2016 5:55 PM
To: user@hive.apache.org
Subject: Re: External partitoned table based on yyyy/mm/dd HDFS structure

Thanks Dudu for the great explanation.
I was doing some reading and thinking instead of complicated hierarchical structure to have
flat one.
Like

/user/igor/data/date=2016-06-02
create external table t (i int) partitioned by (yyyymmdd date) location '/user/igor/data/';
 or
 /user/igor/date=20160602
create external table t (i int) partitioned by (yyyymmdd int) location '/user/igor/data/';
Will it work?

Also I will need to schedule msck repair table t; if I want partitions automatically picked
up. Hive does not have this feature. Correct?

What is the optimal directory size for a partition? Is about 2GB OK?


On Wed, Jun 1, 2016 at 4:38 PM, Markovitz, Dudu <dmarkovitz@paypal.com<mailto:dmarkovitz@paypal.com>>
wrote:
The short answer:
In this naming convention it will require to specifically define each partition.
If the naming convention was yyyy=2016/mm=11/dd=28 instead of 2016/11/28 it would have been
straight forward.

Dudu

+ The long answer:

------------------------------------------------------------------------------------------------------------------------
-- bash
------------------------------------------------------------------------------------------------------------------------

mkdir t
mkdir t/2015
mkdir t/2015/01
mkdir t/2015/01/22
mkdir t/2015/01/23
mkdir t/2015/02
mkdir t/2015/02/17
mkdir t/2015/03
mkdir t/2015/03/04
mkdir t/2015/03/05
mkdir t/2015/03/06
mkdir t/2016
mkdir t/2016/10
mkdir t/2016/10/01
mkdir t/2016/10/02
mkdir t/2016/10/03
mkdir t/2016/11
mkdir t/2016/11/27
mkdir t/2016/11/28


echo -e "1\n2\n3"               > t/2015/01/22/data.txt
echo -e "4"                     > t/2015/01/23/data.txt
echo -e "5\n6"                  > t/2015/02/17/data.txt
echo -e "7\n8\n9"               > t/2015/03/04/data.txt
echo -e "10"                    > t/2015/03/05/data.txt
echo -e "11\n12"                > t/2015/03/06/data.txt
echo -e "13\n14"                > t/2016/10/01/data.txt
echo -e "15\n16\n17\n18\n19"    > t/2016/10/02/data.txt
echo -e "20\n21"                > t/2016/10/03/data.txt
echo -e "22"                    > t/2016/11/27/data.txt
echo -e "23\n24\n25"            > t/2016/11/28/data.txt

hdfs dfs -put t /user/dmarkovitz/t


t
├── 2015
│   ├── 01
│   │   ├── 22
│   │   │   └── data.txt
│   │   └── 23
│   │       └── data.txt
│   ├── 02
│   │   └── 17
│   │       └── data.txt
│   └── 03
│       ├── 04
│       │   └── data.txt
│       ├── 05
│       │   └── data.txt
│       └── 06
│           └── data.txt
└── 2016
    ├── 10
    │   ├── 01
    │   │   └── data.txt
    │   ├── 02
    │   │   └── data.txt
    │   └── 03
    │       └── data.txt
    └── 11
        ├── 27
        │   └── data.txt
        └── 28
            └── data.txt


------------------------------------------------------------------------------------------------------------------------
-- hive
------------------------------------------------------------------------------------------------------------------------

set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;

------------------------------------------------------------------------------------------------------------------------
-- t1:     no partitions
------------------------------------------------------------------------------------------------------------------------

create external table t1 (i int) location '/user/dmarkovitz/t';

select i,input__file__name from t1;

i    input__file__name
1    hdfs://horton/user/dmarkovitz/t/2015/01/22/data.txt
2    hdfs://horton/user/dmarkovitz/t/2015/01/22/data.txt
3    hdfs://horton/user/dmarkovitz/t/2015/01/22/data.txt
4    hdfs://horton/user/dmarkovitz/t/2015/01/23/data.txt
5    hdfs://horton/user/dmarkovitz/t/2015/02/17/data.txt
6    hdfs://horton/user/dmarkovitz/t/2015/02/17/data.txt
7    hdfs://horton/user/dmarkovitz/t/2015/03/04/data.txt
8    hdfs://horton/user/dmarkovitz/t/2015/03/04/data.txt
9    hdfs://horton/user/dmarkovitz/t/2015/03/04/data.txt
10   hdfs://horton/user/dmarkovitz/t/2015/03/05/data.txt
11   hdfs://horton/user/dmarkovitz/t/2015/03/06/data.txt
12   hdfs://horton/user/dmarkovitz/t/2015/03/06/data.txt
13   hdfs://horton/user/dmarkovitz/t/2016/10/01/data.txt
14   hdfs://horton/user/dmarkovitz/t/2016/10/01/data.txt
15   hdfs://horton/user/dmarkovitz/t/2016/10/02/data.txt
16   hdfs://horton/user/dmarkovitz/t/2016/10/02/data.txt
17   hdfs://horton/user/dmarkovitz/t/2016/10/02/data.txt
18   hdfs://horton/user/dmarkovitz/t/2016/10/02/data.txt
19   hdfs://horton/user/dmarkovitz/t/2016/10/02/data.txt
20   hdfs://horton/user/dmarkovitz/t/2016/10/03/data.txt
21   hdfs://horton/user/dmarkovitz/t/2016/10/03/data.txt
22   hdfs://horton/user/dmarkovitz/t/2016/11/27/data.txt
23   hdfs://horton/user/dmarkovitz/t/2016/11/28/data.txt
24   hdfs://horton/user/dmarkovitz/t/2016/11/28/data.txt
25   hdfs://horton/user/dmarkovitz/t/2016/11/28/data.txt

------------------------------------------------------------------------------------------------------------------------
-- t2:     3 partition columns (yyyy smallint,mm tinyint,dd tinyint)
------------------------------------------------------------------------------------------------------------------------

create external table t2 (i int) partitioned by (yyyy smallint,mm tinyint,dd tinyint) location
'/user/dmarkovitz/t';

alter table t2 add if not exists partition (yyyy=2015,mm=01,dd=22) location '/user/dmarkovitz/t/2015/01/22';
alter table t2 add if not exists partition (yyyy=2015,mm=01,dd=23) location '/user/dmarkovitz/t/2015/01/23';
alter table t2 add if not exists partition (yyyy=2015,mm=02,dd=17) location '/user/dmarkovitz/t/2015/02/17';
alter table t2 add if not exists partition (yyyy=2015,mm=03,dd=04) location '/user/dmarkovitz/t/2015/03/04';
alter table t2 add if not exists partition (yyyy=2015,mm=03,dd=05) location '/user/dmarkovitz/t/2015/03/05';
alter table t2 add if not exists partition (yyyy=2015,mm=03,dd=06) location '/user/dmarkovitz/t/2015/03/06';
alter table t2 add if not exists partition (yyyy=2016,mm=10,dd=01) location '/user/dmarkovitz/t/2016/10/01';
alter table t2 add if not exists partition (yyyy=2016,mm=10,dd=02) location '/user/dmarkovitz/t/2016/10/02';
alter table t2 add if not exists partition (yyyy=2016,mm=10,dd=03) location '/user/dmarkovitz/t/2016/10/03';
alter table t2 add if not exists partition (yyyy=2016,mm=11,dd=27) location '/user/dmarkovitz/t/2016/11/27';
alter table t2 add if not exists partition (yyyy=2016,mm=11,dd=28) location '/user/dmarkovitz/t/2016/11/28';


select i,input__file__name,yyyy,mm,dd from t2 where yyyy=2016 and mm=10;

i    input__file__name    yyyy mm   dd
13   hdfs://horton/user/dmarkovitz/t/2016/10/01/data.txt  2016 10   1
14   hdfs://horton/user/dmarkovitz/t/2016/10/01/data.txt  2016 10   1
15   hdfs://horton/user/dmarkovitz/t/2016/10/02/data.txt  2016 10   2
16   hdfs://horton/user/dmarkovitz/t/2016/10/02/data.txt  2016 10   2
17   hdfs://horton/user/dmarkovitz/t/2016/10/02/data.txt  2016 10   2
18   hdfs://horton/user/dmarkovitz/t/2016/10/02/data.txt  2016 10   2
19   hdfs://horton/user/dmarkovitz/t/2016/10/02/data.txt  2016 10   2
20   hdfs://horton/user/dmarkovitz/t/2016/10/03/data.txt  2016 10   3
21   hdfs://horton/user/dmarkovitz/t/2016/10/03/data.txt  2016 10   3

------------------------------------------------------------------------------------------------------------------------
-- t3:     1 partition column (yyyymmdd date)
------------------------------------------------------------------------------------------------------------------------

create external table t3 (i int) partitioned by (yyyymmdd date) location '/user/dmarkovitz/t';

alter table t3 add if not exists partition (yyyymmdd=date '2015-01-22') location '/user/dmarkovitz/t/2015/01/22';
alter table t3 add if not exists partition (yyyymmdd=date '2015-01-23') location '/user/dmarkovitz/t/2015/01/23';
alter table t3 add if not exists partition (yyyymmdd=date '2015-02-17') location '/user/dmarkovitz/t/2015/02/17';
alter table t3 add if not exists partition (yyyymmdd=date '2015-03-04') location '/user/dmarkovitz/t/2015/03/04';
alter table t3 add if not exists partition (yyyymmdd=date '2015-03-05') location '/user/dmarkovitz/t/2015/03/05';
alter table t3 add if not exists partition (yyyymmdd=date '2015-03-06') location '/user/dmarkovitz/t/2015/03/06';
alter table t3 add if not exists partition (yyyymmdd=date '2016-10-01') location '/user/dmarkovitz/t/2016/10/01';
alter table t3 add if not exists partition (yyyymmdd=date '2016-10-02') location '/user/dmarkovitz/t/2016/10/02';
alter table t3 add if not exists partition (yyyymmdd=date '2016-10-03') location '/user/dmarkovitz/t/2016/10/03';
alter table t3 add if not exists partition (yyyymmdd=date '2016-11-27') location '/user/dmarkovitz/t/2016/11/27';
alter table t3 add if not exists partition (yyyymmdd=date '2016-11-28') location '/user/dmarkovitz/t/2016/11/28';


select i,input__file__name,yyyymmdd from t3 where yyyymmdd between date '2016-10-01' and date
'2016-10-30';

i    input__file__name    yyyymmdd
13   hdfs://horton/user/dmarkovitz/t/2016/10/01/data.txt  2016-10-01
14   hdfs://horton/user/dmarkovitz/t/2016/10/01/data.txt  2016-10-01
15   hdfs://horton/user/dmarkovitz/t/2016/10/02/data.txt  2016-10-02
16   hdfs://horton/user/dmarkovitz/t/2016/10/02/data.txt  2016-10-02
17   hdfs://horton/user/dmarkovitz/t/2016/10/02/data.txt  2016-10-02
18   hdfs://horton/user/dmarkovitz/t/2016/10/02/data.txt  2016-10-02
19   hdfs://horton/user/dmarkovitz/t/2016/10/02/data.txt  2016-10-02
20   hdfs://horton/user/dmarkovitz/t/2016/10/03/data.txt  2016-10-03
21   hdfs://horton/user/dmarkovitz/t/2016/10/03/data.txt  2016-10-03

------------------------------------------------------------------------------------------------------------------------
-- Automatic partitions handling
------------------------------------------------------------------------------------------------------------------------

There is a specific directories naming convention that is supported by Hive partitioning which
is {partition column}={partition value}

hdfs dfs -mv /user/dmarkovitz/t/2015/01/22  /user/dmarkovitz/t/2015/01/dd=22
hdfs dfs -mv /user/dmarkovitz/t/2015/01/23  /user/dmarkovitz/t/2015/01/dd=23
hdfs dfs -mv /user/dmarkovitz/t/2015/02/17  /user/dmarkovitz/t/2015/02/dd=17
hdfs dfs -mv /user/dmarkovitz/t/2015/03/04  /user/dmarkovitz/t/2015/03/dd=04
hdfs dfs -mv /user/dmarkovitz/t/2015/03/05  /user/dmarkovitz/t/2015/03/dd=05
hdfs dfs -mv /user/dmarkovitz/t/2015/03/06  /user/dmarkovitz/t/2015/03/dd=06
hdfs dfs -mv /user/dmarkovitz/t/2016/10/01  /user/dmarkovitz/t/2016/10/dd=01
hdfs dfs -mv /user/dmarkovitz/t/2016/10/02  /user/dmarkovitz/t/2016/10/dd=02
hdfs dfs -mv /user/dmarkovitz/t/2016/10/03  /user/dmarkovitz/t/2016/10/dd=03
hdfs dfs -mv /user/dmarkovitz/t/2016/11/27  /user/dmarkovitz/t/2016/11/dd=27
hdfs dfs -mv /user/dmarkovitz/t/2016/11/28  /user/dmarkovitz/t/2016/11/dd=28
hdfs dfs -mv /user/dmarkovitz/t/2015/01     /user/dmarkovitz/t/2015/mm=01
hdfs dfs -mv /user/dmarkovitz/t/2015/02     /user/dmarkovitz/t/2015/mm=02
hdfs dfs -mv /user/dmarkovitz/t/2015/03     /user/dmarkovitz/t/2015/mm=03
hdfs dfs -mv /user/dmarkovitz/t/2016/10     /user/dmarkovitz/t/2016/mm=10
hdfs dfs -mv /user/dmarkovitz/t/2016/11     /user/dmarkovitz/t/2016/mm=11
hdfs dfs -mv /user/dmarkovitz/t/2015        /user/dmarkovitz/t/yyyy=2015
hdfs dfs -mv /user/dmarkovitz/t/2016        /user/dmarkovitz/t/yyyy=2016

t
├── yyyy=2015
│   ├── mm=01
│   │   ├── dd=22
│   │   │   └── data.txt
│   │   └── dd=23
│   │       └── data.txt
│   ├── mm=02
│   │   └── dd=17
│   │       └── data.txt
│   └── mm=03
│       ├── dd=04
│       │   └── data.txt
│       ├── dd=05
│       │   └── data.txt
│       └── dd=06
│           └── data.txt
└── yyyy=2016
    ├── mm=10
    │   ├── dd=01
    │   │   └── data.txt
    │   ├── dd=02
    │   │   └── data.txt
    │   └── dd=03
    │       └── data.txt
    └── mm=11
        ├── dd=27
        │   └── data.txt
        └── dd=28
            └── data.txt


create external table t (i int) partitioned by (yyyy smallint,mm tinyint,dd tinyint) location
'/user/dmarkovitz/t';
msck repair table t;

select i,input__file__name,yyyy,mm,dd from t where yyyy >= 2016 and mm >= 10;

i    input__file__name    yyyy mm   dd
13   hdfs://horton/user/dmarkovitz/t/yyyy=2016/mm=10/dd=01/data.txt  2016 10     1
14   hdfs://horton/user/dmarkovitz/t/yyyy=2016/mm=10/dd=01/data.txt  2016 10     1
15   hdfs://horton/user/dmarkovitz/t/yyyy=2016/mm=10/dd=02/data.txt  2016 10     2
16   hdfs://horton/user/dmarkovitz/t/yyyy=2016/mm=10/dd=02/data.txt  2016 10     2
17   hdfs://horton/user/dmarkovitz/t/yyyy=2016/mm=10/dd=02/data.txt  2016 10     2
18   hdfs://horton/user/dmarkovitz/t/yyyy=2016/mm=10/dd=02/data.txt  2016 10     2
19   hdfs://horton/user/dmarkovitz/t/yyyy=2016/mm=10/dd=02/data.txt  2016 10     2
20   hdfs://horton/user/dmarkovitz/t/yyyy=2016/mm=10/dd=03/data.txt  2016 10     3
21   hdfs://horton/user/dmarkovitz/t/yyyy=2016/mm=10/dd=03/data.txt  2016 10     3
22   hdfs://horton/user/dmarkovitz/t/yyyy=2016/mm=11/dd=27/data.txt  2016 11     27
23   hdfs://horton/user/dmarkovitz/t/yyyy=2016/mm=11/dd=28/data.txt  2016 11     28
24   hdfs://horton/user/dmarkovitz/t/yyyy=2016/mm=11/dd=28/data.txt  2016 11     28
25   hdfs://horton/user/dmarkovitz/t/yyyy=2016/mm=11/dd=28/data.txt  2016 11     28


From: Igor Kravzov [mailto:igork.inexso@gmail.com<mailto:igork.inexso@gmail.com>]
Sent: Wednesday, June 01, 2016 6:03 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: External partitoned table based on yyyy/mm/dd HDFS structure

Is it possible to create?
What would be the table definition in this case?

Mime
View raw message