airflow-commits mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-2452) Document field_dict for HiveCliHook.load_file must be OrderedDict
Date Tue, 15 May 2018 17:54:00 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16476259#comment-16476259 ]

ASF subversion and git services commented on AIRFLOW-2452:
----------------------------------------------------------

Commit 648b14b4d95bf3aca26e8b54ffe8585b52efc8fd in incubator-airflow's branch refs/heads/master
from [~sekikn]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=648b14b ]

[AIRFLOW-2452] Document field_dict must be OrderedDict

HiveCliHook.load_file has a parameter called
field_dict, which defines name-type pairs for
columns. It must be an OrderedDict so that the
column order is preserved, but this was
undocumented. This PR adds a note about that
and fixes the HiveCliHook.load_df function,
which calls load_file internally.

Closes #3347 from sekikn/AIRFLOW-2452


> Document field_dict for HiveCliHook.load_file must be OrderedDict
> -----------------------------------------------------------------
>
>                 Key: AIRFLOW-2452
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2452
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: docs, Documentation, hive_hooks, hooks
>            Reporter: Kengo Seki
>            Assignee: Kengo Seki
>            Priority: Major
>             Fix For: 2.0.0
>
>
> HiveCliHook.load_file has a parameter called field_dict, which defines name-type pairs for columns. It must be an OrderedDict; otherwise, users can get unexpected results. Example:
> Given the following input file:
> {code}
> $ head /tmp/baby_names.csv
> 1880,John,0.081541,boy
> 1880,William,0.080511,boy
> 1880,James,0.050057,boy
> 1880,Charles,0.045167,boy
> 1880,George,0.043292,boy
> 1880,Frank,0.02738,boy
> 1880,Joseph,0.022229,boy
> 1880,Thomas,0.021401,boy
> 1880,Henry,0.020641,boy
> {code}
> Load the file via HiveCliHook.load_file with field_dict as a normal dict:
> {code}
> In [1]: from airflow.hooks.hive_hooks import HiveCliHook
> In [2]: hook = HiveCliHook()
> [2018-05-10 19:49:31,819] {base_hook.py:85} INFO - Using connection to: localhost
> In [3]: field_dict = {
>    ...:     "year": "INT",
>    ...:     "name": "STRING",
>    ...:     "pct": "DOUBLE",
>    ...:     "sex": "STRING",
>    ...: }
> In [4]: hook.load_file(filepath="/tmp/baby_names.csv", table="baby_names", field_dict=field_dict, recreate=True)
> [2018-05-10 19:51:53,854] {hive_hooks.py:424} INFO - DROP TABLE IF EXISTS baby_names;
> CREATE TABLE IF NOT EXISTS baby_names (
> sex STRING,
>     name STRING,
>     pct DOUBLE,
>     year INT)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS textfile
> ;
> (snip)
> [2018-05-10 19:52:17,965] {hive_hooks.py:232} INFO - Table default.baby_names stats: [numFiles=1, numRows=0, totalSize=1289, rawDataSize=0]
> [2018-05-10 19:52:17,966] {hive_hooks.py:232} INFO - OK
> [2018-05-10 19:52:17,967] {hive_hooks.py:232} INFO - Time taken: 1.349 seconds
> {code}
> The file is loaded, but the columns in the CREATE TABLE statement are out of order, so the loaded data is not read back correctly from Hive. The fourth CSV field (the sex value) lands in the INT column year and comes back as NULL:
> {code}
> hive> SELECT * FROM baby_names LIMIT 10;
> OK
> 1880    John    0.081541        NULL
> 1880    William 0.080511        NULL
> 1880    James   0.050057        NULL
> 1880    Charles 0.045167        NULL
> 1880    George  0.043292        NULL
> 1880    Frank   0.02738 NULL
> 1880    Joseph  0.022229        NULL
> 1880    Thomas  0.021401        NULL
> 1880    Henry   0.020641        NULL
> 1880    Robert  0.020404        NULL
> Time taken: 2.465 seconds, Fetched: 10 row(s)
> {code}
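
The fix described in the report can be sketched as follows: build field_dict as a collections.OrderedDict so that iteration order matches the column order of the input file. Note that render_columns is a hypothetical helper written for this illustration, not part of Airflow; it only imitates joining name-type pairs in iteration order, which is what the CREATE TABLE output in the log suggests. On Python 3.7 and later, plain dicts also preserve insertion order, so the disorder mainly shows up on older interpreters where plain dict ordering is arbitrary.

```python
from collections import OrderedDict

# Column order must match the order of the fields in /tmp/baby_names.csv.
field_dict = OrderedDict([
    ("year", "INT"),
    ("name", "STRING"),
    ("pct", "DOUBLE"),
    ("sex", "STRING"),
])


def render_columns(fields):
    """Hypothetical helper (not an Airflow API): join name-type pairs
    in the mapping's iteration order, one column per line."""
    return ",\n    ".join(
        "{} {}".format(name, col_type) for name, col_type in fields.items()
    )


columns_ddl = render_columns(field_dict)
print(columns_ddl)

# With Airflow installed and a Hive connection configured, the call from the
# report would then be made with the ordered mapping (not executed here):
# hook.load_file(filepath="/tmp/baby_names.csv", table="baby_names",
#                field_dict=field_dict, recreate=True)
```

Passing the ordered mapping keeps year first and sex last, so each CSV field is read into the column of the matching type.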



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
