hive-user mailing list archives

From Daniel Takacs <hungarian...@hotmail.com>
Subject insert data into hadoop / hive cluster
Date Thu, 27 Dec 2018 04:54:53 GMT
I'm working on an ETL that requires me to import a continuous stream of CSVs into a Hadoop/Hive
cluster. For now, let's assume all of the CSVs need to end up in the same database.table, but
newer CSVs may introduce additional columns (hence I want the script to ALTER the table and add
the new columns as it encounters them).

e.g.

csv1.csv:
a,b
1,2
2,4

csv2.csv:
a,b,c
3,8,0
4,10,2

What is the best way to write such an ETL into Hive? Should I use hive -f to run scripts
like:

upsert.hql:

-- skip.header.line.count keeps the CSV header row (e.g. "a,b") out of the data
CREATE TABLE IF NOT EXISTS mydbname.testtable (a INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  TBLPROPERTIES ('skip.header.line.count'='1');

-- tolerate the error when column b already exists from an earlier run
SET hive.cli.errors.ignore=true;
ALTER TABLE mydbname.testtable ADD COLUMNS (b STRING);
SET hive.cli.errors.ignore=false;

LOAD DATA LOCAL INPATH '/home/pathtodata/testdata.csv' INTO TABLE mydbname.testtable;
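
(I would invoke this as hive -f upsert.hql for each incoming CSV.)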



(The disadvantage is that when a loaded value can't be parsed for an integer column, Hive
silently turns it into NULL and I never get notified.)
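
One idea to mitigate that would be to pre-validate each CSV before loading it. A rough Java
sketch (the class name CsvPreCheck is made up; it takes the CSV path as its argument and only
checks the INT column a):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CsvPreCheck {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            String[] header = in.readLine().split(",");  // first line is the header row
            String line;
            int row = 1;
            while ((line = in.readLine()) != null) {
                row++;
                String[] cells = line.split(",", -1);
                try {
                    // Column 'a' must parse as an INT; report rows that Hive
                    // would otherwise silently read back as NULL.
                    Integer.parseInt(cells[0].trim());
                } catch (NumberFormatException e) {
                    System.err.printf("row %d: '%s' is not a valid INT for column %s%n",
                            row, cells[0], header[0]);
                }
            }
        }
    }
}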

Should I do it from beeline?

Should I write a Pig script?

Should I write a Java program? (rough sketch below)


Should I use an existing tool like https://github.com/enahwe/Csv2Hive?
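
For the Java route, I imagine something like this minimal sketch using the Hive JDBC driver
(the hostname, port, user, and file path are placeholders, and hive-jdbc would need to be on
the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveCsvLoader {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:hive2://localhost:10000/mydbname";  // placeholder host/port/db
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE IF NOT EXISTS testtable (a INT) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
            try {
                // Fails if column b already exists from a previous run;
                // swallowing the error mirrors SET hive.cli.errors.ignore=true.
                stmt.execute("ALTER TABLE testtable ADD COLUMNS (b STRING)");
            } catch (SQLException ignored) {
            }
            // Caveat: against HiveServer2, LOCAL means local to the server,
            // not to the machine running this client.
            stmt.execute("LOAD DATA LOCAL INPATH '/home/pathtodata/testdata.csv' "
                    + "INTO TABLE testtable");
        }
    }
}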


What's the recommended approach here?

