hive-user mailing list archives

From Ryan Greenhall <>
Subject Recommended approaches for large data import from RDBMS into hive (Approx Terabyte of data)
Date Wed, 23 Mar 2011 10:29:37 GMT
I would like to run a query against the existing RDBMS (SQL Server) to
de-normalise the required data into single rows that can then be imported
into a Hive table partitioned by date.
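To make that concrete, the query I have in mind is roughly the following
(the table and column names are purely illustrative; the real schema is
more involved):

-- Illustrative only: produce one flat row per event, carrying the
-- date string I intend to partition on in Hive.
SELECT
    CONVERT(VARCHAR(10), e.event_time, 120) AS event_date,  -- 'YYYY-MM-DD'
    e.event_id,
    u.user_name,
    p.page_url
FROM dbo.events e
JOIN dbo.users  u ON u.user_id = e.user_id
JOIN dbo.pages  p ON p.page_id = e.page_id;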

Previously, on a much smaller scale, I have imported data into Hive by
copying a TSV of my data (already partitioned) into HDFS and then loading
it with a statement such as:

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view

I have considered the following options and am looking for feedback or more
effective solutions.

1) Create a data file for each day (we have over a year's worth of data) and
then import as above (a sketch of the per-day load is given below).  On a
large scale this is likely to be rather cumbersome, although it could be
scripted.

2) Use Sqoop to import all the data into a staging table and then use a
dynamic partition insert (sketched below).  Unfortunately, Sqoop does not
allow (as far as I can see) importing directly into partitions.

I am not sure how this will work over such a large dataset, as I have seen
examples where a staging table with 900,000 rows of data fails when a
dynamic partition insert is applied.

Thanks in advance

