hadoop-common-user mailing list archives

From Mirko Kämpf <mirko.kae...@gmail.com>
Subject Re: Add few record(s) to a Hive table or a HDFS file on a daily basis
Date Mon, 10 Feb 2014 10:44:42 GMT
Hi Raj,

there is no way to add new data to an existing file in HDFS as long as the
append functionality is not available. Adding new "records" to a Hive table
therefore means creating a new file that contains those records. You do this
in the "staging" table, which can become inefficient for large data sets,
especially if you run MR jobs on it: with one new file per day, after two
years you will have more than 700 small files.
To get all records into one file, you run an aggregation step with the
select statement you mentioned. The "select *" reads all the small files and,
depending on the number of reducers running (it should be only one in this
case), writes a single file containing all records for the "finaltable". The
same could be done with an MR job that uses the identity mapper and the
identity reducer, with the number of reducers set to 1.
Populating the staging table simply means adding the new file with the new
records each day to the HDFS folder that contains the table data.
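
As a rough sketch of the two steps described above, the daily load and the
later consolidation could look like this in HiveQL. The path
/incoming/transactions_2014-02-10.txt is a hypothetical example, and the
sketch assumes that the staging and finaltable tables already exist with the
same columns. Whether the result really ends up in a single file depends on
how Hive plans the query and on the number of reducers, so it is worth
checking the table directory afterwards.

```sql
-- Daily: move the new file (hypothetical path) into the staging table's
-- HDFS folder. Files already in the table are kept.
LOAD DATA INPATH '/incoming/transactions_2014-02-10.txt' INTO TABLE staging;

-- Occasionally: ask for a single reducer, then rewrite all staged records
-- into finaltable (the statement quoted from the original question).
SET mapred.reduce.tasks=1;
INSERT OVERWRITE TABLE finaltable SELECT * FROM staging;
```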

Best wishes

2014-02-10 3:45 GMT+01:00 Raj Hadoop <hadoopraj@yahoo.com>:

> Hi,
> My requirement is a typical data warehouse and ETL requirement. I need to
> accomplish the following:
> 1) Insert transaction records daily into a Hive table or an HDFS file. This
> table or file is not big (approximately 10 records per day). I
> don't want to partition the table / file.
> I have been reading a few articles on this. They mention that we need
> to load into a staging table in Hive and then insert as shown below:
>  insert overwrite table finaltable select * from staging;
>  I don't understand this logic. How should I populate the staging table
> daily?
>  Thanks,
>  Raj
