hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jörn Franke <jornfra...@gmail.com>
Subject Re: populating Hive table periodically from files on HDFS
Date Sun, 25 Sep 2016 11:19:16 GMT
I think what you propose makes sense. If you would do a delta load you gain not much performance
benefits (most likely you will have less performance because you need to figure out what has
changed, have the typical issues of distributed systems that some changes may arrive later,
error handling etc). Especially given the volumes.
You may partition smaller, but yes given the volumes not really needed. 

> On 25 Sep 2016, at 12:32, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> 
> 
> Hi,
> 
> I have trade data delivered through kafka and flume as csv files to HDFS. There are 100
prices every 2 seconds so in a minute there are 3000 new rows, 18K rows an hour and in a day
4,320,000 new rows.
> 
> Flume creates a new sub directory partition ever day in the format YYYY-MM-DD like prices/2015-09-25
on HDFS
> 
> There is an external Hive table pointing to new directory by simply altering external
table location
> 
> ALTER TABLE ${DATABASE}.externalMarketData set location 'hdfs://rhes564:9000/data/prices/${TODAY}';
> 
> This means that the external Hive table only points to the current directory.
> 
> The target internal table in Hive is partitioned by  DateStamp ="YYYY-MM-DD"
> 
> PARTITIONED BY (DateStamp  string)
> 
> to populate the Hive table a cron job runs every 15 minutes and does simply
> 
> INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp = "${TODAY}")
> SELECT
> '''''''''''''''''''''''''
> )
> FROM ${DATABASE}.externalMarketData
> 
> So effectively every 15 minutes today's partition is overwritten by new data from the
external table.
> 
> This seems to be OK. 
> 
> The other option is only add new rows since last time with INSERT INTO WHERE rows do
not exist in target table.
> 
> Any other suggestions?
> 
> 
> Thanks
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage
or destruction of data or any other property which may arise from relying on this email's
technical content is explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.
>  

Mime
View raw message