hive-user mailing list archives

From Eugene Koifman <ekoif...@hortonworks.com>
Subject Re: populating Hive table periodically from files on HDFS
Date Mon, 26 Sep 2016 03:50:49 GMT
Have you considered Hive Streaming?  (https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest)
It's built for exactly this use case.
Both Flume and Storm are integrated with it and write directly to your target table.
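If you go down that route, bear in mind that streaming ingest writes into an ACID table, so the target needs to be ORC, bucketed and transactional. A rough sketch of what such a table definition might look like (the column names and bucket count are purely illustrative, not taken from your schema):

CREATE TABLE marketData (
  trade_id   STRING,     -- illustrative columns only; adjust to your feed
  ticker     STRING,
  price      DOUBLE,
  trade_time TIMESTAMP
)
PARTITIONED BY (DateStamp STRING)
CLUSTERED BY (ticker) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');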

Eugene

From: Mich Talebzadeh <mich.talebzadeh@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Sunday, September 25, 2016 at 8:47 AM
To: user <user@hive.apache.org>
Subject: Re: populating Hive table periodically from files on HDFS

Thanks

I agree. I think using INSERT OVERWRITE to repopulate the partition is bulletproof, with nothing left behind. Performance looks good as well.

When partitioning by date, it seems more effective to partition by a single 'YYYY-MM-DD' string rather than a multi-level partition of year, month, day, etc.
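For example, instead of something like

PARTITIONED BY (year INT, month INT, day INT)    -- three nested partition directories per day (column names illustrative)

I kept a single partition column:

PARTITIONED BY (DateStamp STRING)                -- one directory per day, e.g. DateStamp = '2015-09-25'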

I thought about bucketing the partitions, but one needs to balance the housekeeping overhead against the number of buckets within each partition, so I did not bother.

Cheers




Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content
is explicitly disclaimed. The author will in no case be liable for any monetary damages arising
from such loss, damage or destruction.



On 25 September 2016 at 12:19, Jörn Franke <jornfranke@gmail.com> wrote:
I think what you propose makes sense. With a delta load you would not gain much performance (most likely you would lose performance, because you need to figure out what has changed, deal with the typical distributed-systems issue that some changes arrive late, handle errors, etc.), especially given the volumes.
You could partition at a finer grain, but again, given the volumes that is not really needed.

On 25 Sep 2016, at 12:32, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:


Hi,

I have trade data delivered through Kafka and Flume as CSV files on HDFS. There are 100 prices every 2 seconds, so there are 3,000 new rows a minute, 180,000 rows an hour and 4,320,000 new rows a day.

Flume creates a new sub-directory on HDFS every day in the format YYYY-MM-DD, e.g. prices/2015-09-25.

An external Hive table is pointed at the new directory simply by altering its location:

ALTER TABLE ${DATABASE}.externalMarketData SET LOCATION 'hdfs://rhes564:9000/data/prices/${TODAY}';

This means that the external Hive table only points to the current directory.

The target internal table in Hive is partitioned by DateStamp = "YYYY-MM-DD":

PARTITIONED BY (DateStamp  string)

To populate the Hive table, a cron job runs every 15 minutes and simply does:

INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp = "${TODAY}")
SELECT
  ...
FROM ${DATABASE}.externalMarketData

So effectively every 15 minutes today's partition is overwritten by new data from the external
table.
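Spelled out with placeholder column names (the real select list is elided above), the statement is along the lines of:

INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp = "${TODAY}")
SELECT
  trade_id,      -- placeholder columns, not the actual schema
  ticker,
  price,
  trade_time
FROM ${DATABASE}.externalMarketData;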

This seems to be OK.

The other option is to only add the rows that are new since the last run, using INSERT INTO with a filter that excludes rows already present in the target table.
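Assuming there is some unique key per row (a hypothetical trade_id below), that would look roughly like:

INSERT INTO TABLE ${DATABASE}.marketData PARTITION (DateStamp = "${TODAY}")
SELECT e.trade_id, e.ticker, e.price, e.trade_time     -- hypothetical columns
FROM ${DATABASE}.externalMarketData e
LEFT OUTER JOIN
  (SELECT trade_id FROM ${DATABASE}.marketData WHERE DateStamp = "${TODAY}") m
ON e.trade_id = m.trade_id
WHERE m.trade_id IS NULL;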

Any other suggestions?


Thanks



