hive-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Concurrency support of Apache Hive for streaming data ingest at 7K RPS into multiple tables
Date Wed, 24 Aug 2016 21:27:13 GMT
This is also a good option.

With respect to Hive transactional tables: I do not think they have been designed for massive
numbers of single-item inserts. On the other hand, you would not insert a lot of events using
single inserts into a relational database either. The same restrictions apply; it is not the
use case you want to implement.
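
A minimal illustration of that contrast, assuming a Hive JDBC connection; the HiveServer2 URL, table, and column names below are made-up placeholders:

    // Hedged illustration of the two insert patterns, via Hive JDBC.
    // Host, table, and column names are hypothetical.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class InsertPatterns {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hiveserver2:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {
                // The pattern Hive ACID is not designed for: one tiny
                // transaction per event, repeated thousands of times a second.
                stmt.execute("INSERT INTO events VALUES (1, 'click')");
                // The intended pattern: one set-based statement over many rows.
                stmt.execute("INSERT INTO TABLE events "
                        + "SELECT id, payload FROM events_staging");
            }
        }
    }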


> On 24 Aug 2016, at 13:55, Kit Menke <kitmenke@gmail.com> wrote:
> 
> Joel,
> Another option you have is to use the Storm HDFS bolt to stream data into Hive external
tables. The external tables then get loaded into ORC history tables for long-term storage.
We use this in an HDP cluster with a similar load, so I know it works. :)
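> 
> A rough sketch of that load step, assuming Hive JDBC; the staging path, table names, and schema below are made-up placeholders:
> 
>     // Hedged sketch of the external-table-to-ORC pattern described above.
>     // Paths, table, and column names are hypothetical.
>     import java.sql.Connection;
>     import java.sql.DriverManager;
>     import java.sql.Statement;
> 
>     public class OrcHistoryLoad {
>         public static void main(String[] args) throws Exception {
>             try (Connection conn = DriverManager.getConnection(
>                          "jdbc:hive2://hiveserver2:10000/default", "user", "");
>                  Statement stmt = conn.createStatement()) {
>                 // External table over the directory the Storm HDFS bolt writes to.
>                 stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS events_staging "
>                         + "(id BIGINT, payload STRING) "
>                         + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
>                         + "LOCATION '/data/staging/events'");
>                 // ORC history table for long-term storage.
>                 stmt.execute("CREATE TABLE IF NOT EXISTS events_history "
>                         + "(id BIGINT, payload STRING) STORED AS ORC");
>                 // Periodic bulk move; no Hive transactions involved.
>                 stmt.execute("INSERT INTO TABLE events_history "
>                         + "SELECT id, payload FROM events_staging");
>             }
>         }
>     }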
> 
> I'm with Jörn on this one. My impression of Hive transactions is that they are a new feature
not totally ready for production.
> Thanks,
> Kit
> 
> 
>> On Aug 24, 2016 3:07 AM, "Joel Victor" <joelsvictor@gmail.com> wrote:
>> @Jörn: If I understood correctly, even later versions of Hive won't be able to handle
these kinds of workloads?
>> 
>>> On Wed, Aug 24, 2016 at 1:26 PM, Jörn Franke <jornfranke@gmail.com> wrote:
>>> I think Hive, especially these old versions, has not been designed for this. Why
not store the events in HBase and run an Oozie job regularly that puts them all into Hive/ORC
or Parquet in a bulk job?
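>>> 
>>> A hedged sketch of that bulk step (the Hive-on-HBase mapping and all names are illustrative; an Oozie coordinator would simply run it on a schedule):
>>> 
>>>     // Map the HBase table into Hive, then bulk-copy it to ORC.
>>>     // Table, column family, and column names are hypothetical.
>>>     import java.sql.Connection;
>>>     import java.sql.DriverManager;
>>>     import java.sql.Statement;
>>> 
>>>     public class HBaseToOrcJob {
>>>         public static void main(String[] args) throws Exception {
>>>             try (Connection conn = DriverManager.getConnection(
>>>                          "jdbc:hive2://hiveserver2:10000/default", "user", "");
>>>                  Statement stmt = conn.createStatement()) {
>>>                 // Hive view of the HBase table the events land in.
>>>                 stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS events_hbase "
>>>                         + "(rowkey STRING, payload STRING) "
>>>                         + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
>>>                         + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:payload') "
>>>                         + "TBLPROPERTIES ('hbase.table.name' = 'events')");
>>>                 // One bulk job into ORC instead of thousands of single inserts.
>>>                 stmt.execute("INSERT INTO TABLE events_orc "
>>>                         + "SELECT rowkey, payload FROM events_hbase");
>>>             }
>>>         }
>>>     }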
>>> 
>>>> On 24 Aug 2016, at 09:35, Joel Victor <joelsvictor@gmail.com> wrote:
>>>> 
>>>> Currently I am using Apache Hive 0.14, which ships with HDP 2.2. We are trying to
perform streaming ingestion with it.
>>>> We are using the Storm Hive bolt and we have 7 tables into which we are trying
to insert. The RPS (requests per second) of our bolts ranges from 5000 to 7000, and our commit
policies are configured accordingly, i.e. 100k events or 15 seconds.
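>>>> 
>>>> For context, the commit policy is set on the bolt; a minimal sketch, assuming the storm-hive HiveOptions API with Storm 1.x package names (the metastore URI, table, and field names are placeholders, and the exact time-based flush knob varies by storm-hive version):
>>>> 
>>>>     // Hedged sketch of a HiveBolt with a 100k-event commit policy.
>>>>     // URI, database, table, and column names are made up.
>>>>     import org.apache.storm.hive.bolt.HiveBolt;
>>>>     import org.apache.storm.hive.bolt.mapper.DelimitedRecordHiveMapper;
>>>>     import org.apache.storm.hive.common.HiveOptions;
>>>>     import org.apache.storm.tuple.Fields;
>>>> 
>>>>     public class HiveBoltConfig {
>>>>         public static HiveBolt build() {
>>>>             DelimitedRecordHiveMapper mapper = new DelimitedRecordHiveMapper()
>>>>                     .withColumnFields(new Fields("id", "payload"));
>>>>             HiveOptions options = new HiveOptions(
>>>>                     "thrift://metastore:9083", "default", "events", mapper)
>>>>                     .withTxnsPerBatch(10)     // Hive transactions per batch
>>>>                     .withBatchSize(100000)    // commit after 100k events
>>>>                     .withIdleTimeout(15);     // close idle writers after 15s
>>>>             return new HiveBolt(options);
>>>>         }
>>>>     }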
>>>> 
>>>> We see many commitTxn exceptions due to serialization errors in the metastore
(we are using PostgreSQL 9.5 as the metastore).
>>>> The serialization errors cause the topology to start lagging in terms of events
processed, as it retries the batches that have failed.
>>>> 
>>>> I have already backported HIVE-10500 to 0.14, and there isn't much improvement.
>>>> I went through most of the JIRAs about transactions and found the following:
HIVE-11948 and HIVE-13013. I would like to backport them to 0.14.
>>>> Going through the patches gives me the impression that I mostly need to update
the queries and transaction isolation levels.
>>>> Do these patches also require me to update the schema in the metastore? Please
also let me know if there are any other patches that I missed.
>>>> 
>>>> I would also like to know whether Apache Hive 1.2.1 or later can handle concurrent
inserts into the same or different tables from multiple clients without many serialization
errors in the Hive metastore.
>>>> 
>>>> -Joel
