hive-user mailing list archives

From Joel Victor <joelsvic...@gmail.com>
Subject Re: Concurrency support of Apache Hive for streaming data ingest at 7K RPS into multiple tables
Date Wed, 24 Aug 2016 14:38:51 GMT
Thanks, Kit! We are trying the HDFS bolt with external tables now.

On Wed, Aug 24, 2016 at 5:25 PM, Kit Menke <kitmenke@gmail.com> wrote:

> Joel,
> Another option which you have is to use the Storm HDFS bolt to stream data
> into Hive external tables. The external tables then get loaded into ORC
> history tables for long-term storage. We use this in an HDP cluster with
> similar load so I know it works. :)
>
> I'm with Jörn on this one. My impression is that Hive transactions are
> a new feature that is not totally ready for production.
> Thanks,
> Kit
>
> On Aug 24, 2016 3:07 AM, "Joel Victor" <joelsvictor@gmail.com> wrote:
>
>> @Jörn: If I understood correctly, even later versions of Hive won't be
>> able to handle these kinds of workloads?
>>
>> On Wed, Aug 24, 2016 at 1:26 PM, Jörn Franke <jornfranke@gmail.com>
>> wrote:
>>
>>> I think Hive, especially these old versions, was not designed for
>>> this. Why not store them in HBase and run an Oozie job regularly that puts
>>> them all into Hive as ORC or Parquet in a bulk job?
>>>
>>> On 24 Aug 2016, at 09:35, Joel Victor <joelsvictor@gmail.com> wrote:
>>>
>>> Currently I am using Apache Hive 0.14, which ships with HDP 2.2. We are
>>> trying to perform streaming ingestion with it.
>>> We are using the Storm Hive bolt and we have 7 tables into which we are
>>> trying to insert. The RPS (requests per second) of our bolts ranges from
>>> 5000 to 7000 and our commit policies are configured accordingly, i.e. 100k
>>> events or 15 seconds.
>>>
>>> We see many commitTxn exceptions due to serialization
>>> errors in the metastore (we are using PostgreSQL 9.5 as the metastore).
>>> The serialization errors cause the topology to lag in
>>> terms of events processed, because it tries to reprocess the batches that have
>>> failed.
>>>
>>> I have already backported HIVE-10500
>>> <https://issues.apache.org/jira/browse/HIVE-10500> to 0.14, but there
>>> isn't much improvement.
>>> I went through most of the JIRAs about transactions and found
>>> HIVE-11948 <https://issues.apache.org/jira/browse/HIVE-11948>
>>> and HIVE-13013 <https://issues.apache.org/jira/browse/HIVE-13013>. I
>>> would like to backport them to 0.14.
>>> Going through the patches gives me the impression that I mostly need to
>>> update the queries and transaction levels.
>>> Do these patches also require me to update the schema in the metastore?
>>> Please also let me know if there are any other patches that I missed.
>>>
>>> I would also like to know whether Apache Hive 1.2.1 or later can handle
>>> concurrent inserts to the same or different tables from multiple clients
>>> without many serialization errors in the Hive metastore.
>>>
>>> -Joel
>>>
>>>
>>
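
The commit policy mentioned in the original message (commit after 100k events or 15 seconds, whichever comes first) can be illustrated with a minimal sketch. This is plain Python written for illustration only: `BatchCommitPolicy` and its method names are hypothetical and are not the Storm Hive bolt API, and the injectable `clock` parameter is an assumption added to make the logic easy to check.

```python
import time


class BatchCommitPolicy:
    """Hypothetical count-or-time commit policy (NOT the Storm Hive bolt API)."""

    def __init__(self, max_events=100_000, max_age_seconds=15.0, clock=time.monotonic):
        self.max_events = max_events
        self.max_age_seconds = max_age_seconds
        self.clock = clock          # injectable so the policy can be exercised without sleeping
        self.pending = 0            # events accumulated in the open batch
        self.batch_started = None   # clock reading when the open batch got its first event

    def record_event(self):
        """Count one event and report whether the open batch should now be committed."""
        if self.batch_started is None:
            self.batch_started = self.clock()
        self.pending += 1
        return self.should_commit()

    def should_commit(self):
        """True once either threshold (event count or batch age) has been reached."""
        if self.pending == 0:
            return False
        age = self.clock() - self.batch_started
        return self.pending >= self.max_events or age >= self.max_age_seconds

    def reset(self):
        """Call after a successful commit to start a fresh batch."""
        self.pending = 0
        self.batch_started = None
```

If a single bolt really receives 7000 events per second, the 100k-event threshold is reached after roughly 14.3 seconds, just before the 15-second limit; at 5000 events per second it would take 20 seconds, so the time limit fires first. Under those assumptions each bolt commits about once every 14 to 15 seconds per table.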
