hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Singh, Abhijeet" <>
Subject RE: Immutable data in Hive
Date Mon, 04 Jan 2016 10:57:52 GMT
Very well answered by Mich.

Thanks Mich !!

From: Mich Talebzadeh []
Sent: Sunday, January 03, 2016 8:35 PM
To:; 'Ashok Kumar'
Subject: RE: Immutable data in Hive

Hi Ashok.

I will have a go at this on top of Alan’s very valuable clarification.

Extraction, Transformation and Load  (ETL) is a very common method in Data Warehousing (DW)
and Business Analytics projects and can be performed by custom programming like writing shell
scripts, JAVA. .NET tools or combination of all to get the data from internal or external
sources and put them in DW.

In general only data of value ends up in DW. What this mean is that in say in Banking environment
you collect and feed (Extract) data into a staging area (in relational term often staging
tables or the so called global temporary tables that are cleared daily for the next cycle
in a staging database), prune it from unwanted data, do some manipulation (Transformation)
(often happens into another set of staging tables) and finally Load it into target tables
in a Data Warehouse. The analysts then use appropriate tools like Tableau to look at macroscopic
trend in the data. Remember a Data Warehouse is still a relational database most probably
a columnar implementation of relational model like SAP Sybase IQ.

There are many examples of DW repositories used for Business Intelligence (BI, another fancy
term for Analytics)  such as working out global trading positioning (I did one of these by
bolting Oracle TimesTen IMDB to Oracle DW for fast extraction) or data gathered from algorithmic
trading using Complex Event Processing. Obviously although DW store larger amount of data
(large being a relative term) and have impressive compression like Sybase IQ (every column
is stored as an index so it is far more effective to do columnar compression (all data being
the same type as opposed to row compression in OLTP databases)), they still require additional
space, SAN storage and expensive horizontal scaling (adding another multi-plex requires additional

ELT (Extraction, Load and Transform) is a similar concept used in Big Data World. The fundamental
difference being that it is not just confined to data deemed to be of specific value, meaning
you know what you are looking for in advance. In Hadoop one can store everything from data
coming from structured data (transactional databases) and unstructured data (data coming from
internet, excel sheets, email, logs and others). This means that you can store potentially
all data to be exploited later, Hadoop echo system provides that flexibility by means of horizontal
scaling on cheap commodity disks (AKA JBOD) and lack of licensing restrictions result in reducing
Total Cost of Ownership (TCO) considerably.  In summary you (E))xtract and (L)oad all data
as is (don’t care whether that data is exactly what you want) into HDFS and then you do
(T)ransformation later through Schema on Read (you decide at time of exploration your data

HDFS is great for storing large amount of data but on top of that you will need all tools
like Hive, Spark, Cassandra and others to explore your data lake.


Mich Talebzadeh

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly<>

NOTE: The information in this email is proprietary and confidential. This message is for the
designated recipient only, if you are not the intended recipient, you should destroy it immediately.
Any information in this message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility
of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd,
its subsidiaries nor their employees accept any responsibility.

From: Ashok Kumar []
Sent: 03 January 2016 11:03
To:<>; Ashok Kumar <<>>
Subject: Re: Immutable data in Hive

Any comments on ELT will be greatly appreciated gurus.

With warmest greetings

On Wednesday, 30 December 2015, 18:20, Ashok Kumar <<>>

Tank you sir,  very helpful.

Could you also briefly describe from your experience  the major differences between traditional
ETL in DW and ELT in Hive?  Why there is emphasis to take data from traditional transactional
databases into Hive table with the same format and do the transform in Hive after. Is it because
Hive is meant to be efficient in data transformation?


On Wednesday, 30 December 2015, 18:00, Alan Gates <<>>

Traditionally data in Hive was write once (insert) read many.  You could append to tables
and partitions, add new partitions, etc.  You could remove data by dropping tables or partitions.
 But there was no updates of data or deletes of particular rows.  This was what was meant
by immutable.  Hive was originally done this way because it was based on MapReduce and HDFS
and these were the natural semantics given those underlying systems.

For many use cases (e.g. ETL) this is sufficient, and the vast majority of people still run
Hive this way.

We added transactions and updates and deletes to Hive because some use cases require these
features.  Hive is being used more and more as a data warehouse, and while updates and deletes
are less common there they are still required (slow changing dimensions, fixing wrong data,
deleting records for compliance, etc.)  Also streaming data into warehouses from transactional
systems is a common use case.


Ashok Kumar<>
December 29, 2015 at 14:59

Can someone please clarify what  "immutable data" in Hive means?

I have been told that data in Hive is/should be immutable but in that case why we need transactional
tables in Hive that allow updates to data.

thanks and greetings

View raw message