hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ashok Kumar <ashok34...@yahoo.com>
Subject Re: Immutable data in Hive
Date Mon, 04 Jan 2016 20:06:11 GMT
I second that. Many thanks Mich for your reply.
Regards 

    On Monday, 4 January 2016, 10:58, "Singh, Abhijeet" <absingh@informatica.com> wrote:
 

 #yiv5347372295 #yiv5347372295 -- _filtered #yiv5347372295 {font-family:Calibri;panose-1:2
15 5 2 2 2 4 3 2 4;} _filtered #yiv5347372295 {font-family:Tahoma;panose-1:2 11 6 4 3 5 4
4 2 4;} _filtered #yiv5347372295 {}#yiv5347372295 #yiv5347372295 p.yiv5347372295MsoNormal,
#yiv5347372295 li.yiv5347372295MsoNormal, #yiv5347372295 div.yiv5347372295MsoNormal {margin:0in;margin-bottom:.0001pt;font-size:12.0pt;}#yiv5347372295
a:link, #yiv5347372295 span.yiv5347372295MsoHyperlink {color:blue;text-decoration:underline;}#yiv5347372295
a:visited, #yiv5347372295 span.yiv5347372295MsoHyperlinkFollowed {color:purple;text-decoration:underline;}#yiv5347372295
p.yiv5347372295MsoAcetate, #yiv5347372295 li.yiv5347372295MsoAcetate, #yiv5347372295 div.yiv5347372295MsoAcetate
{margin:0in;margin-bottom:.0001pt;font-size:8.0pt;}#yiv5347372295 p.yiv5347372295MsoListParagraph,
#yiv5347372295 li.yiv5347372295MsoListParagraph, #yiv5347372295 div.yiv5347372295MsoListParagraph
{margin-top:0in;margin-right:0in;margin-bottom:0in;margin-left:.5in;margin-bottom:.0001pt;font-size:11.0pt;}#yiv5347372295
p.yiv5347372295msonormal0, #yiv5347372295 li.yiv5347372295msonormal0, #yiv5347372295 div.yiv5347372295msonormal0
{margin-right:0in;margin-left:0in;font-size:12.0pt;}#yiv5347372295 span.yiv5347372295EmailStyle19
{color:windowtext;font-weight:normal;font-style:normal;text-decoration:none none;}#yiv5347372295
span.yiv5347372295BalloonTextChar {}#yiv5347372295 span.yiv5347372295EmailStyle22 {color:#1F497D;}#yiv5347372295
.yiv5347372295MsoChpDefault {font-size:10.0pt;} _filtered #yiv5347372295 {margin:1.0in 1.0in
1.0in 1.0in;}#yiv5347372295 div.yiv5347372295WordSection1 {}#yiv5347372295 Very well answered
by Mich.    Thanks Mich !!    From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: Sunday, January 03, 2016 8:35 PM
To: user@hive.apache.org; 'Ashok Kumar'
Subject: RE: Immutable data in Hive    Hi Ashok.    I will have a go at this on top of Alan’s
very valuable clarification.    Extraction, Transformation and Load  (ETL) is a very common
method in Data Warehousing (DW)and Business Analytics projects and can be performed by custom
programming like writing shell scripts, JAVA. .NET tools or combination of all to get the
data from internal or external sources and put them in DW.    In general only data of valueends
up in DW. What this mean is that in say in Banking environment you collect and feed (Extract)
data into a staging area (in relational term often staging tables or the so called global
temporary tables that are cleared daily for the next cycle in a staging database), prune it
from unwanted data, do some manipulation (Transformation) (often happens into another set
of staging tables) and finally Load it into target tables in a Data Warehouse. The analysts
then use appropriate tools like Tableau to look at macroscopic trend in the data. Remember
a Data Warehouse is still a relational database most probably a columnar implementation of
relational model like SAP Sybase IQ.       There are many examples of DW repositories used
for Business Intelligence (BI, another fancy term for Analytics)  such as working out global
trading positioning (I did one of these by bolting Oracle TimesTen IMDB to Oracle DW for fast
extraction) or data gathered from algorithmic trading using Complex Event Processing. Obviously
although DW store larger amount of data (large being a relative term) and have impressive
compression like Sybase IQ (every column is stored as an index so it is far more effective
to do columnar compression (all data being the same type as opposed to row compression in
OLTP databases)), they still require additional space, SAN storage and expensive horizontal
scaling (adding another multi-plex requires additional license).    ELT (Extraction, Load
and Transform) is a similar concept used in Big Data World. The fundamental difference being
that it is not just confined to data deemed to be of specific value, meaning you know what
you are looking for in advance. In Hadoop one can store everything from data coming from structured
data (transactional databases) and unstructured data (data coming from internet, excel sheets,
email, logs and others). This means that you can store potentially all data to be exploited
later, Hadoop echo system provides that flexibility by means of horizontal scaling on cheap
commodity disks (AKA JBOD) and lack of licensing restrictions result in reducing Total Cost
of Ownership (TCO) considerably.  In summary you (E))xtract and (L)oad all data as is (don’t
care whether that data is exactly what you want) into HDFS and then you do (T)ransformation
later through Schema on Read (you decide at time of exploration your data needs).    HDFS
is great for storing large amount of data but on top of that you will need all tools like
Hive, Spark, Cassandra and others to explore your data lake.       HTH       Mich Talebzadeh
   Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial
Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
co-author"Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4 Publications
due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly
   http://talebzadehmich.wordpress.com/    NOTE: The information in this email is proprietary
and confidential. This message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this message shall not be
understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees,
unless expressly so stated. It is the responsibility of the recipient to ensure that this
email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees
accept any responsibility.    From: Ashok Kumar [mailto:ashok34668@yahoo.com]
Sent: 03 January 2016 11:03
To: user@hive.apache.org; Ashok Kumar <ashok34668@yahoo.com>
Subject: Re: Immutable data in Hive    Any comments on ELT will be greatly appreciated gurus.
   With warmest greetings    On Wednesday, 30 December 2015, 18:20, Ashok Kumar <ashok34668@yahoo.com>
wrote:    Tank you sir,  very helpful.   Could you also briefly describe from your experience 
the major differences between traditional ETL in DW and ELT in Hive?  Why there is emphasis
to take data from traditional transactional databases into Hive table with the same format and
do the transform in Hive after. Is it because Hive is meant to be efficient in data transformation?
  Regards         On Wednesday, 30 December 2015, 18:00, Alan Gates <alanfgates@gmail.com>
wrote:    Traditionally data in Hive was write once (insert) read many.  You could append
to tables and partitions, add new partitions, etc.  You could remove data by dropping tables
or partitions.  But there was no updates of data or deletes of particular rows.  This was
what was meant by immutable.  Hive was originally done this way because it was based on MapReduce
and HDFS and these were the natural semantics given those underlying systems.

For many use cases (e.g. ETL) this is sufficient, and the vast majority of people still run
Hive this way.

We added transactions and updates and deletes to Hive because some use cases require these
features.  Hive is being used more and more as a data warehouse, and while updates and deletes
are less common there they are still required (slow changing dimensions, fixing wrong data,
deleting records for compliance, etc.)  Also streaming data into warehouses from transactional
systems is a common use case.

Alan.

 
   Ashok Kumar December 29, 2015 at 14:59 Hi,    Can someone please clarify what  "immutable
data" in Hive means?    I have been told that data in Hive is/should be immutable but in
that case why we need transactional tables in Hive that allow updates to data.    thanks
and greetings             
      

  
Mime
View raw message